public inbox for linux-kernel@vger.kernel.org
From: Dinakar Guniguntala <dino@in.ibm.com>
To: david singleton <dsingleton@mvista.com>
Cc: Ingo Molnar <mingo@elte.hu>,
	linux-kernel@vger.kernel.org, robustmutexes@lists.osdl.org
Subject: Re: Recursion bug in -rt
Date: Sat, 17 Dec 2005 00:12:09 +0530
Message-ID: <20051216184209.GD4732@in.ibm.com>
In-Reply-To: <43F8915C-6DC7-11DA-A45A-000A959BB91E@mvista.com>

On Thu, Dec 15, 2005 at 04:02:36PM -0800, david singleton wrote:
> Dinakar,
> 
> 	I believe the problem we have is that the library is not checking
> to see if the mutex is a recursive mutex, and then checking to see
> if the recursive mutex is already owned by the calling thread.  If a
> recursive mutex is owned by the calling thread, the library should
> increment the lock count (POSIX says recursive mutexes must be
> unlocked as many times as they are locked) and return 0 (success) to
> the caller.

Sorry, I seem to have confused you. I am _not_ talking about recursive
mutexes here.

I have two testcases. Here is a code snippet of testcase I

        void* test_thread (void* arg)
        {
            pthread_mutex_lock(&child_mutex);
            printf("test_thread got lock\n");

            pthread_mutex_lock(&child_mutex);
            printf("test_thread got lock 2nd time !!\n");

            printf("test_thread exiting\n");
            return NULL;
        }

        int main(void)
        {
            ...

            if (pthread_mutexattr_init(&mutexattr) != 0) {
              printf("Failed to init mutexattr\n");
            }
            if (pthread_mutexattr_setprotocol(&mutexattr,
                                        PTHREAD_PRIO_INHERIT) != 0) {
              printf("Can't set protocol prio inherit\n");
            }
            if (pthread_mutexattr_getprotocol(&mutexattr, &protocol) != 0) {
              printf("Can't get mutexattr protocol\n");
            } else {
              printf("protocol in mutexattr is %d\n", protocol);
            }
            if ((retc = pthread_mutex_init(&child_mutex, &mutexattr)) != 0) {
              printf("Failed to init mutex: %d\n", retc);
            }

            ...
        }


Clearly what the application is doing here is wrong. However,

1. In the case of normal (non-robust), non-recursive mutexes, the
behaviour on the second pthread_mutex_lock call is that glibc makes
a futex_wait call, which blocks forever.
(Which is the right behaviour.)

2. In the case of a robust/PI non-recursive mutex, the current
behaviour is that glibc makes a futex_wait_robust call (which is right).
In the kernel (2.6.15-rc5-rt1) the rt_mutex is at this point unowned,
and since we do not call down_try_futex if current == owner_task, we
end up grabbing the lock in __down_interruptible and returning
successfully!

3. Adding the check below in down_futex is also wrong:

   if (!owner_task || owner_task == current) {
        up(sem);
        up_read(&current->mm->mmap_sem);
        return -EAGAIN;
   }

   This will result in glibc calling the kernel continuously in a
   loop, and we will end up context switching to death.

I guess we need to clean up this path to ensure that the application
blocks forever.
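A hedged sketch of the kind of cleanup I mean, in pseudocode against
the names used above (not a tested patch): instead of returning
-EAGAIN or granting the lock, the self-relock case should simply be
queued so the caller blocks, matching the futex_wait behaviour of
ordinary mutexes.

```
down_futex(...):
    owner_task = owner looked up from the futex word
    if owner_task == current:
        /* self-relock of a non-recursive mutex: do not return
         * -EAGAIN and do not let __down_interruptible grant the
         * lock; enqueue and sleep, like futex_wait does */
        block current on the lock's wait queue   /* never returns */
    if owner_task != NULL:
        down_try_futex(lock, owner_task->thread_info, __EIP__)
    __down_interruptible(...)
```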

I also have testcase II, which does not do anything illegal like the
one above; instead it exercises the PI-boosting code path in the
kernel, and that is where I see the system hang yet again. This is
the race that I am currently investigating.

Hope that clears things up a bit.

	-Dinakar


> 
>      I don't see anywhere in the library where the check for a recursive
> mutex is being made.  The mutex_trylock code just does a compare
> and exchange on the lock to see if it can get it.  A recursive mutex
> will fail this test and erroneously enter the kernel.
> 
>     I believe we need the down_futex change and a patch
> to glibc in which recursive mutexes are handled in the library.
> 
>     I'm talking to the library people now to see if that's the way
> it's supposed to work for recursive mutexes.  If it is we'll have
> to provide a new glibc patch along with the kernel patch.
> 
>      I think I'll have to provide a new glibc patch anyways to
> fix the INLINE_SYSCALL problem.
> 
>      Does this make sense?
> 
> David
> 
> On Dec 15, 2005, at 11:44 AM, Dinakar Guniguntala wrote:
> 
> >On Wed, Dec 14, 2005 at 05:03:13PM -0800, david singleton wrote:
> >>Dinakar,
> >>you may be correct, since reverting the change in down_futex seems
> >>to work.
> >
> >Well it did not. It ran for much longer than previously but still
> >hung up.
> >
> >Well we have a very basic problem in the current implementation.
> >
> >Currently if a thread calls pthread_mutex_lock on the same lock
> >(normal, non recursive lock) twice without unlocking in between, the
> >application hangs. Which is the right behaviour.
> >However if the same thing is done with a non recursive robust mutex,
> >it manages to acquire the lock.
> >
> >I see many problems here (I am assuming that the right behaviour
> >with robust mutexes is for application to ultimately block
> >indefinitely in the kernel)
> >
> >1. In down_futex we do the following check
> >
> >	if (owner_task != current)
> >		down_try_futex(lock, owner_task->thread_info __EIP__);
> >
> >   In the above scenario, the thread would have acquired the 
> >uncontended
> >   lock first time around in userspace. The second time it tries to
> >   acquire the same mutex, because of the above check, does not
> >   call down_try_futex and hence will not initialize the lock structure
> >   in the kernel. It then goes to __down_interruptible where it is
> >   granted the lock, which is wrong.
> >
> >   So IMO the above check is not right. However removing this check
> >   is not the end of story.  This time it gets to task_blocks_on_lock
> >   and tries to grab the task->pi_lock of the owner, which is itself,
> >   and results in a system hang. (Assuming CONFIG_DEBUG_DEADLOCKS
> >   is not set). So it looks like we need to add some check to
> >   prevent this below in case lock_owner happens to be current.
> >
> >    _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> >
> >
> >>However, I'm  wondering if you've hit a bug that Dave Carlson is
> >>reporting that he's tracked down to an inline in the glibc patch.
> >
> >Yes, I noticed that. Basically it looks like INLINE_SYSCALL, on
> >error, returns -1 with the error code in errno, whereas we expect
> >the return code to be the same as the kernel return code. Are you
> >referring to this or something else?
> >
> >However even with all of the above fixes (remove the check in 
> >down_futex
> >as mentioned above, add a check in task_blocks_on_lock and the glibc
> >changes), my test program continues to hang up the system, though it
> >takes a lot longer to recreate the problem now.
> >
> >[snip]
> >
> >>1) Why did the library call into the kernel if the calling thread
> >>owned the lock?
> >
> >This is something I still haven't figured out and leads me to believe
> >that we still have a tiny race somewhere.
> >
> >	-Dinakar
> 
