Re: [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Xunlei Pang <xpang@redhat.com>
To: Juri Lelli <juri.lelli@arm.com>, Peter Zijlstra <peterz@infradead.org>
Cc: mingo@kernel.org, tglx@linutronix.de, rostedt@goodmis.org,
	xlpang@redhat.com, linux-kernel@vger.kernel.org,
	mathieu.desnoyers@efficios.com, jdesfossez@efficios.com,
	bristot@redhat.com, Ingo Molnar <mingo@redhat.com>
Subject: Re: [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
Date: Tue, 14 Jun 2016 20:53:44 +0800	[thread overview]
Message-ID: <575FFE58.1090102@redhat.com> (raw)
In-Reply-To: <20160614102109.GF5981@e106622-lin>

[-- Attachment #1: Type: text/plain, Size: 9770 bytes --]

On 2016/06/14 at 18:21, Juri Lelli wrote:
> Hi,
>
> On 07/06/16 21:56, Peter Zijlstra wrote:
>> From: Xunlei Pang <xlpang@redhat.com>
>>
>> A crash happened while I was playing with deadline PI rtmutex.
>>
>>     BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>>     IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
>>     PGD 232a75067 PUD 230947067 PMD 0
>>     Oops: 0000 [#1] SMP
>>     CPU: 1 PID: 10994 Comm: a.out Not tainted
>>
>>     Call Trace:
>>     [<ffffffff810b658c>] enqueue_task+0x2c/0x80
>>     [<ffffffff810ba763>] activate_task+0x23/0x30
>>     [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
>>     [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
>>     [<ffffffff8164e783>] __schedule+0xd3/0x900
>>     [<ffffffff8164efd9>] schedule+0x29/0x70
>>     [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
>>     [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
>>     [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
>>     [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
>>     [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
>>     [<ffffffff810edd50>] SyS_futex+0x80/0x180
>>
> This seems to be caused by the race condition you detail below between
> load balancing and PI code. I tried to reproduce the BUG on my box, but
> it looks hard to get. Do you have a reproducer I can give a try?
>
>> This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
>> are only protected by pi_lock when operating pi waiters, while
>> rt_mutex_get_top_task(), will access them with rq lock held but
>> not holding pi_lock.
>>
>> In order to tackle it, we introduce new "pi_top_task" pointer
>> cached in task_struct, and add new rt_mutex_update_top_task()
>> to update its value, it can be called by rt_mutex_setprio()
>> which held both owner's pi_lock and rq lock. Thus "pi_top_task"
>> can be safely accessed by enqueue_task_dl() under rq lock.
>>
>> [XXX this next section is unparsable]
> Yes, a bit hard to understand. However, am I correct in assuming this
> patch and the previous one should fix this problem? Or are there still
> other races causing issues?

Yes, these two patches can fix the problem.

>
>> One problem is when rt_mutex_adjust_prio()->...->rt_mutex_setprio(),
>> at that time rtmutex lock was released and owner was marked off,
>> this can cause "pi_top_task" dereferenced to be a running one(as it
>> can be falsely woken up by others before rt_mutex_setprio() is
>> made to update "pi_top_task"). We solve this by directly calling
>> __rt_mutex_adjust_prio() in mark_wakeup_next_waiter() which held
>> pi_lock and rtmutex lock, and remove rt_mutex_adjust_prio(). Since
>> now we moved the deboost point, in order to avoid current to be
>> preempted due to deboost earlier before wake_up_q(), we also moved
>> preempt_disable() before unlocking rtmutex.
>>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Juri Lelli <juri.lelli@arm.com>
>> Originally-From: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Xunlei Pang <xlpang@redhat.com>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Link: http://lkml.kernel.org/r/1461659449-19497-2-git-send-email-xlpang@redhat.com
> The idea of this fix makes sense to me. But, I would like to be able to
> see the BUG and test the fix. What I have is a test in which I create N
> DEADLINE workers that share a PI mutex. They get migrated around and
> seem to stress PI code. But I couldn't hit the BUG yet. Maybe I let it
> run for some more time.

You can use this reproducer attached(gcc crash_deadline_pi.c -lpthread -lrt ).
Start multiple instances, then it will hit the bug very soon.

Regards,
Xunlei

>
> Best,
>
> - Juri
>
>> ---
>>
>>  include/linux/init_task.h |    1 
>>  include/linux/sched.h     |    2 +
>>  include/linux/sched/rt.h  |    1 
>>  kernel/fork.c             |    1 
>>  kernel/locking/rtmutex.c  |   65 +++++++++++++++++++---------------------------
>>  kernel/sched/core.c       |    2 +
>>  6 files changed, 34 insertions(+), 38 deletions(-)
>>
>> --- a/include/linux/init_task.h
>> +++ b/include/linux/init_task.h
>> @@ -162,6 +162,7 @@ extern struct task_group root_task_group
>>  #ifdef CONFIG_RT_MUTEXES
>>  # define INIT_RT_MUTEXES(tsk)						\
>>  	.pi_waiters = RB_ROOT,						\
>> +	.pi_top_task = NULL,						\
>>  	.pi_waiters_leftmost = NULL,
>>  #else
>>  # define INIT_RT_MUTEXES(tsk)
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1681,6 +1681,8 @@ struct task_struct {
>>  	/* PI waiters blocked on a rt_mutex held by this task */
>>  	struct rb_root pi_waiters;
>>  	struct rb_node *pi_waiters_leftmost;
>> +	/* Updated under owner's pi_lock and rq lock */
>> +	struct task_struct *pi_top_task;
>>  	/* Deadlock detection and priority inheritance handling */
>>  	struct rt_mutex_waiter *pi_blocked_on;
>>  #endif
>> --- a/include/linux/sched/rt.h
>> +++ b/include/linux/sched/rt.h
>> @@ -19,6 +19,7 @@ static inline int rt_task(struct task_st
>>  extern int rt_mutex_getprio(struct task_struct *p);
>>  extern void rt_mutex_setprio(struct task_struct *p, int prio);
>>  extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
>> +extern void rt_mutex_update_top_task(struct task_struct *p);
>>  extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
>>  extern void rt_mutex_adjust_pi(struct task_struct *p);
>>  static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1219,6 +1219,7 @@ static void rt_mutex_init_task(struct ta
>>  #ifdef CONFIG_RT_MUTEXES
>>  	p->pi_waiters = RB_ROOT;
>>  	p->pi_waiters_leftmost = NULL;
>> +	p->pi_top_task = NULL;
>>  	p->pi_blocked_on = NULL;
>>  #endif
>>  }
>> --- a/kernel/locking/rtmutex.c
>> +++ b/kernel/locking/rtmutex.c
>> @@ -256,6 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
>>  	RB_CLEAR_NODE(&waiter->pi_tree_entry);
>>  }
>>  
>> +void rt_mutex_update_top_task(struct task_struct *p)
>> +{
>> +	if (!task_has_pi_waiters(p)) {
>> +		p->pi_top_task = NULL;
>> +		return;
>> +	}
>> +
>> +	p->pi_top_task = task_top_pi_waiter(p)->task;
>> +}
>> +
>>  /*
>>   * Calculate task priority from the waiter tree priority
>>   *
>> @@ -273,10 +283,7 @@ int rt_mutex_getprio(struct task_struct
>>  
>>  struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
>>  {
>> -	if (likely(!task_has_pi_waiters(task)))
>> -		return NULL;
>> -
>> -	return task_top_pi_waiter(task)->task;
>> +	return task->pi_top_task;
>>  }
>>  
>>  /*
>> @@ -285,12 +292,12 @@ struct task_struct *rt_mutex_get_top_tas
>>   */
>>  int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
>>  {
>> -	if (!task_has_pi_waiters(task))
>> +	struct task_struct *top_task = rt_mutex_get_top_task(task);
>> +
>> +	if (!top_task)
>>  		return newprio;
>>  
>> -	if (task_top_pi_waiter(task)->task->prio <= newprio)
>> -		return task_top_pi_waiter(task)->task->prio;
>> -	return newprio;
>> +	return min(top_task->prio, newprio);
>>  }
>>  
>>  /*
>> @@ -307,24 +314,6 @@ static void __rt_mutex_adjust_prio(struc
>>  }
>>  
>>  /*
>> - * Adjust task priority (undo boosting). Called from the exit path of
>> - * rt_mutex_slowunlock() and rt_mutex_slowlock().
>> - *
>> - * (Note: We do this outside of the protection of lock->wait_lock to
>> - * allow the lock to be taken while or before we readjust the priority
>> - * of task. We do not use the spin_xx_mutex() variants here as we are
>> - * outside of the debug path.)
>> - */
>> -void rt_mutex_adjust_prio(struct task_struct *task)
>> -{
>> -	unsigned long flags;
>> -
>> -	raw_spin_lock_irqsave(&task->pi_lock, flags);
>> -	__rt_mutex_adjust_prio(task);
>> -	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
>> -}
>> -
>> -/*
>>   * Deadlock detection is conditional:
>>   *
>>   * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
>> @@ -987,6 +976,7 @@ static void mark_wakeup_next_waiter(stru
>>  	 * lock->wait_lock.
>>  	 */
>>  	rt_mutex_dequeue_pi(current, waiter);
>> +	__rt_mutex_adjust_prio(current);
>>  
>>  	/*
>>  	 * As we are waking up the top waiter, and the waiter stays
>> @@ -1325,6 +1315,16 @@ static bool __sched rt_mutex_slowunlock(
>>  	 */
>>  	mark_wakeup_next_waiter(wake_q, lock);
>>  
>> +	/*
>> +	 * We should deboost before waking the top waiter task such that
>> +	 * we don't run two tasks with the 'same' priority. This however
>> +	 * can lead to prio-inversion if we would get preempted after
>> +	 * the deboost but before waking our high-prio task, hence the
>> +	 * preempt_disable before unlock. Pairs with preempt_enable() in
>> +	 * rt_mutex_postunlock();
>> +	 */
>> +	preempt_disable();
>> +
>>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>>  
>>  	/* check PI boosting */
>> @@ -1400,20 +1400,9 @@ rt_mutex_fastunlock(struct rt_mutex *loc
>>   */
>>  void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
>>  {
>> -	/*
>> -	 * We should deboost before waking the top waiter task such that
>> -	 * we don't run two tasks with the 'same' priority. This however
>> -	 * can lead to prio-inversion if we would get preempted after
>> -	 * the deboost but before waking our high-prio task, hence the
>> -	 * preempt_disable.
>> -	 */
>> -	if (deboost) {
>> -		preempt_disable();
>> -		rt_mutex_adjust_prio(current);
>> -	}
>> -
>>  	wake_up_q(wake_q);
>>  
>> +	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
>>  	if (deboost)
>>  		preempt_enable();
>>  }
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3568,6 +3568,8 @@ void rt_mutex_setprio(struct task_struct
>>  		goto out_unlock;
>>  	}
>>  
>> +	rt_mutex_update_top_task(p);
>> +
>>  	trace_sched_pi_setprio(p, prio);
>>  	oldprio = p->prio;
>>  
>>
>>


[-- Attachment #2: crash_deadline_pi.c --]
[-- Type: text/x-csrc, Size: 4149 bytes --]

#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <linux/unistd.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <sys/syscall.h>
#include <pthread.h>
#include <sched.h>
#include <time.h>

 #define gettid() syscall(__NR_gettid)

 #define SCHED_DEADLINE	6

 /* XXX use the proper syscall numbers */
 #ifdef __x86_64__
 #define __NR_sched_setattr		314
 #define __NR_sched_getattr		315
 #endif

 #ifdef __i386__
 #define __NR_sched_setattr		351
 #define __NR_sched_getattr		352
 #endif

 #ifdef __ppc64__
 #define __NR_sched_setattr		355
 #define __NR_sched_getattr		356
 #endif

#ifdef __s390x__
#define __NR_sched_setattr     345
#define __NR_sched_getattr     346
#endif


static volatile int done;

 struct sched_attr {
	__u32 size;

	__u32 sched_policy;
	__u64 sched_flags;

	/* SCHED_NORMAL, SCHED_BATCH */
	__s32 sched_nice;

	/* SCHED_FIFO, SCHED_RR */
	__u32 sched_priority;

	/* SCHED_DEADLINE (nsec) */
	__u64 sched_runtime;
	__u64 sched_deadline;
	__u64 sched_period;
 };

 int sched_setattr(pid_t pid,
		  const struct sched_attr *attr,
		  unsigned int flags)
 {
	return syscall(__NR_sched_setattr, pid, attr, flags);
 }

 int sched_getattr(pid_t pid,
		  struct sched_attr *attr,
		  unsigned int size,
		  unsigned int flags)
 {
	return syscall(__NR_sched_getattr, pid, attr, size, flags);
 }

pthread_mutex_t mutex_obj;
pthread_mutexattr_t mutex_attr;
int x = 0;

static int decide(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);

	if (ts.tv_nsec & 1)
		return 1;
	else
		return 0;
}

static void mutex_init(void)
{
	pthread_mutexattr_init(&mutex_attr);
	pthread_mutexattr_setprotocol(&mutex_attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&mutex_obj, &mutex_attr);
}

static deadline_ndelay(unsigned int cnt)
{
	unsigned int i;

	for (i = 0; i < 10000 * cnt; i++);
}

 void *run_deadline_special(void *data)
 {
	struct sched_attr attr;
	int ret, take;
	unsigned int flags = 0;

	attr.size = sizeof(attr);
	attr.sched_flags = 0;
	attr.sched_nice = 0;
	attr.sched_priority = 0;

	/* This creates a 10ms/30ms reservation */
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = 100 * 1000 * 1000;
	attr.sched_deadline = 200 * 1000 * 1000;
	attr.sched_period = 300 * 1000 * 1000;

	ret = sched_setattr(0, &attr, flags);
	if (ret < 0) {
		done = 0;
		perror("sched_setattr");
		exit(-1);
	}

	printf("special deadline thread started [%ld]\n", gettid());
	while (!done) {
		take = decide();
		if (take)
			pthread_mutex_lock(&mutex_obj);

		x++;
		deadline_ndelay((unsigned long) attr.sched_runtime % 7);

		if (take)
			pthread_mutex_unlock(&mutex_obj);
	}

	printf("special deadline thread dies [%ld]\n", gettid());
	return NULL;
 }

 void *run_deadline(void *data)
 {
	struct sched_attr attr;
	int ret, take;
	unsigned int flags = 0;
	static unsigned int delta = 0;


	attr.size = sizeof(attr);
	attr.sched_flags = 0;
	attr.sched_nice = 0;
	attr.sched_priority = 0;

	/* This creates a 10ms/30ms reservation */
	delta += 1000 * 1000 * 2;
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = 20 * 1000 * 1000 + delta;
	attr.sched_deadline = 400 * 1000 * 1000;
	attr.sched_period = 400 * 1000 * 1000;

	ret = sched_setattr(0, &attr, flags);
	if (ret < 0) {
		done = 0;
		perror("sched_setattr");
		exit(-1);
	}

	printf("deadline thread started [%ld]\n", gettid());
	while (!done) {
		take = decide();
		if (take)
			pthread_mutex_lock(&mutex_obj);

		x++;
		deadline_ndelay((unsigned long) attr.sched_runtime % 7);

		if (take)
			pthread_mutex_unlock(&mutex_obj);
	}

	printf("deadline thread dies [%ld]\n", gettid());
	return NULL;
 }

#define THREAD_NUM 10

 int main (int argc, char **argv)
 {
	pthread_t thread[THREAD_NUM];
	int i;

	mutex_init();

	printf("main thread [%ld]\n", gettid());

	for (i = 0; i < THREAD_NUM-1; i++)
		pthread_create(&thread[i], NULL, run_deadline, NULL);

	pthread_create(&thread[THREAD_NUM-1], NULL, run_deadline_special, NULL);

	sleep(3600*300);

	done = 1;
	for (i = 0; i < THREAD_NUM; i++)
		pthread_join(thread[i], NULL);

	printf("main dies [%ld]\n", gettid());
	return 0;
 }

next prev parent reply	other threads:[~2016-06-14 12:53 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-07 19:56 [RFC][PATCH 0/8] PI and assorted failings Peter Zijlstra
2016-06-07 19:56 ` [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
2016-06-14  9:09   ` Juri Lelli
2016-06-14 12:54     ` Peter Zijlstra
2016-06-14 13:20       ` Juri Lelli
2016-06-14 13:59         ` Peter Zijlstra
2016-06-14 16:36     ` Davidlohr Bueso
2016-06-14 17:01       ` Juri Lelli
2016-06-14 18:22   ` Steven Rostedt
2016-06-07 19:56 ` [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
2016-06-14 10:21   ` Juri Lelli
2016-06-14 12:30     ` Peter Zijlstra
2016-06-14 12:53     ` Xunlei Pang [this message]
2016-06-14 13:07       ` Juri Lelli
2016-06-14 16:39         ` Juri Lelli
2016-06-14 18:42   ` Steven Rostedt
2016-06-14 20:28     ` Peter Zijlstra
2016-06-15 16:14       ` Steven Rostedt
2016-06-07 19:56 ` [RFC][PATCH 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
2016-06-14 10:43   ` Juri Lelli
2016-06-14 12:09     ` Peter Zijlstra
2016-06-15 16:30   ` Steven Rostedt
2016-06-15 17:55     ` Peter Zijlstra
2016-06-07 19:56 ` [RFC][PATCH 4/8] rtmutex: Remove rt_mutex_fastunlock() Peter Zijlstra
2016-06-15 16:43   ` Steven Rostedt
2016-06-07 19:56 ` [RFC][PATCH 5/8] rtmutex: Clean up Peter Zijlstra
2016-06-14 12:08   ` Juri Lelli
2016-06-14 12:32     ` Peter Zijlstra
2016-06-14 12:41       ` Juri Lelli
2016-06-07 19:56 ` [RFC][PATCH 6/8] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
2016-06-14 13:14   ` Juri Lelli
2016-06-14 14:08     ` Peter Zijlstra
2016-06-07 19:56 ` [RFC][PATCH 7/8] sched,tracing: Update trace_sched_pi_setprio() Peter Zijlstra
2016-06-07 19:56 ` [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity Peter Zijlstra
2016-06-14 17:39   ` Juri Lelli
2016-06-14 19:44     ` Peter Zijlstra
2016-06-15  7:25       ` Juri Lelli
2016-06-27 12:23         ` Peter Zijlstra
2016-06-27 12:40           ` Thomas Gleixner
2016-06-28  9:05           ` Juri Lelli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=575FFE58.1090102@redhat.com \
    --to=xpang@redhat.com \
    --cc=bristot@redhat.com \
    --cc=jdesfossez@efficios.com \
    --cc=juri.lelli@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=xlpang@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.