Reply-To: xlpang@redhat.com
Subject: Re: [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
References: <20160607195635.710022345@infradead.org>
 <20160607200215.719626477@infradead.org>
 <20160614102109.GF5981@e106622-lin>
To: Juri Lelli, Peter Zijlstra
Cc: mingo@kernel.org, tglx@linutronix.de, rostedt@goodmis.org,
 xlpang@redhat.com, linux-kernel@vger.kernel.org,
 mathieu.desnoyers@efficios.com, jdesfossez@efficios.com,
 bristot@redhat.com, Ingo Molnar
From: Xunlei Pang
Message-ID: <575FFE58.1090102@redhat.com>
Date: Tue, 14 Jun 2016 20:53:44 +0800
In-Reply-To: <20160614102109.GF5981@e106622-lin>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2016/06/14 at 18:21, Juri Lelli wrote:
> Hi,
>
> On 07/06/16 21:56, Peter Zijlstra wrote:
>> From: Xunlei Pang
>>
>> A crash happened while I was playing with the deadline PI rtmutex.
>>
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>> IP: [] rt_mutex_get_top_task+0x1f/0x30
>> PGD 232a75067 PUD 230947067 PMD 0
>> Oops: 0000 [#1] SMP
>> CPU: 1 PID: 10994 Comm: a.out Not tainted
>>
>> Call Trace:
>>  [] enqueue_task+0x2c/0x80
>>  [] activate_task+0x23/0x30
>>  [] pull_dl_task+0x1d5/0x260
>>  [] pre_schedule_dl+0x16/0x20
>>  [] __schedule+0xd3/0x900
>>  [] schedule+0x29/0x70
>>  [] __rt_mutex_slowlock+0x4b/0xc0
>>  [] rt_mutex_slowlock+0xd1/0x190
>>  [] rt_mutex_timed_lock+0x53/0x60
>>  [] futex_lock_pi.isra.18+0x28c/0x390
>>  [] do_futex+0x190/0x5b0
>>  [] SyS_futex+0x80/0x180
>>
> This seems to be caused by the race condition you detail below between
> load balancing and PI code. I tried to reproduce the BUG on my box, but
> it looks hard to get. Do you have a reproducer I can give a try?
>
>> This happens because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
>> are only protected by pi_lock when operating on the pi waiters, while
>> rt_mutex_get_top_task() accesses them with the rq lock held but
>> without holding pi_lock.
>>
>> To tackle this, we introduce a new "pi_top_task" pointer cached in
>> task_struct, and add a new rt_mutex_update_top_task() to update its
>> value. It is called from rt_mutex_setprio(), which holds both the
>> owner's pi_lock and the rq lock, so "pi_top_task" can then be safely
>> accessed by enqueue_task_dl() under the rq lock.
>>
>> [XXX this next section is unparsable]
> Yes, a bit hard to understand. However, am I correct in assuming this
> patch and the previous one should fix this problem? Or are there still
> other races causing issues?

Yes, these two patches can fix the problem.

>
>> One problem remains: by the time rt_mutex_adjust_prio()->...->
>> rt_mutex_setprio() runs, the rtmutex has already been released and the
>> owner marked off, so the task "pi_top_task" points to may already be
>> running (it can be falsely woken up by others before rt_mutex_setprio()
>> gets to update "pi_top_task"). We solve this by calling
>> __rt_mutex_adjust_prio() directly from mark_wakeup_next_waiter(), which
>> holds both pi_lock and the rtmutex wait_lock, and by removing
>> rt_mutex_adjust_prio(). Since this moves the deboost point earlier, we
>> also move preempt_disable() to before unlocking the rtmutex, so that
>> current cannot be preempted due to the deboost before wake_up_q() runs.
>>
>> Cc: Steven Rostedt
>> Cc: Ingo Molnar
>> Cc: Juri Lelli
>> Originally-From: Peter Zijlstra
>> Signed-off-by: Xunlei Pang
>> Signed-off-by: Peter Zijlstra (Intel)
>> Link: http://lkml.kernel.org/r/1461659449-19497-2-git-send-email-xlpang@redhat.com
> The idea of this fix makes sense to me. But, I would like to be able to
> see the BUG and test the fix. What I have is a test in which I create N
> DEADLINE workers that share a PI mutex. They get migrated around and
> seem to stress PI code. But I couldn't hit the BUG yet. Maybe I let it
> run for some more time.

You can use the attached reproducer (gcc crash_deadline_pi.c -lpthread -lrt).
Start multiple instances, and it will hit the bug very soon.
Regards,
Xunlei

>
> Best,
>
> - Juri
>
>> ---
>>
>>  include/linux/init_task.h |    1 
>>  include/linux/sched.h     |    2 +
>>  include/linux/sched/rt.h  |    1 
>>  kernel/fork.c             |    1 
>>  kernel/locking/rtmutex.c  |   65 +++++++++++++++++++---------------------
>>  kernel/sched/core.c       |    2 +
>>  6 files changed, 34 insertions(+), 38 deletions(-)
>>
>> --- a/include/linux/init_task.h
>> +++ b/include/linux/init_task.h
>> @@ -162,6 +162,7 @@ extern struct task_group root_task_group
>>  #ifdef CONFIG_RT_MUTEXES
>>  # define INIT_RT_MUTEXES(tsk)					\
>>  	.pi_waiters = RB_ROOT,					\
>> +	.pi_top_task = NULL,					\
>>  	.pi_waiters_leftmost = NULL,
>>  #else
>>  # define INIT_RT_MUTEXES(tsk)
>>  #endif
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1681,6 +1681,8 @@ struct task_struct {
>>  	/* PI waiters blocked on a rt_mutex held by this task */
>>  	struct rb_root pi_waiters;
>>  	struct rb_node *pi_waiters_leftmost;
>> +	/* Updated under owner's pi_lock and rq lock */
>> +	struct task_struct *pi_top_task;
>>  	/* Deadlock detection and priority inheritance handling */
>>  	struct rt_mutex_waiter *pi_blocked_on;
>>  #endif
>> --- a/include/linux/sched/rt.h
>> +++ b/include/linux/sched/rt.h
>> @@ -19,6 +19,7 @@ static inline int rt_task(struct task_st
>>  extern int rt_mutex_getprio(struct task_struct *p);
>>  extern void rt_mutex_setprio(struct task_struct *p, int prio);
>>  extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
>> +extern void rt_mutex_update_top_task(struct task_struct *p);
>>  extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
>>  extern void rt_mutex_adjust_pi(struct task_struct *p);
>>  static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1219,6 +1219,7 @@ static void rt_mutex_init_task(struct ta
>>  #ifdef CONFIG_RT_MUTEXES
>>  	p->pi_waiters = RB_ROOT;
>>  	p->pi_waiters_leftmost = NULL;
>> +	p->pi_top_task = NULL;
>>  	p->pi_blocked_on = NULL;
>>  #endif
>>  }
>> --- a/kernel/locking/rtmutex.c
>> +++ b/kernel/locking/rtmutex.c
>> @@ -256,6 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
>>  	RB_CLEAR_NODE(&waiter->pi_tree_entry);
>>  }
>>  
>> +void rt_mutex_update_top_task(struct task_struct *p)
>> +{
>> +	if (!task_has_pi_waiters(p)) {
>> +		p->pi_top_task = NULL;
>> +		return;
>> +	}
>> +
>> +	p->pi_top_task = task_top_pi_waiter(p)->task;
>> +}
>> +
>>  /*
>>   * Calculate task priority from the waiter tree priority
>>   *
>> @@ -273,10 +283,7 @@ int rt_mutex_getprio(struct task_struct
>>  
>>  struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
>>  {
>> -	if (likely(!task_has_pi_waiters(task)))
>> -		return NULL;
>> -
>> -	return task_top_pi_waiter(task)->task;
>> +	return task->pi_top_task;
>>  }
>>  
>>  /*
>> @@ -285,12 +292,12 @@ struct task_struct *rt_mutex_get_top_tas
>>   */
>>  int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
>>  {
>> -	if (!task_has_pi_waiters(task))
>> +	struct task_struct *top_task = rt_mutex_get_top_task(task);
>> +
>> +	if (!top_task)
>>  		return newprio;
>>  
>> -	if (task_top_pi_waiter(task)->task->prio <= newprio)
>> -		return task_top_pi_waiter(task)->task->prio;
>> -	return newprio;
>> +	return min(top_task->prio, newprio);
>>  }
>>  
>>  /*
>> @@ -307,24 +314,6 @@ static void __rt_mutex_adjust_prio(struc
>>  }
>>  
>>  /*
>> - * Adjust task priority (undo boosting). Called from the exit path of
>> - * rt_mutex_slowunlock() and rt_mutex_slowlock().
>> - *
>> - * (Note: We do this outside of the protection of lock->wait_lock to
>> - * allow the lock to be taken while or before we readjust the priority
>> - * of task. We do not use the spin_xx_mutex() variants here as we are
>> - * outside of the debug path.)
>> - */
>> -void rt_mutex_adjust_prio(struct task_struct *task)
>> -{
>> -	unsigned long flags;
>> -
>> -	raw_spin_lock_irqsave(&task->pi_lock, flags);
>> -	__rt_mutex_adjust_prio(task);
>> -	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
>> -}
>> -
>> -/*
>>   * Deadlock detection is conditional:
>>   *
>>   * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
>> @@ -987,6 +976,7 @@ static void mark_wakeup_next_waiter(stru
>>  	 * lock->wait_lock.
>>  	 */
>>  	rt_mutex_dequeue_pi(current, waiter);
>> +	__rt_mutex_adjust_prio(current);
>>  
>>  	/*
>>  	 * As we are waking up the top waiter, and the waiter stays
>> @@ -1325,6 +1315,16 @@ static bool __sched rt_mutex_slowunlock(
>>  	 */
>>  	mark_wakeup_next_waiter(wake_q, lock);
>>  
>> +	/*
>> +	 * We should deboost before waking the top waiter task such that
>> +	 * we don't run two tasks with the 'same' priority. This however
>> +	 * can lead to prio-inversion if we would get preempted after
>> +	 * the deboost but before waking our high-prio task, hence the
>> +	 * preempt_disable before unlock. Pairs with preempt_enable() in
>> +	 * rt_mutex_postunlock();
>> +	 */
>> +	preempt_disable();
>> +
>>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>>  
>>  	/* check PI boosting */
>> @@ -1400,20 +1400,9 @@ rt_mutex_fastunlock(struct rt_mutex *loc
>>   */
>>  void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
>>  {
>> -	/*
>> -	 * We should deboost before waking the top waiter task such that
>> -	 * we don't run two tasks with the 'same' priority. This however
>> -	 * can lead to prio-inversion if we would get preempted after
>> -	 * the deboost but before waking our high-prio task, hence the
>> -	 * preempt_disable.
>> -	 */
>> -	if (deboost) {
>> -		preempt_disable();
>> -		rt_mutex_adjust_prio(current);
>> -	}
>> -
>>  	wake_up_q(wake_q);
>>  
>> +	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
>>  	if (deboost)
>>  		preempt_enable();
>>  }
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3568,6 +3568,8 @@ void rt_mutex_setprio(struct task_struct
>>  		goto out_unlock;
>>  	}
>>  
>> +	rt_mutex_update_top_task(p);
>> +
>>  	trace_sched_pi_setprio(p, prio);
>>  	oldprio = p->prio;
>>
>>

[attachment: crash_deadline_pi.c]

#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <errno.h>
#include <sched.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/types.h>

#define gettid() syscall(__NR_gettid)

#define SCHED_DEADLINE	6

/* XXX use the proper syscall numbers */
#ifdef __x86_64__
#define __NR_sched_setattr	314
#define __NR_sched_getattr	315
#endif

#ifdef __i386__
#define __NR_sched_setattr	351
#define __NR_sched_getattr	352
#endif

#ifdef __ppc64__
#define __NR_sched_setattr	355
#define __NR_sched_getattr	356
#endif

#ifdef __s390x__
#define __NR_sched_setattr	345
#define __NR_sched_getattr	346
#endif

static volatile int done;

struct sched_attr {
	__u32 size;
	__u32 sched_policy;
	__u64 sched_flags;

	/* SCHED_NORMAL, SCHED_BATCH */
	__s32 sched_nice;

	/* SCHED_FIFO, SCHED_RR */
	__u32 sched_priority;

	/* SCHED_DEADLINE (nsec) */
	__u64 sched_runtime;
	__u64 sched_deadline;
	__u64 sched_period;
};

int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags)
{
	return syscall(__NR_sched_setattr, pid, attr, flags);
}

int sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size,
		  unsigned int flags)
{
	return syscall(__NR_sched_getattr, pid, attr, size, flags);
}

pthread_mutex_t mutex_obj;
pthread_mutexattr_t mutex_attr;
int x = 0;

static int decide(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	if (ts.tv_nsec & 1)
		return 1;
	else
		return 0;
}

static void mutex_init(void)
{
	pthread_mutexattr_init(&mutex_attr);
	pthread_mutexattr_setprotocol(&mutex_attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&mutex_obj, &mutex_attr);
}

static void deadline_ndelay(unsigned int cnt)
{
	unsigned int i;

	for (i = 0; i < 10000 * cnt; i++)
		;
}

void *run_deadline_special(void *data)
{
	struct sched_attr attr;
	int ret, take;
	unsigned int flags = 0;

	attr.size = sizeof(attr);
	attr.sched_flags = 0;
	attr.sched_nice = 0;
	attr.sched_priority = 0;

	/* This creates a 100ms/300ms reservation */
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = 100 * 1000 * 1000;
	attr.sched_deadline = 200 * 1000 * 1000;
	attr.sched_period = 300 * 1000 * 1000;

	ret = sched_setattr(0, &attr, flags);
	if (ret < 0) {
		done = 0;
		perror("sched_setattr");
		exit(-1);
	}

	printf("special deadline thread started [%ld]\n", gettid());
	while (!done) {
		take = decide();
		if (take)
			pthread_mutex_lock(&mutex_obj);
		x++;
		deadline_ndelay((unsigned long)attr.sched_runtime % 7);
		if (take)
			pthread_mutex_unlock(&mutex_obj);
	}

	printf("special deadline thread dies [%ld]\n", gettid());
	return NULL;
}

void *run_deadline(void *data)
{
	struct sched_attr attr;
	int ret, take;
	unsigned int flags = 0;
	static unsigned int delta = 0;

	attr.size = sizeof(attr);
	attr.sched_flags = 0;
	attr.sched_nice = 0;
	attr.sched_priority = 0;

	/* This creates a ~20ms/400ms reservation */
	delta += 1000 * 1000 * 2;
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = 20 * 1000 * 1000 + delta;
	attr.sched_deadline = 400 * 1000 * 1000;
	attr.sched_period = 400 * 1000 * 1000;

	ret = sched_setattr(0, &attr, flags);
	if (ret < 0) {
		done = 0;
		perror("sched_setattr");
		exit(-1);
	}

	printf("deadline thread started [%ld]\n", gettid());
	while (!done) {
		take = decide();
		if (take)
			pthread_mutex_lock(&mutex_obj);
		x++;
		deadline_ndelay((unsigned long)attr.sched_runtime % 7);
		if (take)
			pthread_mutex_unlock(&mutex_obj);
	}

	printf("deadline thread dies [%ld]\n", gettid());
	return NULL;
}

#define THREAD_NUM	10

int main(int argc, char **argv)
{
	pthread_t thread[THREAD_NUM];
	int i;

	mutex_init();

	printf("main thread [%ld]\n", gettid());
	for (i = 0; i < THREAD_NUM - 1; i++)
		pthread_create(&thread[i], NULL, run_deadline, NULL);
	pthread_create(&thread[THREAD_NUM - 1], NULL, run_deadline_special, NULL);

	sleep(3600 * 300);

	done = 1;
	for (i = 0; i < THREAD_NUM; i++)
		pthread_join(thread[i], NULL);

	printf("main dies [%ld]\n", gettid());
	return 0;
}