From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: <504B214F.7020709@xenomai.org> Date: Sat, 08 Sep 2012 12:43:27 +0200 From: Gilles Chanteperdrix MIME-Version: 1.0 References: <50460BCE.8010505@xenomai.org> <50464969.2000902@xenomai.org> <5046549C.7030008@xenomai.org> <5046FF0A.9000208@xenomai.org> <50470D2D.8020004@xenomai.org> <5047318D.8060106@xenomai.org> <504735A6.5040800@xenomai.org> <5047B2B3.2050309@xenomai.org> <504B20CF.8010100@xenomai.org> In-Reply-To: <504B20CF.8010100@xenomai.org> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Subject: Re: [Xenomai] kernel NULL pointer dereference List-Id: Discussions about the Xenomai project List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Philippe Gerum Cc: Xenomai On 09/08/2012 12:41 PM, Philippe Gerum wrote: > On 09/05/2012 10:14 PM, Gilles Chanteperdrix wrote: >> On 09/05/2012 02:10 PM, Henri Roosen wrote: >> >>> On Wed, Sep 5, 2012 at 1:21 PM, Gilles Chanteperdrix >>> wrote: >>>>> Anyway, what seems to happen is that your application calls exit, while >>>>> some thread was waiting for a a PI mutex, the nucleus tries to send a >>>>> signal to the mutex holder. However, something gets wrong, and the mutex >>>>> holder task pointer is invalid. >>>>> >>>>> What is strange, also, is how a task can be waiting for a mutex and >>>>> calling exit at the same time. Could you try to increase the number of >>>>> trace points to say 1000 points? >>>> >>>> >>>> Answering myself. The thread killed is the one holding the mutex. The >>>> signal is sent to this precise thread, so this may fail because the >>>> thread is in the process of being destroyed, and its user_task pointer >>>> is no longer valid. >>> >>> Please find attached ipipe_trace_2.txt that has the number of >>> tracepoints to 1000. Note that this log also doesn't trace whether >>> irqs are off (arch_irqs_disable_flags is not in the current ipipe tree >>> yet either). >>> >>> I will find out why the application is doing a sys_exit. However I'm >>> not sure how this is related to the thread affinity; when not setting >>> the affinity, the problem is not reproducable. >> >> >> Please try the following patch, it should avoid the bug, but I will >> wait for Philippe's ack before pushing it: >> >> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c >> index 260fdef..1f1a737 100644 >> --- a/ksrc/nucleus/shadow.c >> +++ b/ksrc/nucleus/shadow.c >> @@ -274,7 +274,9 @@ static void rpi_update(struct xnthread *thread) >> if (rpi_p(thread)) { >> xnsched_pop_rpi(thread); >> thread->rpi = NULL; >> - rpi_push(sched, thread); >> + >> + if (xnthread_user_task(thread)) >> + rpi_push(sched, thread); >> } >> >> xnlock_put_irqrestore(&sched->rpilock, s); >> @@ -1516,15 +1518,18 @@ EXPORT_SYMBOL_GPL(xnshadow_start); >> /* Called with nklock locked, Xenomai interrupts off. */ >> void xnshadow_renice(struct xnthread *thread) >> { >> - /* >> - * We need to bound the priority values in the >> - * [1..MAX_RT_PRIO-1] range, since the Xenomai priority scale >> - * is a superset of the Linux priority scale. >> - */ >> - int prio = normalize_priority(thread->cprio); >> + if (xnthread_user_task(thread)) { >> + /* >> + * We need to bound the priority values in the >> + * [1..MAX_RT_PRIO-1] range, since the Xenomai priority scale >> + * is a superset of the Linux priority scale. >> + */ >> + int prio = normalize_priority(thread->cprio); >> >> - xnshadow_send_sig(thread, SIGSHADOW, >> - sigshadow_int(SIGSHADOW_ACTION_RENICE, prio), 1); >> + xnshadow_send_sig >> + (thread, SIGSHADOW, >> + sigshadow_int(SIGSHADOW_ACTION_RENICE, prio), 1); >> + } >> >> if (!xnthread_test_state(thread, XNDORMANT) && >> xnthread_sched(thread) == xnpod_current_sched()) >> >> > > This patch is correct, but it does not fix the root cause. RPI seems to > be a victim here, not the culprit. As you already noticed, the main > issue is with a thread holding a contented mutex, which exits. This > causes the PIP boost to be cleared for it over its own taskexit handler, > after its task_struct backlink has been cleared. This leads to > schedule_linux_call() being called for a NULL task; enabling > CONFIG_XENO_OPT_DEBUG_NUCLEUS triggers the assertion properly in this > routine. > > The source of all issues in this case is xnsynch_clear_boost() not > handling the particular case of a thread exiting with a contended mutex > held. We can reuse the thread zombie bit to flag this, and avoid any > useless renice. > > Normally, there should not be any SMP-specific triggers of this bug, > the nucleus lock is held all through the deletion and synch de-boost paths. Are we not missing an rpi_pop ? -- Gilles.