From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <504B329C.2070805@xenomai.org>
Date: Sat, 08 Sep 2012 13:57:16 +0200
From: Gilles Chanteperdrix
MIME-Version: 1.0
References: <50460BCE.8010505@xenomai.org> <50464969.2000902@xenomai.org>
 <5046549C.7030008@xenomai.org> <5046FF0A.9000208@xenomai.org>
 <50470D2D.8020004@xenomai.org> <5047318D.8060106@xenomai.org>
 <504735A6.5040800@xenomai.org> <5047B2B3.2050309@xenomai.org>
 <504B20CF.8010100@xenomai.org> <504B214F.7020709@xenomai.org>
In-Reply-To: <504B214F.7020709@xenomai.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [Xenomai] kernel NULL pointer dereference
List-Id: Discussions about the Xenomai project
To: Philippe Gerum
Cc: Xenomai

On 09/08/2012 12:43 PM, Gilles Chanteperdrix wrote:
> On 09/08/2012 12:41 PM, Philippe Gerum wrote:
>
>> On 09/05/2012 10:14 PM, Gilles Chanteperdrix wrote:
>>> On 09/05/2012 02:10 PM, Henri Roosen wrote:
>>>
>>>> On Wed, Sep 5, 2012 at 1:21 PM, Gilles Chanteperdrix
>>>> wrote:
>>>>>> Anyway, what seems to happen is that your application calls exit, while
>>>>>> some thread was waiting for a PI mutex, the nucleus tries to send a
>>>>>> signal to the mutex holder. However, something goes wrong, and the mutex
>>>>>> holder task pointer is invalid.
>>>>>>
>>>>>> What is strange, also, is how a task can be waiting for a mutex and
>>>>>> calling exit at the same time. Could you try to increase the number of
>>>>>> trace points to say 1000 points?
>>>>>
>>>>>
>>>>> Answering myself. The thread killed is the one holding the mutex. The
>>>>> signal is sent to this precise thread, so this may fail because the
>>>>> thread is in the process of being destroyed, and its user_task pointer
>>>>> is no longer valid.
>>>>
>>>> Please find attached ipipe_trace_2.txt that has the number of
>>>> tracepoints set to 1000.
>>>> Note that this log also doesn't trace whether
>>>> irqs are off (arch_irqs_disable_flags is not in the current ipipe tree
>>>> yet either).
>>>>
>>>> I will find out why the application is doing a sys_exit. However I'm
>>>> not sure how this is related to the thread affinity; when not setting
>>>> the affinity, the problem is not reproducible.
>>>
>>>
>>> Please try the following patch, it should avoid the bug, but I will
>>> wait for Philippe's ack before pushing it:
>>>
>>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>>> index 260fdef..1f1a737 100644
>>> --- a/ksrc/nucleus/shadow.c
>>> +++ b/ksrc/nucleus/shadow.c
>>> @@ -274,7 +274,9 @@ static void rpi_update(struct xnthread *thread)
>>>          if (rpi_p(thread)) {
>>>                  xnsched_pop_rpi(thread);
>>>                  thread->rpi = NULL;
>>> -                rpi_push(sched, thread);
>>> +
>>> +                if (xnthread_user_task(thread))
>>> +                        rpi_push(sched, thread);
>>>          }
>>>
>>>          xnlock_put_irqrestore(&sched->rpilock, s);
>>> @@ -1516,15 +1518,18 @@ EXPORT_SYMBOL_GPL(xnshadow_start);
>>>  /* Called with nklock locked, Xenomai interrupts off. */
>>>  void xnshadow_renice(struct xnthread *thread)
>>>  {
>>> -        /*
>>> -         * We need to bound the priority values in the
>>> -         * [1..MAX_RT_PRIO-1] range, since the Xenomai priority scale
>>> -         * is a superset of the Linux priority scale.
>>> -         */
>>> -        int prio = normalize_priority(thread->cprio);
>>> +        if (xnthread_user_task(thread)) {
>>> +                /*
>>> +                 * We need to bound the priority values in the
>>> +                 * [1..MAX_RT_PRIO-1] range, since the Xenomai priority scale
>>> +                 * is a superset of the Linux priority scale.
>>> +                 */
>>> +                int prio = normalize_priority(thread->cprio);
>>>
>>> -        xnshadow_send_sig(thread, SIGSHADOW,
>>> -                          sigshadow_int(SIGSHADOW_ACTION_RENICE, prio), 1);
>>> +                xnshadow_send_sig
>>> +                        (thread, SIGSHADOW,
>>> +                         sigshadow_int(SIGSHADOW_ACTION_RENICE, prio), 1);
>>> +        }
>>>
>>>          if (!xnthread_test_state(thread, XNDORMANT) &&
>>>              xnthread_sched(thread) == xnpod_current_sched())
>>>
>>>
>>
>> This patch is correct, but it does not fix the root cause. RPI seems to
>> be a victim here, not the culprit. As you already noticed, the main
>> issue is with a thread holding a contended mutex, which exits. This
>> causes the PIP boost to be cleared for it over its own taskexit handler,
>> after its task_struct backlink has been cleared. This leads to
>> schedule_linux_call() being called for a NULL task; enabling
>> CONFIG_XENO_OPT_DEBUG_NUCLEUS triggers the assertion properly in this
>> routine.
>>
>> The source of all issues in this case is xnsynch_clear_boost() not
>> handling the particular case of a thread exiting with a contended mutex
>> held. We can reuse the thread zombie bit to flag this, and avoid any
>> useless renice.
>>
>> Normally, there should not be any SMP-specific triggers of this bug,
>> the nucleus lock is held all through the deletion and synch de-boost paths.
>
>
> Are we not missing an rpi_pop?
>

Ok. It happens in xnshadow_unmap anyway.

-- 
Gilles.
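
The idea Philippe describes above (mark the exiting thread with the zombie
bit so that xnsynch_clear_boost() skips the renice) can be illustrated with a
small stand-alone model. This is only a sketch in plain C, not nucleus code:
every toy_* name, the struct layout and the return codes are hypothetical and
exist only for illustration. The real types and calls in ksrc/nucleus differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the real Xenomai types or calls. */
struct toy_task {
	int linux_prio;             /* priority seen on the Linux side */
};

struct toy_thread {
	struct toy_task *user_task; /* backlink, cleared while exiting */
	bool zombie;                /* set once deletion has started */
	int bprio;                  /* base priority */
	int cprio;                  /* current, possibly boosted, priority */
	int renice_count;           /* renice requests actually delivered */
};

/*
 * Models the renice request sent to the Linux side. Returns -1 when the
 * backlink is already gone, which is the situation that ended in a NULL
 * dereference in the report.
 */
static int toy_send_renice(struct toy_thread *t)
{
	assert(t != NULL);
	if (t->user_task == NULL)
		return -1;
	t->user_task->linux_prio = t->cprio;
	t->renice_count++;
	return 0;
}

/* Models the exit path: flag the thread first, then drop the backlink. */
static void toy_exit(struct toy_thread *t)
{
	t->zombie = true;
	t->user_task = NULL;
}

/*
 * Models the de-boost path with the proposed guard: the priority is still
 * reset, but a zombie thread is never reniced because nothing is left on
 * the Linux side to receive the request.
 */
static int toy_clear_boost(struct toy_thread *t)
{
	t->cprio = t->bprio;
	if (t->zombie)
		return 0;
	return toy_send_renice(t);
}
```

In this model a live thread that drops its boost is reniced once, while a
thread that has gone through toy_exit() only gets its cprio reset. Without the
zombie check, toy_clear_boost() would reach toy_send_renice() with a NULL
backlink, which is the failing case discussed in this thread.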