From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <542D87AA.6000607@xenomai.org>
Date: Thu, 02 Oct 2014 19:13:14 +0200
From: Gilles Chanteperdrix
MIME-Version: 1.0
References: <542A945E.9000904@xenomai.org> <542A9F17.7090802@xenomai.org> <542BB31F.8070803@xenomai.org> <542BC753.8000605@xenomai.org> <542D54C0.3070001@xenomai.org>
In-Reply-To:
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [Xenomai] Switchtest failures on ODROIDU3
List-Id: Discussions about the Xenomai project
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: GP Orcullo
Cc: "xenomai@xenomai.org"

On 10/02/2014 05:52 PM, GP Orcullo wrote:
> On Thu, Oct 2, 2014 at 9:36 PM, Gilles Chanteperdrix
> wrote:
>> On 10/02/2014 03:27 PM, GP Orcullo wrote:
>>> On Wed, Oct 1, 2014 at 5:20 PM, Gilles Chanteperdrix
>>> wrote:
>>>> On 10/01/2014 11:12 AM, GP Orcullo wrote:
>>>>> On Oct 1, 2014 3:54 PM, "Gilles Chanteperdrix" <
>>>>> gilles.chanteperdrix@xenomai.org> wrote:
>>>>>>
>>>>>> On 10/01/2014 01:32 AM, GP Orcullo wrote:
>>>>>>> On Sep 30, 2014 8:16 PM, "Gilles Chanteperdrix" <
>>>>>>> gilles.chanteperdrix@xenomai.org> wrote:
>>>>>>>>
>>>>>>>> On 09/30/2014 02:04 PM, GP Orcullo wrote:
>>>>>>>>> On Sep 30, 2014 7:30 PM, "Gilles Chanteperdrix" <
>>>>>>>>> gilles.chanteperdrix@xenomai.org> wrote:
>>>>>>>>>>
>>>>>>>>>> On 09/30/2014 07:31 AM, GP Orcullo wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Running the switchtest for extended periods (>10 mins) causes the
>>>>>>>>>>> machine to lockup.
>>>>>>>>>>>
>>>>>>>>>>> I'm running a modified xeno-regression-test which contains only the
>>>>>>>>>>> following tests:
>>>>>>>>>>>
>>>>>>>>>>> check_alive /usr/lib/xenomai/testsuite/switchtest
>>>>>>>>>>> check_alive /usr/lib/xenomai/testsuite/switchtest -s 1000
>>>>>>>>>>> check_alive /usr/lib/xenomai/testsuite/latency ${1+"$@"}
>>>>>>>>>>>
>>>>>>>>>>> The script is invoked with the following arguments:
>>>>>>>>>>>
>>>>>>>>>>> nohup sudo ./xeno-regression-test -l
>>>>>>>>>>> "/usr/lib/xenomai/testsuite/dohell -m /media/work 36000" -t 2 >
>>>>>>>>>>> /dev/null & top -d0.5
>>>>>>>>>>>
>>>>>>>>>>> The kernel dumps the OOPS information intermittently so it's difficult
>>>>>>>>>>> to diagnose the issue.
>>>>>>>>>>>
>>>>>>>>>>> Attached is the kernel config and the logfile.
>>>>>>>>>>
>>>>>>>>>> Ok, this is an exynos. Sorry, but I have never seen the patch for
>>>>>>>>>> exynos, so I do not know what is inside. You should direct your
>>>>>>>>>> questions to whoever provided you with this support.
>>>>>>>>>
>>>>>>>>> I'm in the process of porting xenomai to run on exynos.
>>>>>>>>>
>>>>>>>>> The ipipe-core-3.8.13-arm-3.patch applies cleanly to the 3.8.13.11 kernel
>>>>>>>>> used by the odroid U3 board.
>>>>>>>>>
>>>>>>>>> Attached is the ipipe patch that I've made.
>>>>>>>>>
>>>>>>>>> I was just wondering what would cause switchtest to fail. The error that I
>>>>>>>>> can see is that the system is running out of memory and I don't know
>>>>>>>>> exactly what is causing this.
>>>>>>>>
>>>>>>>> Certainly not switchtest as it does not do any memory allocation.
>>>>>>>> However, the dohell script has a loop creating a large file and removing
>>>>>>>> it. So, could you try and run the dohell script with an unpatched kernel
>>>>>>>> and see if you have the error?
>>>>>>>>
>>>>>>>
>>>>>>> Running dohell on a patched and unpatched kernel doesn't trigger the lockup.
>>>>>>>
>>>>>>> Running switchtest without dohell works OK.
>>>>>>
>>>>>> Is the problem a lockup, or an OOM?
>>>>>>
>>>>>
>>>>> It's a lockup.
>>>>>
>>>>> The OOM message is the only one that I've captured so far. Most of the
>>>>> time the kernel doesn't spew any messages before the lockup.
>>>>>
>>>>> The lockups are repeatable but generating any error messages isn't.
>>>>
>>>> Are you running the tests on the serial console, or with ssh? Do you
>>>> have unlocked context switch enabled? Have you tried enabling some debug
>>>> options?
>>>>
>>>
>>> I'm using the serial console to log the kernel messages and ssh to run
>>> the command. Using purely the serial console has the same results.
>>
>> The main point was to avoid redirecting standard error to /dev/null to
>> see any application error message. Doing this on the serial console may
>> be a better idea than on ssh, because it means you are less likely to
>> miss a message that would be sent just prior to the system dying.
>>
>>>
>>> Is this the context switch?: "CONFIG_XENO_HW_UNLOCKED_SWITCH=y"
>>
>> Yes, please try to disable it if you have it enabled.
>>
>>>
>>> I will try playing again with the debug options and see if I can get
>>> something useful.
>>>
>>>> Also note that xeno-regression-test puts the system under a lot of
>>>> stress, so it may happen that there is no output for some time (several
>>>> minutes), normally the test should stop by itself if there is no output
>>>> for something like 30 minutes. So, I would recommend not redirecting
>>>> xeno-test output to see if there is any error before the lockup, and
>>>> when you see the lockup, leave the system for 30 minutes to see if it
>>>> does not restart or if xeno-regression-test can exit gracefully.
>>>>
>>>
>>> This is a total lockup. There's a heartbeat led that dies when it occurs.
>>
>> Well the heartbeat led does not prove anything: some Linux kernel
>> activity can very well prevent it from being toggled. Say if for
>> instance it is toggled by a thread and the activity that hogs the kernel
>> is a softirq that never ends.
>>
>>>
>>> Attached is one error log that I had captured previously and this one
>>> had the CONFIG_CPU_IDLE enabled. I've lost track on which kernel this
>>> trace came from but maybe the error looks familiar.
>>
>> This trace is missing an important piece of information: the reason for
>> the error. So, please capture the serial console to a file, and post the
>> complete file, from boot up to the error.
>>
>> Anyway, you did not answer my question: did you try to leave the
>> system on for say 30 minutes or 1 hour after the lockup to see if it
>> does not recover?
>>
>>
>
> The system never recovered.
>
> With the context switch disabled, I was able to capture this error:
>
> [ 210.482299] INFO: rcu_preempt detected stalls on CPUs/tasks:)
> [ 210.487790] Task dump for CPU 2:
> [ 210.490995] switchtest R running 0 3915 3639 0x00000002
> [ 210.497340] [] (__schedule+0x1fc/0x5f8) from [<00000010>] (0x10)
> [ 390.507943] INFO: rcu_preempt detected stalls on CPUs/tasks: { 2} (detected )
> [ 390.513510] Task dump for CPU 2:
> [ 390.516716] switchtest R running 0 3915 3639 0x00000002
> [ 390.523065] [] (__schedule+0x1fc/0x5f8) from [<00000010>] (0x10)
>
> points to the following section:
>
> #ifndef __ARCH_WANT_UNLOCKED_CTXSW
> spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
> c0453dc8: ebf04b13 bl c0066a1c
> #endif
>
> context_tracking_task_switch(prev, next);

You do not have context tracking enabled, right?

-- 
Gilles.
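
For readers following the thread: the two configuration questions raised here (unlocked context switch, context tracking) can be answered by looking the options up in the kernel config. The following is only a minimal sketch, assuming a plain-text config file. kconfig_state is a hypothetical helper, not part of Xenomai or the kernel tree, and the option name CONFIG_CONTEXT_TRACKING is an assumption that should be checked against the Kconfig files of the kernel actually in use.

```shell
#!/bin/sh
# Hypothetical helper (not part of Xenomai or the kernel tree): report how
# a kernel option is set in a plain-text kernel config file.
# Usage: kconfig_state OPTION FILE
# Prints the option's value (e.g. "y"), "not set", or "absent".
kconfig_state() {
    opt=$1
    file=$2
    # Match either "OPTION=value" or "# OPTION is not set"; the trailing
    # "=" or " is not set" keeps longer option names from matching.
    line=$(grep -E "^(# )?${opt}(=| is not set)" "$file" | head -n 1) || true
    case "$line" in
        "${opt}="*)            printf '%s\n' "${line#*=}" ;;
        "# ${opt} is not set") printf 'not set\n' ;;
        *)                     printf 'absent\n' ;;
    esac
}
```

With the config saved to a file (for instance the one attached earlier in the thread, or the output of zcat /proc/config.gz on kernels built with CONFIG_IKCONFIG_PROC), calling kconfig_state CONFIG_XENO_HW_UNLOCKED_SWITCH followed by the file name, and likewise for CONFIG_CONTEXT_TRACKING, shows whether each option is enabled.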