From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: <513723EC.6070203@xenomai.org> Date: Wed, 06 Mar 2013 12:09:32 +0100 From: Philippe Gerum MIME-Version: 1.0 References: <512FB9B5.9040709@xenomai.org> <51306545.1010200@xenomai.org> <5130663F.7070209@xenomai.org> <51306710.5030201@xenomai.org> <5130673D.2090700@xenomai.org> <513069D5.1090508@xenomai.org> <5135FC42.3010606@xenomai.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Xenomai] Xenomai-forge: thread using 100% cpu load List-Id: Discussions about the Xenomai project List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Ronny Meeus Cc: xenomai@xenomai.org On 03/06/2013 11:55 AM, Ronny Meeus wrote: > On Tue, Mar 5, 2013 at 3:08 PM, Philippe Gerum wrote: >> On 03/05/2013 01:43 PM, Ronny Meeus wrote: >>> >>> On Sat, Mar 2, 2013 at 12:13 PM, Ronny Meeus >>> wrote: >>>> >>>> On Fri, Mar 1, 2013 at 9:41 AM, Philippe Gerum wrote: >>>>> >>>>> On 03/01/2013 09:30 AM, Gilles Chanteperdrix wrote: >>>>>> >>>>>> >>>>>> On 03/01/2013 09:30 AM, Philippe Gerum wrote: >>>>>> >>>>>>> On 03/01/2013 09:26 AM, Gilles Chanteperdrix wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 03/01/2013 09:22 AM, Philippe Gerum wrote: >>>>>>>> >>>>>>>>> On 02/28/2013 09:22 PM, Thomas De Schampheleire wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Feb 28, 2013 at 9:10 PM, Gilles Chanteperdrix >>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 02/28/2013 08:19 PM, Ronny Meeus wrote: >>>>>>>>>>> >>>>>>>>>>>> Hello >>>>>>>>>>>> >>>>>>>>>>>> we are using the PSOS interface of Xenomai forge, running >>>>>>>>>>>> completely >>>>>>>>>>>> in user-space using the mercury code. >>>>>>>>>>>> We deploy our application on different processors, one product is >>>>>>>>>>>> running on PPC multicore (P4040, P4080, P4034) and another one on >>>>>>>>>>>> Cavium (8 core device). >>>>>>>>>>>> The Linux version we use is 2.6.32 but I would assume that this >>>>>>>>>>>> is >>>>>>>>>>>> not >>>>>>>>>>>> so relevant. >>>>>>>>>>>> >>>>>>>>>>>> Our Xenomai application is running on one of the cores (affinity >>>>>>>>>>>> is >>>>>>>>>>>> set), while the other cores are running other code. >>>>>>>>>>>> >>>>>>>>>>>> On both architectures we recently start to see issues that one >>>>>>>>>>>> thread >>>>>>>>>>>> is consuming 100% of the core on which the application is pinned. >>>>>>>>>>>> The thread that monopolizes the core is the thread internally >>>>>>>>>>>> used >>>>>>>>>>>> to >>>>>>>>>>>> manage the timers, running at the highest priority. >>>>>>>>>>>> The trigger for running into this behavior is currently unclear. >>>>>>>>>>>> If we only start a part of the application (platform management >>>>>>>>>>>> only), >>>>>>>>>>>> the issue is not observed. >>>>>>>>>>>> We see this on both an old version of Xenomai and a very recent >>>>>>>>>>>> one >>>>>>>>>>>> (pulled from the git repo yesterday). >>>>>>>>>>>> >>>>>>>>>>>> I will continue to debug this issue in the coming days and try >>>>>>>>>>>> isolate >>>>>>>>>>>> the code that is triggering it, but I can use hints from the >>>>>>>>>>>> community. >>>>>>>>>>>> Debugging is complex since once the load starts, the debugger is >>>>>>>>>>>> not >>>>>>>>>>>> reacting anymore. >>>>>>>>>>>> If I put breakpoints in the functions that are called when the >>>>>>>>>>>> timer >>>>>>>>>>>> expires (both oneshot and periodic), the process starts to clone >>>>>>>>>>>> itself and I endup with tens of them. >>>>>>>>>>>> >>>>>>>>>>>> Has anybody seen an issue like this before or does somebody has >>>>>>>>>>>> some >>>>>>>>>>>> hints on how to debug this problem? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> First enable the watchdog. It will send a signal to the >>>>>>>>>>> application >>>>>>>>>>> when >>>>>>>>>>> detecting a problem, then you can use the watchdog to trigger an >>>>>>>>>>> I-pipe >>>>>>>>>>> tracer trace when the bug happens. You will probably have to >>>>>>>>>>> increase >>>>>>>>>>> the watchdog polling frequency, in order to have a meaningful >>>>>>>>>>> trace. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I don't think an I-pipe tracer will be possible when using the >>>>>>>>>> Mercury >>>>>>>>>> core, right (xenomai-forge) ? >>>>>>>>>> >>>>>>>>> >>>>>>>>> Correct. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I do not think so. The way I see it, you can enable the I-pipe tracer >>>>>>>> without CONFIG_XENOMAI. >>>>>>>> >>>>>>> >>>>>>> Mercury has NO pipeline in the kernel. >>>>>>> >>>>>> >>>>>> You mean mercury can not run with an I-pipe kernel? >>>>>> >>>>> >>>>> I mean it does not care about the pipeline, it does not need it. So if >>>>> this >>>>> is about observing kernel activity, then ftrace should be fine, or >>>>> possibly >>>>> perf to find out where userland spends time. >>>>> >>>>> -- >>>>> Philippe. >>>>> >>>>> >>>>> _______________________________________________ >>>>> Xenomai mailing list >>>>> Xenomai@xenomai.org >>>>> http://www.xenomai.org/mailman/listinfo/xenomai >>>> >>>> >>>> Hello >>>> >>>> An update on the investigation: >>>> I was able to make this issue disappear by changing the timeout value >>>> of the smallest timers we use. >>>> We use a couple of timers with a timeout of 25ms. By enlarging these >>>> to 25sec and the problem is gone. >>>> >>>> Yesterday I was also able to see (using the"strace" tool) the process >>>> executing constantly "clone" system calls. >>>> Note that the process we use is large (2Gb) and uses an mlockall call. >>>> >>>> In >>>> http://stackoverflow.com/questions/4263958/some-information-on-timer-helper-thread-of-librt-so-1/4935895#4935895 >>>> I see that a new thread is created when the timer_create is called for >>>> the first time. This thread stays alive until the program exits and is >>>> used to process the timer expiries. >>>> I have the feeling that there is an issue during the creation of this >>>> thread. For example what would happen if the clone operation takes >>>> longer than the time needed to perform the clone operation? >>>> In the past we already observed issues with the clone call that we >>>> could not explain (creation of the clone simply failed on our >>>> application while it was working fine on a smaller application). >>>> >>>> Do you guys know whether there is an impact on the clone operation by >>>> this mlockall call? >>>> >>>> I will try to make a small test application on which the issue can be >>>> reproduced. >>>> >>>> --- >>>> Ronny >>> >>> >>> I'm able to reproduce the issue on a small test build: >>> >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> >>> static void foo (u_long a0, u_long a1, u_long a2, u_long a3) >>> { >>> u_long ret, ev = 0, tmid,tmid2; >>> >>> ret = tm_evevery(1,1,&tmid); >>> ret = tm_evafter(30000,4,&tmid2); >>> while (1) { >>> ret = ev_receive(0xFF,EV_ANY|EV_WAIT,0,&ev); >>> if (ev & 4) { >>> printf("%lx Restarting one-shot timer. >>> ev=%lx\n",ret,ev); >>> tm_evafter(30000,4,&tmid2); >>> } >>> ev = 0; >>> } >>> tm_wkafter(100); >>> } >>> >>> int main(int argc, char * const *argv) >>> { >>> u_long ret, tid = 0, args[4]; >>> >>> mlockall(MCL_CURRENT | MCL_FUTURE); >>> copperplate_init(&argc,&argv); >>> >>> ret = t_create("TEST",97, 0, 0, 0, &tid); >>> printf("t_create(tid=%lu) = %lu\n", tid, ret); >>> args[0] = 1; >>> args[1] = 2; >>> args[2] = 3; >>> args[3] = 4; >>> ret = t_start(tid, 0, foo, args); >>> printf("t_start(tid=%lu) = %lu\n", tid, ret); >>> >>> while (1) >>> tm_wkafter(100); >>> return 0; >>> } >>> >>> The TEST task starts 2 timers: one periodic and one 1shot timer. >>> Each time the one-shot timer expires, a print is done and the timer is >>> restarted. >>> >>> Observation is that once the one-shot timer expires, the application >>> starts to use 100% cpuload on one core and the application code is not >>> executed anymore. So it looks like there is constant processing in >>> either Xenomai or the library code to process the timer handling. If >>> periodic timers are used the issue is not observed. >>> >> >> I can't reproduce this bug using that test code, over glibc 2.15/x86. We >> might have a problem with SIGEV_THREAD. Which glibc release are you running? >> > > Philip, > do you have a reference to the issue that you are suspecting and a Nothing specific I can confirm yet. > view on which version of the glib we need to use to solve it? > > So far I have the test running fine over glibc 2.15(x86) and eglibc 2.13(ppc). I have an outdated glibc 2.8(arm) I'm about to test which might give me a different status. I don't think I'll keep running timers over SIGEV_THREAD in the upcoming -forge work anyway, the spec leaves too much room for interpretation with respect to the underlying implementation. Typically, one server thread per-timer would be quite of a problem with legacy systems firing tenths of timeout timers used as plain watchdogs. -- Philippe.