* RT-thread on cpu0 affects performance of RT-thread on isolated cpu1
@ 2018-02-28 21:11 Yann le Chevoir
  2018-03-02  9:43 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 5+ messages in thread
From: Yann le Chevoir @ 2018-02-28 21:11 UTC (permalink / raw)
  To: linux-rt-users

[-- Attachment #1: Type: text/plain, Size: 4549 bytes --]

Hello,

I am an engineering student and I am trying to prove that a 4000Hz hard real-time
application can run on an ARM board rather than on a more powerful machine.

I work with an IMX6 dual-core and PREEMPT_RT patch-4.1.38-rt46.
I expected that my 4000Hz thread would perform better if it were the only one
on core1, so I added the boot argument isolcpus=1 and bound my thread to cpu1.

With isolcpus=1, note that these processes remain on core1:

   PID    PSR    RTPRIO    CMD
   16     1      99        [migration/1]
   17     1      -         [rcuc/1]
   18     1      1         [ktimersoftd/1]
   19     1      -         [ksoftirqd/1]
   20     1      99        [posixcputmr/1]
   21     1      -         [kworker/1:0]
   22     1      -         [kworker/1:0H]

I tried several permutations in my kernel configuration and boot args
(rcu_nocbs is an example) and none affected the results I describe below.


I use a script to stress Linux. I expected that only cpu0 would be stressed
since cpu1 is isolated, but the stress has an impact on the thread on cpu1 too.
I think this is normal.


First, as I drew it (in red) in “expected_behavior.png”, I expected much less
variation in the Latency and especially in the Execution time
(my thread always does the same thing).

How can we explain so much variation in these times? As I said, I tried to
deactivate all interrupts on cpu1 (RCU and the other processes listed above)
but I am not very familiar with that.



Then I was even more surprised when, while trying to debug this, I put
another thread on core0 and it improved the behavior of the thread on core1!


My application looks like:

main(){

     create a first 4000Hz thread (thread1), prio = 99, cpu = 1
     /*cpu1 is isolated*/

     create a second 4000Hz thread (thread0), prio = 98, cpu = 0
     /* Creating this thread (on cpu0) improves the performance of
      * the other thread (on cpu1)! */

     start both threads

     while(1){
          print_stat();
     }

}

thread1(){

     struct timespec start, stop, next, interval = 250us;

     /* Initialization of the periodicity */
     clock_gettime(CLOCK_REALTIME, &next);
     next += interval;

     while(1){
          /*Releases at specified rate*/
          clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &next, NULL);
          /*Get time to check jitter and execution time*/
          clock_gettime(CLOCK_REALTIME, &start);
          do_job();
          /*Get time to check execution time*/
          clock_gettime(CLOCK_REALTIME, &stop);
          do_stat(); //jitter = start-next; exec_time = stop-start
          next += interval;
     }

}

thread0(){
    struct timespec next, interval = 250us;

     /* Initialization of the periodicity */
     clock_gettime(CLOCK_REALTIME, &next);
     next += interval;

     while(1){
          /*Releases at specified rate*/
          clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &next, NULL);
          usleep(100);
          /****************************************************************
           * Without sleeping 100us, only the Latency of the other thread *
           * (on cpu1) is improved.                                       *
           * Sleeping 100us in this new 4000Hz thread (cpu0) improved     *
           * the execution time of the other thread (on cpu1)...          *
           ****************************************************************/
          next += interval;
     }

}
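
(For reference, a minimal sketch of how the creation, pinning and priority of
such a thread could be set up with pthreads - illustrative only, not my actual
code:)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Illustrative sketch: create a thread pinned to 'cpu' with
 * SCHED_FIFO priority 'prio'. Error checking omitted for brevity. */
static pthread_t create_rt_thread(void *(*fn)(void *), int cpu, int prio)
{
    pthread_t tid;
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = prio };
    cpu_set_t cpus;

    CPU_ZERO(&cpus);
    CPU_SET(cpu, &cpus);

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);
    pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

    pthread_create(&tid, &attr, fn, NULL);
    return tid;
}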

As you can see in “background_thread_on_core_0.png”, the Latency and the
Execution time (of the thread on core1) are improved (in comparison with
“no_background_thread.png”) when there is a new 4000Hz thread on cpu0
AND when this thread does something...

I tried a lot of permutations and I do not understand:
- If the new thread (cpu0) is at 5000Hz (>4000Hz), then observations
  are the same (performance of the thread on cpu1 improves)
- If the new thread is at 2000Hz (<4000Hz), then there is no improvement...

- If the new thread (4000Hz on cpu0) does something (even sleeping enough
  time), then the Execution time of the thread on cpu1 improves.
- If the new thread does nothing (or does too little), then ONLY the
  Latency of the thread on cpu1 is improved...

Do you have any experience with this, or any idea how to debug it?
I wonder whether the scheduler or the clock tick is bound to cpu0 and whether
that can play a role in the responsiveness of the thread on cpu1 (the isolated one).

Thanks,

Regards,

Yann

[-- Attachment #2: background_thread_on_core_0.png --]
[-- Type: image/png, Size: 5632 bytes --]

[-- Attachment #3: expected_behavior.png --]
[-- Type: image/png, Size: 12878 bytes --]

[-- Attachment #4: no_background_thread.png --]
[-- Type: image/png, Size: 5892 bytes --]


* Re: RT-thread on cpu0 affects performance of RT-thread on isolated cpu1
  2018-02-28 21:11 RT-thread on cpu0 affects performance of RT-thread on isolated cpu1 Yann le Chevoir
@ 2018-03-02  9:43 ` Sebastian Andrzej Siewior
  2018-03-06 21:16   ` Yann le Chevoir
  0 siblings, 1 reply; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-02  9:43 UTC (permalink / raw)
  To: Yann le Chevoir; +Cc: linux-rt-users

On 2018-02-28 22:11:10 [+0100], Yann le Chevoir wrote:
> Hello,
Hi,

> I am an engineering student and I try to proof that a 4000Hz hard real-time
> application can run on an ARM board rather than on a more powerful machine.
> 
> I work with an IMX6 dual-core and PREEMPT_RT patch-4.1.38-rt46.
> I expected that my 4000Hz thread will perform better if it is the only one
> on core1, so I put the boot argument isolcpus=1 and bound my thread to cpu1.
> 
> With the isolcpus=1, note that it remains these processes on core1:
> 
>    PID    PSR    RTPRIO    CMD
>    16     1      99        [migration/1]
>    17     1      -         [rcuc/1]
>    18     1      1         [ktimersoftd/1]
>    19     1      -         [ksoftirqd/1]
>    20     1      99        [posixcputmr/1]
>    21     1      -         [kworker/1:0]
>    22     1      -         [kworker/1:0H]
> 
> I tried several permutations in my kernel configuration and boot args
> (rcu_nocbs is an example) and none affected the results I describe below.

In general, with isolcpus you should be able to get most tasks off CPU1.
The rcu_nocbs option should get the RCU callbacks off CPU1. And then you
can change the IRQ affinity to CPU0. But this looks good.

> I use a script to stress Linux. I expected that only cpu0 will be stressed
> as cpu1 is isolated. But it has an impact on thread on cpu1 too.
> I think it is normal.

The CPU caches might be shared. Locks which are held by CPU0 and
required by CPU1 will also slow down CPU1.
The sched_switch tracer should show you if anything of your "script to
stress" migrates to CPU1 and/or delays CPU1. That would make
your task on CPU1 go from state R to D and back to R.

> First, as I draw it (in red) on “expected_behavior.png”, I expected much less
> variations in the Latency and especially the Execution time.
> (My thread always does the same thing).

You could try cyclictest and compare its latency. However, from the plot
it looks like about 25us, so it is not that bad.

> How can we explain so much time variations? As I said, I tried to deactivate
> all interrupts on cpu1 (rcu and others processes above) but I am not very
> familiar with that.
As I suggested above, run the sched_switch tracer, measure the latency and, if
you hit an execution time of 150us, tell your application to disable the
trace (writing '0' to the tracing_on sysfs file would do it). Then
you can look at the trace and see if it got interrupted by anything.
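
Something like this sketch might do from the application side (it assumes
debugfs is mounted at /sys/kernel/debug and uses an illustrative 150us
threshold; adjust both to your setup):

#include <fcntl.h>
#include <unistd.h>

static int trace_fd = -1;

/* Open once at startup so the write in the hot path is cheap. */
static void trace_init(void)
{
    trace_fd = open("/sys/kernel/debug/tracing/tracing_on", O_WRONLY);
}

/* Call from do_stat(); freezes the trace buffer when the budget is blown. */
static void trace_stop_if_late(long exec_time_us)
{
    if (trace_fd >= 0 && exec_time_us > 150)
        write(trace_fd, "0", 1);
}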

> Then, I am even more surprised when, trying to debug that, I decided to put
> another thread on core0 and it improved the behavior of the thread on core1!

If you put more load on the system and the numbers improve then this
might be because the caches are filled with the "right" data. Also it
might prevent power management from doing its job. 

> My application looks like:
> thread1(){
> 
>      struct timespec start, stop, next, interval = 250us;
> 
>      /* Initialization of the periodicity */
>      clock_gettime(CLOCK_REALTIME, &next);

CLOCK_MONOTONIC is what you want.
>      next += interval;
> 
>      while(1){
>           /*Releases at specified rate*/
>           clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &next, NULL);

I know, error checking is hard. But you should check whether the above
returned an error and, since you "just" looked at the time, you should
check if next <= now.
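
A sketch of both checks (note that clock_nanosleep() returns the error number
directly, not -1 with errno; the helpers are written out so the snippet is
self-contained):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Returns non-zero if *a <= *b. */
static int timespec_le(const struct timespec *a, const struct timespec *b)
{
    return (a->tv_sec < b->tv_sec) ||
           (a->tv_sec == b->tv_sec && a->tv_nsec <= b->tv_nsec);
}

/* Adds ns (< 1e9) to *t, normalizing the nanosecond field. */
static void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    if (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec += 1;
    }
}

static void wait_for_period(struct timespec *next, long interval_ns)
{
    int ret = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, next, NULL);
    if (ret != 0 && ret != EINTR)
        fprintf(stderr, "clock_nanosleep: %s\n", strerror(ret));

    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    while (timespec_le(next, &now))          /* next <= now: we overran */
        timespec_add_ns(next, interval_ns);  /* skip the missed periods */
}
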

>           /*Get time to check jitter and execution time*/
>           clock_gettime(CLOCK_REALTIME, &start);
>           do_job();
>           /*Get time to check execution time*/
>           clock_gettime(CLOCK_REALTIME, &stop);
>           do_stat(); //jitter = start-next; exec_time = stop-start
>           next += interval;
>      }
> 
> }


> thread0(){
>     struct timespec next, interval = 250us;
> 
>      /* Initialization of the periodicity */
>      clock_gettime(CLOCK_REALTIME, &next);
>      next += interval;
> 
>      while(1){
>           /*Releases at specified rate*/
>           clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &next, NULL);
>           usleep(100);
>           /****************************************************************
>            * Without sleeping 100us, only the Latency of the other thread *
>            * (on cpu1) is improved.                                       *
>            * Sleeping 100us in this new 4000Hz thread (cpu0) improved     *
>            * the execution time of the other thread (on cpu1)...          *
>            ****************************************************************/

I'm not sure but it looks like the usleep() is implemented as a
busy loop. A busy loop of 100us isn't *that* bad but if all RT tasks
take around 950us then the scheduler will prevent all RT tasks from
running (in order to prevent a possible lockup).
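
(For reference, that throttling is controlled by the sched_rt_runtime_us /
sched_rt_period_us sysctls; the usual defaults are 950000 and 1000000, i.e.
RT tasks may use at most 95% of each period. A quick sketch for checking the
values on the board:)

#include <stdio.h>

static long read_long(const char *path)
{
    long val = -1;
    FILE *f = fopen(path, "r");

    if (f) {
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
    }
    return val;
}

int main(void)
{
    printf("rt_runtime_us=%ld rt_period_us=%ld\n",
           read_long("/proc/sys/kernel/sched_rt_runtime_us"),
           read_long("/proc/sys/kernel/sched_rt_period_us"));
    return 0;
}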

>           next += interval;
>      }
> 
> }
> 
> As you can see in “background_thread_on_core_0.png”, the Latency and the
> Execution time (of the thread on core1) are improved (in comparison with
> “no_background_thread.png”) when there is a new 4000Hz thread on cpu0
> AND when this thread does something...
> 
> I tried a lot of permutations and I do not understand:
> - If the new thread (cpu0) is at 5000Hz (>4000Hz), then observations
>   are the same (performance of the thread on cpu1 improves)
> - If the new thread is at 2000HZ (<4000Hz), then there is no improvement...
> 
> - If the new thread (4000Hz on cpu0) does something (even sleeping enough
>   time), then the Execution time of the thread on cpu1 improves.
> - If the new thread does nothing (or do too few stuff), then, ONLY the
>   Latency of the thread on cpu1 is improved...
> 
> Do you have any experience with that, any idea to debug?

- tracing to see if the scheduler puts another task on the CPU while your RT
  task is running (see the sketch after this list)
- tracing to see if an interrupt fires which prevents your task from
  running. This should only be the timer interrupt since everything else
  should be only on CPU0.
- check if any power management is on and try to disable it.
- as time basis, try not to use a random time but start with usec = 25
  or 50 or so. You should be able to avoid the HZ timer.
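
A sketch of the corresponding tracing setup, writing to the ftrace files
(this assumes debugfs is mounted at /sys/kernel/debug; check available_events
on your kernel for the exact event names):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void echo_to(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);

    if (fd >= 0) {
        write(fd, val, strlen(val));
        close(fd);
    }
}

int main(void)
{
    const char *t = "/sys/kernel/debug/tracing";
    char path[256];

    /* scheduler switches and all interrupt entry/exit events */
    snprintf(path, sizeof(path), "%s/events/sched/sched_switch/enable", t);
    echo_to(path, "1");
    snprintf(path, sizeof(path), "%s/events/irq/enable", t);
    echo_to(path, "1");
    /* start tracing; the application later writes '0' to stop it */
    snprintf(path, sizeof(path), "%s/tracing_on", t);
    echo_to(path, "1");
    return 0;
}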

> I wonder if the scheduler or the clock tick are bound to cpu0 and if it
> can play a role in the responsiveness of the thread on cpu1 (isolated one).
> 
> Thanks,
> 
> Regards,
> 
> Yann

Sebastian


* Re: RT-thread on cpu0 affects performance of RT-thread on isolated cpu1
  2018-03-02  9:43 ` Sebastian Andrzej Siewior
@ 2018-03-06 21:16   ` Yann le Chevoir
  2018-03-06 22:07     ` Julia Cartwright
  2018-03-08 17:21     ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 5+ messages in thread
From: Yann le Chevoir @ 2018-03-06 21:16 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-rt-users

[-- Attachment #1: Type: text/plain, Size: 11054 bytes --]

Hello,

Thank you for helping :)

>> I am an engineering student and I try to proof that a 4000Hz hard real-time
>> application can run on an ARM board rather than on a more powerful machine.
>> 
>> I work with an IMX6 dual-core and PREEMPT_RT patch-4.1.38-rt46.
>> I expected that my 4000Hz thread will perform better if it is the only one
>> on core1, so I put the boot argument isolcpus=1 and bound my thread to cpu1.
>> 
>> With the isolcpus=1, note that it remains these processes on core1:
>> 
>>    PID    PSR    RTPRIO    CMD
>>    16     1      99        [migration/1]
>>    17     1      -         [rcuc/1]
>>    18     1      1         [ktimersoftd/1]
>>    19     1      -         [ksoftirqd/1]
>>    20     1      99        [posixcputmr/1]
>>    21     1      -         [kworker/1:0]
>>    22     1      -         [kworker/1:0H]
>> 
>> I tried several permutations in my kernel configuration and boot args
>> (rcu_nocbs is an example) and none affected the results I describe below.
> 
> In general with isolcpus you should be able to get most task off CPU1.
> The rcu_nocbs option should get the RCU callbacks off CPU1. And then you
> can change the IRQ affinity to CPU0. But this looks good.

As I said in my previous message, I had deactivated those options because I was
not sure about them and did not observe any improvement when activating them.
I agree it is what I want, so I reactivated them and my kernel config now looks like:

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ_FULL_ALL=y
# CONFIG_NO_HZ_FULL_SYSIDLE is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# RCU Subsystem
#
...
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_NONE is not set
# CONFIG_RCU_NOCB_CPU_ZERO is not set
CONFIG_RCU_NOCB_CPU_ALL=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
# CONFIG_CPU_FREQ_STAT is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
...
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_GOV_POWERSAVE is not set
...
CONFIG_ARM_IMX6Q_CPUFREQ=y
...

#
# CPU Idle
#
# CONFIG_CPU_IDLE is not set
# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set

#
# Kernel Features
#
CONFIG_HZ_FIXED=0
CONFIG_HZ_100=y
...
CONFIG_HZ=100




>> I use a script to stress Linux. I expected that only cpu0 will be stressed
>> as cpu1 is isolated. But it has an impact on thread on cpu1 too.
>> I think it is normal.
> 
> The CPU-caches might be shared. Locks which which are held by CPU0 and
> required by CPU1 will also slow down CPU1.
> The sched_switch tracer should show you if anything of your "script to
> stress" migrates to CPU1 and/or if the "delays" CPU1. That would make
> your task on CPU1 go from state R to D and back to R.

Thank you, I had never used Linux kernel tracing before. It was not enabled in
my kernel, so I just enabled it. I have some results below.

>> First, as I draw it (in red) on “expected_behavior.png”, I expected much less
>> variations in the Latency and especially the Execution time.
>> (My thread always does the same thing).
> 
> You could try cyclictest and compare its latency. However from the plot
> it looks at about 25us so it is not that bad.
> 
>> How can we explain so much time variations? As I said, I tried to deactivate
>> all interrupts on cpu1 (rcu and others processes above) but I am not very
>> familiar with that.
>>
> As I suggested above, run sched_switch, measure the latency and if you
> hit an execution time of 150us then tell your application to disable the
> trace (a write '0' to the tracing_on sysfs file would do it) and then
> you can look at the trace and see if got interrupted by anything.
> 
>> Then, I am even more surprised when, trying to debug that, I decided to put
>> another thread on core0 and it improved the behavior of the thread on core1!
> 
> If you put more load on the system and the numbers improve then this
> might be because the caches are filled with the "right" data. Also it
> might prevent power management from doing its job.

About power management, as you can see in the kernel config, *it is enabled*
but *only the performance governor* is enabled.
Indeed, I already worked on that: I first deactivated it, but then my clock
was not at its maximum.
I got the max clock frequency back by doing this.
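
(For what it is worth, a small sketch of how the governor and the current
frequency could be double-checked at runtime, using the usual cpufreq sysfs
files:)

#include <stdio.h>

static void print_file(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");

    if (f) {
        if (fgets(buf, sizeof(buf), f))
            printf("%s: %s", path, buf);
        fclose(f);
    }
}

int main(void)
{
    print_file("/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor");
    print_file("/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq");
    return 0;
}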


>> My application looks like:

Thanks for reading the code :)
I now check if next <= now; see how I did it in the updated code.
I changed to CLOCK_MONOTONIC, thanks.
I try to do error checking!

> …
>> thread1(){
>> 
>>      struct timespec start, stop, next, interval = 250us;
>> 
>>      /* Initialization of the periodicity */
>>      clock_gettime(CLOCK_MONOTONIC, &next);
>>
>>      /* as time basis, try not to use a random time but start with usec = 25 *
>>       * or 50 or so. You should be able to avoid the HZ timer.               */
>>      next.tv_sec += 1;
>>      next.tv_nsec = 250000;

Is this what you expected me to do?

>>      next += interval;
>> 
>>      while(1){
>>           /*Releases at specified rate*/
>>           if(clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL) != 0){
>>                perror("error");
>>                exit(-1);
>>           }
>>           /*Get time to check jitter and execution time*/
>>           if(clock_gettime(CLOCK_MONOTONIC, &start) != 0){
>>                perror("error");
>>                exit(-1);
>>           }
>>           do_job();
>>           /*Get time to check execution time*/
>>           if(clock_gettime(CLOCK_MONOTONIC, &stop) != 0){
>>                perror("error");
>>                exit(-1);
>>           }
>>           do_stat(); //jitter = start-next; exec_time = stop-start
>>           while(next <= stop){
>>                next += interval;
>>           }
>>      }
>> 
>> }
>> 
>> 
>> thread0(){
>>     struct timespec next, interval = 250us;
>> 
>>      /* Initialization of the periodicity */
>>      clock_gettime(CLOCK_REALTIME, &next);
>>      next += interval;
>> 
>>      while(1){
>>           /*Releases at specified rate*/
>>           clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &next, NULL);
>>           usleep(100);
>>           /****************************************************************
>>            * Without sleeping 100us, only the Latency of the other thread *
>>            * (on cpu1) is improved.                                       *
>>            * Sleeping 100us in this new 4000Hz thread (cpu0) improved     *
>>            * the execution time of the other thread (on cpu1)...          *
>>            ****************************************************************/
> 
> I'm not sure but it looks like the usleep() is implemented as a
> busy loop. A busy loop of 100us isn't *that* bad but if all RT tasks
> take around 950us then the scheduler will prevent all RT tasks from
> running (in order to prevent a possible lockup).
> 
>>           next += interval;
>>      }
>> 
>> }
>> 
>> As you can see in “background_thread_on_core_0.png”, the Latency and the
>> Execution time (of the thread on core1) are improved (in comparison with
>> “no_background_thread.png”) when there is a new 4000Hz thread on cpu0
>> AND when this thread does something...
>> 
>> I tried a lot of permutations and I do not understand:
>> - If the new thread (cpu0) is at 5000Hz (>4000Hz), then observations
>>   are the same (performance of the thread on cpu1 improves)
>> - If the new thread is at 2000HZ (<4000Hz), then there is no improvement...
>> 
>> - If the new thread (4000Hz on cpu0) does something (even sleeping enough
>>   time), then the Execution time of the thread on cpu1 improves.
>> - If the new thread does nothing (or do too few stuff), then, ONLY the
>>   Latency of the thread on cpu1 is improved...
>> 
>> Do you have any experience with that, any idea to debug?
> 
> - tracing to see if the scheduler puts another task on while your RT is
>  running
> - tracing to see if an interrupts fires which prevents your task from
>  running. This should only be the timer interrupt since everything else
>  should be only on CPU0.

I analyse a trace below.
There is only the ktimersoftd/1 task, which comes back every 10000us.
I first had CONFIG_HZ=1000, so back then it was every 1000us.
I first thought it was responsible for my problems,
but in the end it seems not. I am not sure.

> - check if any power management is on and try to disable it.

As I said above, it is *on* but *only the performance governor is allowed*. Is that correct?

> - as time basis, try not to use a random time but start with usec = 25  
> or 50 or so. You should be able to avoid the HZ timer.

I tried to start with usec = 25. Is it correct in the code above?

Are you talking about the CONFIG_NO_HZ option?
I am confident it is what I want but, according to the documentation,
I am afraid:
   "POSIX CPU timers prevent CPUs from entering adaptive-tick mode.
   Real-time applications needing to take actions based on CPU time
   consumption need to use other means of doing so."
Indeed, I want to continue to use the POSIX CPU timers (sleep and gettime).

What's more, this test from Frederic fails:
https://kernel.googlesource.com/pub/scm/linux/kernel/git/frederic/dynticks-testing/+/9b39ab06956d484d5311e0678e1fba8104a8e2e3
According to this message, I cannot stop the tick:
user_loop-15296 [001] dn.....  8131.952509: tick_stop: success=no msg=more than 1 task in runqueue
Moreover, I do not see this message when tracing my application below.




I attached the trace (trace.txt) to this message.
The trace stopped when exec_time was more than 200us.

I "normalized" (put to 0) the timestamp two times to understand what
is going on.

What I understand:

Lines 5-8:   It is the clock_nanosleep handler.
             It takes 13us to wake up the thread. Good.
Line 9:      The thread took 60us to run. Very good.
Line 10:     Let's start again 250us later. Good.
Line 11:     tick_sched_timer... It will wake up the ktimersoftd/1.
             I am afraid :(
Line 15:     It took 35us to wake up the thread...
             Not so bad, sometimes it is more...
Line 16:     The thread took 68us to run. Ok.
Lines 17-19: ktimersoftd/1 took 17us to run (370-353).
             I think it is not such a problem. I am reassured :)
...
Line 69:     1000us, ok
Line 74:     1256us, 6us late :(
Line 77:     36us to wake up the thread, +6us = 42us jitter :( :(
Line 78:     ***255us (1547-1292) execution time !!!***
Line 79:     Overrun; next release at 1750us, 8us late...

Does this trace help?
As far as I can tell, there is nothing else on CPU1; there is just this ktimersoftd/1.
Can the idle task (on cpu0) be delayed by the stress on cpu0?
Can we migrate the Linux scheduler to cpu1?

Thanks again,

Yann

[-- Attachment #2: trace.txt --]
[-- Type: text/plain, Size: 9080 bytes --]

	.	
	.	
	.	
		
          <idle>-0     [001] d..h1..  	0	: irq_handler_entry: irq=16 name=twd
          <idle>-0     [001] d..h1..  	2	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864569754886
          <idle>-0     [001] dn.h1..  	7	: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] d...2..  	13	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
    test_preempt-5583  [001] d...2..  	73	: sched_switch: prev_comm=test_preempt prev_pid=5583 prev_prio=1 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d..h1..  	250	: irq_handler_entry: irq=16 name=twd
          <idle>-0     [001] d..h1..  	253	: hrtimer_expire_entry: hrtimer=bf7af990 function=tick_sched_timer now=1864570005553
          <idle>-0     [001] d..h1..  	259	: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [001] d..h1..  	263	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864570005553
          <idle>-0     [001] dn.h1..  	268	: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] d...2..  	285	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
    test_preempt-5583  [001] d...2..  	353	: sched_switch: prev_comm=test_preempt prev_pid=5583 prev_prio=1 prev_state=S ==> next_comm=ktimersoftd/1 next_pid=21 next_prio=98
   ktimersoftd/1-21    [001] .....11  	361	: softirq_entry: vec=1 [action=TIMER]
   ktimersoftd/1-21    [001] .....11  	364	: softirq_exit: vec=1 [action=TIMER]
   ktimersoftd/1-21    [001] d...2..  	370	: sched_switch: prev_comm=ktimersoftd/1 prev_pid=21 prev_prio=98 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d..h1..  	501	: irq_handler_entry: irq=16 name=twd
          <idle>-0     [001] d..h1..  	505	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864570257220
          <idle>-0     [001] dn.h1..  	512	: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] d...2..  	520	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
    test_preempt-5583  [001] d...2..  	605	: sched_switch: prev_comm=test_preempt prev_pid=5583 prev_prio=1 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d..h1..  	750	: irq_handler_entry: irq=16 name=twd
          <idle>-0     [001] d..h1..  	754	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864570506220
          <idle>-0     [001] dn.h1..  	760	: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] d...2..  	767	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
    test_preempt-5583  [001] d...2..  	841	: sched_switch: prev_comm=test_preempt prev_pid=5583 prev_prio=1 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d..h1..  	1000	: irq_handler_entry: irq=16 name=twd
          <idle>-0     [001] d..h1..  	1004	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864570755220
          <idle>-0     [001] dn.h1..  	1010	: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] d...2..  	1016	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
    test_preempt-5583  [001] d...2..  	1088	: sched_switch: prev_comm=test_preempt prev_pid=5583 prev_prio=1 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d..h1..  	1250	: irq_handler_entry: irq=16 name=twd
          <idle>-0     [001] d..h1..  	1254	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864571005886
          <idle>-0     [001] dn.h1..  	1259	: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] d...2..  	1265	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
		
	.	
	.	
	.	
		
          <idle>-0     [001] d..h1..  	0	: irq_handler_entry: irq=16 name=twd
          <idle>-0     [001] d..h1..  	3	: hrtimer_expire_entry: hrtimer=bf7af990 function=tick_sched_timer now=1864580005886
          <idle>-0     [001] d..h1..  	10	: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [001] d..h1..  	15	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864580005886
          <idle>-0     [001] dn.h1..  	21	: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] d...2..  	41	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
    test_preempt-5583  [001] d...2..  	123	: sched_switch: prev_comm=test_preempt prev_pid=5583 prev_prio=1 prev_state=S ==> next_comm=ktimersoftd/1 next_pid=21 next_prio=98
   ktimersoftd/1-21    [001] .....11  	135	: softirq_entry: vec=1 [action=TIMER]
   ktimersoftd/1-21    [001] .....11  	138	: softirq_exit: vec=1 [action=TIMER]
   ktimersoftd/1-21    [001] d...2..  	147	: sched_switch: prev_comm=ktimersoftd/1 prev_pid=21 prev_prio=98 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d..h1..  	251	: irq_handler_entry: irq=16 name=twd
          <idle>-0     [001] d..h1..  	255	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864580256553
          <idle>-0     [001] dn.h1..  	260	: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] d...2..  	267	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
    test_preempt-5583  [001] d...2..  	359	: sched_switch: prev_comm=test_preempt prev_pid=5583 prev_prio=1 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d..h1..  	500	: irq_handler_entry: irq=16 name=twd
          <idle>-0     [001] d..h1..  	504	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864580506220
          <idle>-0     [001] dn.h1..  	510	: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] d...2..  	519	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
    test_preempt-5583  [001] d...2..  	610	: sched_switch: prev_comm=test_preempt prev_pid=5583 prev_prio=1 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d..h1..  	750	: irq_handler_entry: irq=16 name=twd
          <idle>-0     [001] d..h1..  	754	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864580755886
          <idle>-0     [001] dn.h1..  	760	: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] d...2..  	769	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
    test_preempt-5583  [001] d...2..  	856	: sched_switch: prev_comm=test_preempt prev_pid=5583 prev_prio=1 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d..h1..  	1000	: irq_handler_entry: irq=16 name=twd
          <idle>-0     [001] d..h1..  	1004	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864581005886
          <idle>-0     [001] dn.h1..  	1010	: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] d...2..  	1017	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
    test_preempt-5583  [001] d...2..  	1110	: sched_switch: prev_comm=test_preempt prev_pid=5583 prev_prio=1 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d..h1..  	1256	: irq_handler_entry: irq=16 name=twd
          <idle>-0     [001] d..h1..  	1265	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864581264220
          <idle>-0     [001] dn.h1..  	1277	: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] d...2..  	1292	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
    test_preempt-5583  [001] d...2..  	1547	: sched_switch: prev_comm=test_preempt prev_pid=5583 prev_prio=1 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d..h1..  	1758	: irq_handler_entry: irq=16 name=twd
          <idle>-0     [001] d..h1..  	1770	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864581767220
          <idle>-0     [001] dn.h1..  	1788	: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] d...2..  	1808	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1

	exec_time > 200 : echo 0 > tracing_on
		


* Re: RT-thread on cpu0 affects performance of RT-thread on isolated cpu1
  2018-03-06 21:16   ` Yann le Chevoir
@ 2018-03-06 22:07     ` Julia Cartwright
  2018-03-08 17:21     ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 5+ messages in thread
From: Julia Cartwright @ 2018-03-06 22:07 UTC (permalink / raw)
  To: Yann le Chevoir; +Cc: Sebastian Andrzej Siewior, linux-rt-users, Haris Okanovic

On Tue, Mar 06, 2018 at 10:16:50PM +0100, Yann le Chevoir wrote:
> Hello,
> 
> Thank you for helping :)
> 
> >> I am an engineering student and I try to proof that a 4000Hz hard real-time
> >> application can run on an ARM board rather than on a more powerful machine.
> >> 
> >> I work with an IMX6 dual-core and PREEMPT_RT patch-4.1.38-rt46.
> >>
[..]
>           <idle>-0     [001] d..h1..  	0	: irq_handler_entry: irq=16 name=twd
>           <idle>-0     [001] d..h1..  	2	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864569754886
>           <idle>-0     [001] dn.h1..  	7	: irq_handler_exit: irq=16 ret=handled
>           <idle>-0     [001] d...2..  	13	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
>     test_preempt-5583  [001] d...2..  	73	: sched_switch: prev_comm=test_preempt prev_pid=5583 prev_prio=1 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
>           <idle>-0     [001] d..h1..  	250	: irq_handler_entry: irq=16 name=twd
>           <idle>-0     [001] d..h1..  	253	: hrtimer_expire_entry: hrtimer=bf7af990 function=tick_sched_timer now=1864570005553
>           <idle>-0     [001] d..h1..  	259	: softirq_raise: vec=1 [action=TIMER]
>           <idle>-0     [001] d..h1..  	263	: hrtimer_expire_entry: hrtimer=98117ef0 function=hrtimer_wakeup now=1864570005553
>           <idle>-0     [001] dn.h1..  	268	: irq_handler_exit: irq=16 ret=handled
>           <idle>-0     [001] d...2..  	285	: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=test_preempt next_pid=5583 next_prio=1
>     test_preempt-5583  [001] d...2..  	353	: sched_switch: prev_comm=test_preempt prev_pid=5583 prev_prio=1 prev_state=S ==> next_comm=ktimersoftd/1 next_pid=21 next_prio=98
>    ktimersoftd/1-21    [001] .....11  	361	: softirq_entry: vec=1 [action=TIMER]
>    ktimersoftd/1-21    [001] .....11  	364	: softirq_exit: vec=1 [action=TIMER]

Unfortunately, v4.1-rt doesn't currently support NO_HZ_FULL, because it
didn't contain the hrtimer rework necessary to do so w/ PREEMPT_RT.
This is what's preventing the tick from being deferred (as you've
indicated, it's still ticking away at HZ).  This is the primary problem.

A secondary problem is that it appears the issue Haris (CCed) has
been looking at applies further back than the hrtimer rework.  That is:
the TIMER_SOFTIRQ is unconditionally raised in the scheduler tick,
regardless of whether there is an armed timer in the queues.

   Julia


* Re: RT-thread on cpu0 affects performance of RT-thread on isolated cpu1
  2018-03-06 21:16   ` Yann le Chevoir
  2018-03-06 22:07     ` Julia Cartwright
@ 2018-03-08 17:21     ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-08 17:21 UTC (permalink / raw)
  To: Yann le Chevoir; +Cc: linux-rt-users

On 2018-03-06 22:16:50 [+0100], Yann le Chevoir wrote:
> Hello,
Hi,

> #
> # Timers subsystem
> #
> CONFIG_TICK_ONESHOT=y
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_HZ_PERIODIC is not set
> # CONFIG_NO_HZ_IDLE is not set
> CONFIG_NO_HZ_FULL=y
> CONFIG_NO_HZ_FULL_ALL=y

I am not sure NO_HZ_FULL is what you want. From what I know, NO_HZ_FULL
has a little more overhead when calling into the kernel but is
perfectly fine if you plan to stay in userland and hardly ever call into the
kernel. Your application seems to be interval-based and you do not stay
in userland for longer periods of time.
(Not to mention what Julia said, that NO_HZ_FULL is not working in the
old RT kernels.)

…
> #
> # CPU Frequency scaling
> #
> CONFIG_CPU_FREQ=y
> # CONFIG_CPU_FREQ_STAT is not set
> CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
> # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
> ...
> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
> # CONFIG_CPU_FREQ_GOV_POWERSAVE is not set
> ...
> CONFIG_ARM_IMX6Q_CPUFREQ=y
> ...
CPUFREQ may influence the latencies if it decides to switch the frequency
while you are in the middle of something.

> About power management, as you can see in the kernel conf, *it is enabled*
> but *only the performance mode* is enabled.
> Indeed, I already worked on that, I first deactivated it but then my clock
> was not at its maximum.
> I succeeded in getting the max clock frequency back by doing this.

What you want to avoid is the CPU speed being changed at runtime while
you measure your application. If that cannot happen, then you should be
good.

> >> thread1(){
> >> 
> >>      struct timespec start, stop, next, interval = 250us;
> >> 
> >>      /* Initialization of the periodicity */
> >>      clock_gettime(CLOCK_MONOTONIC, &next);
> >>
> >>      /* as time basis, try not to use a random time but start with usec = 25 *
> >>       * or 50 or so. You should be able to avoid the HZ timer.               */
> >>      next.tv_sec += 1;
> >>      next.tv_nsec = 250000;
> 
> Is it what you expected I did?

I said 25 or 50us, you did 250us, so no, I didn't expect that :)
The thing is, with HZ=100 you get 100 ticks per second - one every 10ms.
With an offset of 250us and an interval of 250us you end up programming
your timer at the same point in time as the timer which increments jiffies.
That means that your timer (and the other one) fire at the same time but
your application has to wait until jiffies has been incremented. If you
program your timer 25us *after* a full second then the HZ timer
increments the jiffy at 0 and 25us *after* that your application's
timer should fire.
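
In code, the offset could look like this (a sketch that follows the reasoning
above, i.e. it assumes the tick fires on the full second):

     struct timespec next;

     clock_gettime(CLOCK_MONOTONIC, &next);
     next.tv_sec += 1;          /* start at the next full second ...     */
     next.tv_nsec = 25000;      /* ... plus 25us (25000ns, not 250000ns) */

     while (1) {
          clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
          /* do the work, then advance by the 250us interval */
          next.tv_nsec += 250000;
          if (next.tv_nsec >= 1000000000) {
               next.tv_nsec -= 1000000000;
               next.tv_sec += 1;
          }
     }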

> I analyse a trace below.
> There is only the ktimersoftd/1 task which comes back every 10000us.
> I first had CONFIG_HZ=1000 so it was every 1000us then.
> I first thought it was responsible for my problems.
> But finally it seems not. I am not sure.

HZ=100 should be okay. The mod_timer() timers are less accurate with
HZ=100 than with HZ=1000. Your clock_nanosleep() is not affected by this
because it uses a highres timer.

> > - as time basis, try not to use a random time but start with usec = 25  
> > or 50 or so. You should be able to avoid the HZ timer.
> 
> I am confident it is what I want, but, according to the documentation,
> I am afraid:
>    "POSIX CPU timers prevent CPUs from entering adaptive-tick mode.
>    Real-time applications needing to take actions based on CPU time
>    consumption need to use other means of doing so."
> Indeed, I want to continue to use the POSIX CPU timer (sleep and gettime).

So maybe you don't want NO_HZ_FULL. As I explained above, you don't spend
your whole time in userland - you do things and sleep until the next
cycle.

> I attached the trace (trace.txt) to this message.
> The trace stopped when exec_time was more than 200us.
> 
> I "normalized" (put to 0) the timestamp two times to understand what
> is going on.
> 
> What I understand:
> 
> Lines 5-8:   It is the clock_nanosleep handler.
>              It takes 13us to wake up the thread. Good.
> Line 9:      The thread took 60us to run. Very good.
> Line 10:     Let's start again 250us later. Good.
> Line 11:     tick_sched_timer... It will wake up the ktimersoftd/1.
>              I am afraid :(
> Line 15:     It tooks 35us to wake up the thread...
>              Not so bad, sometimes it is more...
> Line 16:     The thread took 68us to run. Ok.
> Lines 17-19: ktimersoftd/1 took 17us to run (370-353).
>              I think it is not so embarrassing. I am reassured :) 
> ...
> Line 69:     1000us, ok
> Line 74:     1256us, 6us late :(
> Line 77:     36us to wake up the thread, +6us = 42us jitter :( :(
> Line 78:     ***255us (1547-1292) execution time !!!***
> Line 79:     Overrun, next rdv at 1750us, 8us late...
> 
> Does this trace help?
> According to me, there is nothing on CPU1. There is just this ktimersoftd/1.

So the idea was to add 25us to your timer's start so that it hopefully
fires after all of this.

> Can the idle task (on cpu0) be delayed by the stress on cpu0?
The idle task should ideally do nothing. It usually executes the "wfi" opcode,
which stands for "Wait For Interrupt". If something on CPU0 holds a lock
which is needed by CPU1, then CPU1 has to wait. Looking at CPU0 and CPU1 at
the same time, you could see what CPU0 is doing and then check whether this
might interact with CPU1.
I'm not sure why there is a 6us delay. Also, everything after that seems
to take a little longer than usual.

> Can we migrate the Linux scheduler to cpu1?
You have a runqueue per CPU, so in a sense you have a scheduler per CPU.
Since all your tasks on CPU0 have their affinity mask set to CPU0 only,
the scheduler on CPU0 should never try to migrate a task.

> Thanks again,
> 
> Yann

Sebastian


