* Real-time kernel thread performance and optimization
@ 2012-11-30 15:46 Simon Falsig
2012-11-30 22:31 ` Frank Rowand
0 siblings, 1 reply; 16+ messages in thread
From: Simon Falsig @ 2012-11-30 15:46 UTC (permalink / raw)
To: linux-rt-users
Hi,
Inspired by Thomas Gleixner's LinuxCon '12 appeal for more
communication/feedback/interaction from people using the preempt-RT patch,
here comes a rather long (and hopefully at least slightly interesting) set
of questions.
First of all, a bit of background. We have been using Linux and
preempt-RT on a custom ARM board for some years, and are currently in the
process of transitioning to a new AMD Fusion-based platform (also
custom-made, x86, 1.67 GHz dual-core). As both systems will stay in
production simultaneously for at least some time, we want to keep them
as similar as possible. For the new board, we have currently
settled on a 3.2.9 kernel with the rt16 patch (I can see that an rt17
patch has been released since we started though).
Our own system consists of a user-space application, communicating
with/over:
- Ethernet (for our GUI, which runs on a separate machine)
- Serial ports (various hardware)
- A set of custom kernel modules (implementing device drivers for some
custom I/O hardware)
For the kernel modules we have a utility timer module that allows other
modules to register a "poll" function, which is then run at a 10 ms cycle
rate. We want this to happen in real time, so the timer module is made as
an rt-thread using hrtimers (the implementation is new, as the existing
code from our old board used the ARM hardware-timer). The following code
is used:
// Timer callback for 10ms polling of rackbus devices
static enum hrtimer_restart bus_10ms_callback(struct hrtimer *val)
{
        struct custombus_device_driver *cbdrv, *next_cbdrv;
        ktime_t now = ktime_get();

        rt_mutex_lock(&list_10ms_mutex);
        list_for_each_entry_safe(cbdrv, next_cbdrv, &polling_10ms_list, poll_list) {
                driver_for_each_device(&cbdrv->driver, NULL, NULL,
                                       cbdrv->poll_function);
        }
        rt_mutex_unlock(&list_10ms_mutex);

        hrtimer_forward(&timer, now, kt);
        if (cancelCallback == 0) {
                return HRTIMER_RESTART;
        } else {
                return HRTIMER_NORESTART;
        }
}

// Thread to start 10ms timer
static int bus_rt_timer_init(void *arg)
{
        kt = ktime_set(0, 10 * 1000 * 1000);    // 10 ms = 10 * 1000 * 1000 ns
        cancelCallback = 0;
        hrtimer_init(&timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        timer.function = bus_10ms_callback;
        hrtimer_start(&timer, kt, HRTIMER_MODE_REL);

        return 0;
}

// Module initialization
int __init bus_timer_interrupt_init(void)
{
        struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };

        thread_10ms = kthread_create(bus_rt_timer_init, NULL, "bus_10ms");
        if (IS_ERR(thread_10ms)) {
                printk(KERN_ERR "Failed to create RT thread\n");
                return -ESRCH;
        }

        sched_setscheduler(thread_10ms, SCHED_FIFO, &param);

        wake_up_process(thread_10ms);

        printk(KERN_INFO "RT timer thread installed with priority %d.\n",
               param.sched_priority);
        return 0;
}
I currently have a single module registered for polling. The poll function
is:
static inline void read_input(struct Io1000 *b)
{
        u16 *input = &b->ibuf[b->in];

        *input = le16_to_cpu((inb(REG_INPUT_1) << 8));

        process();
}
The "inb" function reads a register on an FPGA, attached over the LPC bus.
The pseudocode "process" function is a placeholder for some filtering of
the read inputs, performing mostly memory access (some of this protected
by a spin lock, although the lock should never be locked during the tests,
as there isn't anything else accessing it), and calling the kernel
"wake_up" function on the wait_queue containing our data.
To measure performance of the system, I've implemented a simple ChipScope
core in the FPGA, allowing me to count the number of cycles where the
period deviates above or below the desired 10 ms, and to store the maximum
period seen.
All this works just fine on an unloaded system. I'm consistently getting
cycle times very close to the 10 ms, with a range of 9.7 ms - 10.3 ms.
Once I start loading the system with various stress tests, I am getting
ranges of about 9.0 ms - 18.0 ms. I have however also seen rare 50-70 ms
spikes, typically when starting the stress loads, but they don't seem to
be repeatable.
My stress loads are (inspired by Ingo Molnar's dohell script
(https://lkml.org/lkml/2005/6/22/347)):
while true; do killall hackbench; sleep 5; done &
while true; do ./hackbench 20; done &
du / &
./dortc &
./serialspammer &
In addition to this, I'm also doing an external ping flood. The
serialspammer application basically just spams both our serial ports with
data (I've hardwired a physical loop-back to them), not because it's a lot
of data (at roughly 115 kbps), but mostly because the serial chip is on the same LPC
bus as the FPGA. As our userspace application runs just fine on a 180 MHz
ARM, it only presents a very light load to our new platform. The used
stress loads should thus represent a very heavy load compared to what we
expect to see during normal operation.
Question 1:
- I'm rather content with the current performance, but I'd still like to
know if there is anything obvious (or anything obvious missing) in the
posted code that could be improved for better performance? I can see that
it is recommended to prefault and lock the used memory, but I haven't been
able to find anything about how to do this in a kernel thread?
Question 2:
- Are latency spikes to be expected when starting the above stress loads?
Question 3:
- As far as I can see spinlocks use priority inheritance - so I presume
that our spinlock calls from within our RT-thread should not pose a
potential major problem? According to
https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application
though, it seems that both spinlocks and "wake_up" are no-go's when called
in interrupt contexts - does the same apply to our timer context? (I've
had the "process" call commented out, without any seemingly noticeable
change in performance.)
Bonus-question:
- Additionally, I've tried running cyclictest alongside with all the
above, and it actually performs rather well, without any substantial
spikes. A strange thing, though, is that the results are actually better
with load than without? (running with -t1 -p 80 -n -i 10000 -l 10000)
- Loaded: Min: 16, Avg: 41, Max: 177
- No load: Min: 16, Avg: 97, Max: 263
Once I get this finished up, I'll be happy to do a complete write-up of
the timer-thread code, if anyone is interested. I remember looking for
something similar (but without success), when I wrote the code earlier
this year.
In any case, all kinds of answers or comments are welcome.
Thanks in advance!
Best regards,
Simon Falsig
* Re: Real-time kernel thread performance and optimization
2012-11-30 15:46 Real-time kernel thread performance and optimization Simon Falsig
@ 2012-11-30 22:31 ` Frank Rowand
2012-12-03 12:39 ` Simon Falsig
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Frank Rowand @ 2012-11-30 22:31 UTC (permalink / raw)
To: Simon Falsig; +Cc: linux-rt-users@vger.kernel.org
On 11/30/12 07:46, Simon Falsig wrote:
> Hi,
>
> Inspired by Thomas Gleixners LinuxCon '12 appeal for more
> communication/feedback/interaction from people using the preempt-RT patch,
> here comes a rather long (and hopefully at least slightly interesting) set
> of questions.
>
> First of all, a bit of background. We have been using Linux and
> preempt-RT on a custom ARM board for some years, and are currently in the
> process of transitioning to a new AMD Fusion-based platform (also
> custom-made, x86, 1.67 GHz dual-core). As we want to keep both systems in
> production simultaneous for at least some time, we want to keep the
> systems as similar as possible. For the new board, we have currently
> settled on a 3.2.9 kernel with the rt16 patch (I can see that an rt17
> patch has been released since we started though).
>
> Our own system consists of a user-space application, communicating
> with/over:
> - Ethernet (for our GUI, which runs on a separate machine)
> - Serial ports (various hardware)
> - A set of custom kernel modules (implementing device drivers for some
> custom I/O hardware)
>
> For the kernel modules we have a utility timer module, that allows other
> modules to register a "poll" function, which is then run at a 10 ms cycle
> rate. We want this to happen in real-time, so the timer module is made as
> an rt-thread using hrtimers (the implementation is new, as the existing
> code from our old board used the ARM hardware-timer). The following code
> is used:
>
> // Timer callback for 10ms polling of rackbus devices
> static enum hrtimer_restart bus_10ms_callback(struct hrtimer *val) {
> struct custombus_device_driver *cbdrv, *next_cbdrv;
> ktime_t now = ktime_get();
>
> rt_mutex_lock(&list_10ms_mutex);
>
> list_for_each_entry_safe(cbdrv,next_cbdrv,&polling_10ms_list,poll_list) {
> driver_for_each_device(&cbdrv->driver, NULL, NULL,
> cbdrv->poll_function);
> }
> rt_mutex_unlock(&list_10ms_mutex);
>
> hrtimer_forward(&timer, now, kt);
> if(cancelCallback == 0) {
> return HRTIMER_RESTART;
> }
> else {
> return HRTIMER_NORESTART;
> }
> }
>
> // Thread to start 10ms timer
> static int bus_rt_timer_init(void *arg) {
> kt = ktime_set(0, 10 * 1000 * 1000);    // 10 ms = 10 * 1000 * 1000 ns
> cancelCallback = 0;
> hrtimer_init(&timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> timer.function = bus_10ms_callback;
> hrtimer_start(&timer, kt, HRTIMER_MODE_REL);
>
> return 0;
> }
>
> // Module initialization
> int __init bus_timer_interrupt_init(void) {
> struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
>
> thread_10ms = kthread_create(bus_rt_timer_init, NULL, "bus_10ms");
> if (IS_ERR(thread_10ms)) {
> printk(KERN_ERR "Failed to create RT thread\n");
> return -ESRCH;
> }
>
> sched_setscheduler(thread_10ms, SCHED_FIFO, &param);
>
> wake_up_process(thread_10ms);
>
> printk(KERN_INFO "RT timer thread installed with priority %d.\n",
> param.sched_priority);
> return 0;
> }
I don't understand why you create a kernel thread to execute
bus_rt_timer_init(). That thread sets up your timer
and then immediately exits. Is there a reason you can't
just move the contents of bus_rt_timer_init() into
bus_timer_interrupt_init() and avoid creating the thread?
>
>
> I currently have a single module registered for polling. The poll function
> is:
>
> static inline void read_input(struct Io1000 *b)
> {
> u16 *input = &b->ibuf[b->in];
>
> *input = le16_to_cpu((inb(REG_INPUT_1) << 8));
>
> process();
> }
>
>
> The "inb" function reads a register on an FPGA, attached over the LPC bus.
> The pseudocode "process" function is a placeholder for some filtering of
> the read inputs, performing mostly memory access (some of this protected
> by a spin lock, although the lock should never be locked during the tests,
> as there isn't anything else accessing it), and calling the kernel
> "wake_up" function on the wait_queue containing our data.
> To measure performance of the system, I've implemented a simple ChipScope
> core in the FPGA, allowing me to count the number of cycles where the
> period deviates above or below the desired 10 ms, and to store the maximum
> period seen.
>
> All this works just fine on an unloaded system. I'm consistently getting
> cycle times very close to the 10 ms, with a range of 9.7 ms - 10.3 ms.
>
> Once I start loading the system with various stress tests, I am getting
> ranges of about 9.0 ms - 18.0 ms. I have however also seen rare 50-70 ms
> spikes, typically when starting the stress loads, but they don't seem to
> be repeatable.
> My stress loads are (inspired from Ingo Molnars dohell script
> (https://lkml.org/lkml/2005/6/22/347)):
>
> while true; do killall hackbench; sleep 5; done &
> while true; do ./hackbench 20; done &
> du / &
> ./dortc &
> ./serialspammer &
>
> In addition to this, I'm also doing an external ping flood. The
> serialspammer application basically just spams both our serial ports with
> data (I've hardwired a physical loop-back to them), not because it's a lot
> of data (at 115000kbps), but mostly as the serial chip is on the same LPC
> bus as the FPGA. As our userspace application runs just fine on a 180 MHz
> ARM, it only presents a very light load to our new platform. The used
> stress loads should thus represent a very heavy load compared to what we
> expect to see during normal operation.
>
>
> Question 1:
> - I'm rather content with the current performance, but I'd still like to
> know if there is anything obvious (or anything obvious missing) in the
> posted code that could be improved for better performance? I can see that
> it is recommended to prefault and lock the used memory, but I haven't been
> able to find anything about how to do this in a kernel thread?
Kernel memory does not get swapped, so access to it does not result in
a "major fault". Access to kernel memory can result in a "minor fault"
(tlb miss), but that is not prevented by locking memory. So you do not
need to worry about prefaulting and locking memory in a kernel thread.
The memory issue that a kernel thread may need to worry about is allocating
kernel memory. The short answer is to not allocate kernel memory from
your kernel thread while it is in the real time domain. Also, do not
call other parts of the kernel that allocate kernel memory.
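To sketch what I mean (the names here are made up, not from your driver): do
any allocation once at init time, and only reuse the buffer from inside the
thread.

#include <linux/types.h>
#include <linux/slab.h>
#include <linux/kthread.h>

#define SAMPLE_BUF_LEN 256                /* made-up size for the example */

static u16 *sample_buf;

static int my_module_init(void)
{
        /* GFP_KERNEL allocations may sleep and take allocator locks,
         * so do them here, before the real-time loop starts. */
        sample_buf = kcalloc(SAMPLE_BUF_LEN, sizeof(*sample_buf), GFP_KERNEL);
        if (!sample_buf)
                return -ENOMEM;
        /* ... create and wake the SCHED_FIFO thread as you already do ... */
        return 0;
}

static int my_rt_thread(void *arg)
{
        while (!kthread_should_stop()) {
                /* Only reuse the pre-allocated sample_buf here: no kmalloc(),
                 * no kfree(), and no calls into code paths that might
                 * allocate on your behalf. */
                /* ... poll the hardware into sample_buf, then sleep until
                 *     the next period ... */
        }
        return 0;
}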
> Question 2:
> - Are latency spikes to be expected when starting the above stress loads?
Yes, latency spikes are possible when starting processes (if I remember
correctly, this is related to locking).
> Question 3:
> - As far as I can see spinlocks use priority inheritance - so I presume
> that our spinlock calls from within our RT-thread should not pose a
> potential major problem? According to
> https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application
> though, it seems that both spinlocks and "wake_up" are no-go's when called
> in interrupt contexts - does the same apply to our timer context? (I've
> had the "process" call commented out, without any seemingly noticeable
> change in performance.)
Your hrtimer function bus_10ms_callback() is called from the hrtimer
softirq, so it needs to follow softirq rules. (At least in 3.6.7-rt18,
which is not the version you are using...)
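If you want to keep the heavier work out of the softirq altogether, one common
pattern is to let the callback do nothing but rearm itself and wake a dedicated
SCHED_FIFO thread, which then does the list walk, rt_mutex and wake_up() in
plain thread context. A rough sketch (names made up, not from your driver):

#include <linux/hrtimer.h>
#include <linux/kthread.h>
#include <linux/ktime.h>
#include <linux/sched.h>

static struct task_struct *poll_task;
static struct hrtimer poll_timer;
static ktime_t poll_period;     /* e.g. ktime_set(0, 10 * 1000 * 1000), armed at init */

static enum hrtimer_restart poll_timer_cb(struct hrtimer *t)
{
        /* Softirq context: keep it minimal. wake_up_process() is safe from
         * any context; the heavy lifting happens in the thread below. */
        wake_up_process(poll_task);
        hrtimer_forward_now(t, poll_period);
        return HRTIMER_RESTART;
}

static int poll_thread_fn(void *arg)
{
        while (!kthread_should_stop()) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                schedule();                     /* sleep until the timer wakes us */
                __set_current_state(TASK_RUNNING);

                /* Thread context: rt_mutex_lock(), driver_for_each_device()
                 * and wake_up() on your own wait queues are all fine here. */
        }
        return 0;
}

Note that if the work ever overruns into the next timer tick, a wakeup can be
missed with this minimal sketch; a flag plus wait_event() closes that window.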
> Bonus-question:
> - Additionally, I've tried running cyclictest alongside with all the
> above, and it actually performs rather well, without any substantial
> spikes. A strange thing is though, that the results are actually better
> with load than without? (running with -t1 -p 80 -n -i 10000 -l 10000)
> - Loaded: Min: 16, Avg: 41, Max: 177
> - No load: Min: 16, Avg: 97, Max: 263
If the system is less loaded, then the idle thread might be able to
enter deeper levels of sleep. Deeper levels of sleep have larger
latencies to exit. You would have to look at your processor specific
values for exiting sleep states to see if this is sufficient to explain
the difference.
> Once I get this finished up, I'll be happy to do a complete write-up of
> the timer-thread code, if anyone is interested. I remember looking for
> something similar (but without success), when I wrote the code earlier
> this year.
It would be very useful to add your results to the wiki.
-Frank
* RE: Real-time kernel thread performance and optimization
2012-11-30 22:31 ` Frank Rowand
@ 2012-12-03 12:39 ` Simon Falsig
2012-12-03 14:15 ` Carsten Emde
` (2 subsequent siblings)
3 siblings, 0 replies; 16+ messages in thread
From: Simon Falsig @ 2012-12-03 12:39 UTC (permalink / raw)
To: frank.rowand; +Cc: linux-rt-users
Hi Frank,
Thanks for the quick and interesting reply! I'm out of the office for the
next couple of days, but will take a closer look at it once I get back
later this week.
Regards,
Simon
* Re: Real-time kernel thread performance and optimization
2012-11-30 22:31 ` Frank Rowand
2012-12-03 12:39 ` Simon Falsig
@ 2012-12-03 14:15 ` Carsten Emde
2012-12-11 14:43 ` Simon Falsig
2012-12-19 14:59 ` John Kacur
2012-12-11 14:30 ` Simon Falsig
2012-12-12 15:39 ` Simon Falsig
3 siblings, 2 replies; 16+ messages in thread
From: Carsten Emde @ 2012-12-03 14:15 UTC (permalink / raw)
To: frank.rowand; +Cc: Simon Falsig, linux-rt-users@vger.kernel.org
On 11/30/2012 11:31 PM, Frank Rowand wrote:
> On 11/30/12 07:46, Simon Falsig wrote:
>> [..]
>> Bonus-question:
>> - Additionally, I've tried running cyclictest alongside with all the
>> above, and it actually performs rather well, without any substantial
>> spikes. A strange thing is though, that the results are actually better
>> with load than without? (running with -t1 -p 80 -n -i 10000 -l 10000)
>> - Loaded: Min: 16, Avg: 41, Max: 177
>> - No load: Min: 16, Avg: 97, Max: 263
>
> If the system is less loaded, then the idle thread might be able to
> enter deeper levels of sleep. Deeper levels of sleep have larger
> latencies to exit. You would have to look at your processor specific
> values for exiting sleep states to see if this is sufficient to explain
> the difference.
If you are running a half-decent version of cyclictest, sleep states are
generally disabled while cyclictest is running. Please watch for the line
# /dev/cpu_dma_latency set to 0us
which essentially documents this mechanism. Yes, the name of the variable
"cpu_dma_latency" is not obvious, and cyclictest could do a better job by
writing something like
"Wrote 0 to /dev/cpu_dma_latency and keeping the path open to prevent
all cores from entering any sleep state"
but that is another story.
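By the way, the same mechanism is available to any application, not just
cyclictest. A minimal userspace sketch (error handling trimmed, not taken from
the cyclictest sources):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
        int32_t latency = 0;    /* 0 us: forbid all sleep states while the fd is open */
        int fd = open("/dev/cpu_dma_latency", O_WRONLY);

        if (fd < 0)
                return 1;
        if (write(fd, &latency, sizeof(latency)) != sizeof(latency))
                return 1;

        /* The constraint only holds while the file stays open, so keep the
         * descriptor around for the lifetime of the real-time application. */
        pause();                /* placeholder for the real application's work */
        close(fd);
        return 0;
}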
A patch that was merged into 3.7 makes it possible to individually enable or
disable sleep states of the ladder governor
(http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=62d6ae880e3e76098d5e345decd2dce443975889).
It applies smoothly to 3.6-RT as well. This allows the sleep states to be
fine-tuned per state and per core, whereas the /dev/cpu_dma_latency mechanism
acts on all states and cores. For example, to disable sleep state 2 and all
deeper states of the ladder governor on core #0, use:
echo 1 >/sys/devices/system/cpu/cpu0/cpuidle/state2/disable
BTW: To analyze how much time a core spent in a specific sleep state,
simply look repeatedly at the "time" variable of a core's sleep state,
e.g. for core #0:
# for i in /sys/devices/system/cpu/cpu0/cpuidle/state[0-4]
> do
> echo -e "`cat $i/name`:\t`cat $i/time`"
> done
POLL: 1342984105
C1-IVB: 737109
C3-IVB: 3852451
C6-IVB: 1702683112
C7-IVB: 4366946606
While cyclictest is running with /dev/cpu_dma_latency set to 0, only the
POLL state times are increasing.
-Carsten.
* RE: Real-time kernel thread performance and optimization
2012-11-30 22:31 ` Frank Rowand
2012-12-03 12:39 ` Simon Falsig
2012-12-03 14:15 ` Carsten Emde
@ 2012-12-11 14:30 ` Simon Falsig
2012-12-17 22:18 ` Frank Rowand
2012-12-12 15:39 ` Simon Falsig
3 siblings, 1 reply; 16+ messages in thread
From: Simon Falsig @ 2012-12-11 14:30 UTC (permalink / raw)
To: frank.rowand; +Cc: linux-rt-users
I've finally had time to give this a further look. After reading up on
some kernel internals, I decided to try and reimplement the timer system
using the usleep_range() call instead of the hrtimer/callback functions.
I've arrived at the following:
static int bus_rt_timer_thread(void *arg)
{
        struct custombus_device_driver *cbdrv, *next_cbdrv;

        cancelCallback = 0;
        printk(KERN_INFO "Entering RT loop\n");

        ktime_t startTime = ktime_get();
        while (cancelCallback == 0) {
                rt_mutex_lock(&list_10ms_mutex);
                list_for_each_entry_safe(cbdrv, next_cbdrv, &polling_10ms_list, poll_list) {
                        driver_for_each_device(&cbdrv->driver, NULL, NULL,
                                               cbdrv->poll_function);
                }
                rt_mutex_unlock(&list_10ms_mutex);

                s64 timeTaken_us = ktime_us_delta(ktime_get(), startTime);
                if (timeTaken_us < 9900) {
                        usleep_range(9900 - timeTaken_us, 10100 - timeTaken_us);
                }
                startTime = ktime_get();
        }
        printk(KERN_INFO "RT Exited\n");
        return 0;
}

int __init bus_timer_interrupt_init(void)
{
        struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };

        thread_10ms = kthread_create(bus_rt_timer_thread, NULL, "bus_10ms");
        if (IS_ERR(thread_10ms)) {
                printk(KERN_ERR "Failed to create RT thread\n");
                return -ESRCH;
        }

        sched_setscheduler(thread_10ms, SCHED_FIFO, &param);
        wake_up_process(thread_10ms);

        printk(KERN_INFO "RT timer thread installed with standard priority %d.\n",
               param.sched_priority);
        return 0;
}
This is not only simpler than the previous implementation, it also
performs better. Results of a 30-minute stress-test:
Old implementation:
Cycles over 10.3 ms: 3144
Cycles under 9.7 ms: 3852
Max. cycletime: ~56.5 ms
New implementation:
Cycles over 10.3 ms: 26
Cycles under 9.7 ms: 0
Max. cycletime: ~10.4 ms
So all in all, much much better.
As far as I have found out, usleep_range() also uses hrtimers (like my
previous implementation), so I'd be interested in knowing where the main
difference between the two implementations lies. I'd guess that it's
somehow related to priorities?
Additional answers/comments below:
> -----Original Message-----
> From: Frank Rowand [mailto:frank.rowand@am.sony.com]
> Sent: 30. november 2012 23:31
> To: Simon Falsig
> Cc: linux-rt-users@vger.kernel.org
> Subject: Re: Real-time kernel thread performance and optimization
>
> On 11/30/12 07:46, Simon Falsig wrote:
> > Hi,
> >
> > Inspired by Thomas Gleixners LinuxCon '12 appeal for more
> > communication/feedback/interaction from people using the preempt-RT
> > patch, here comes a rather long (and hopefully at least slightly
> > interesting) set of questions.
> >
> > First of all, a bit of background. We have been using Linux and
> > preempt-RT on a custom ARM board for some years, and are currently in
> > the process of transitioning to a new AMD Fusion-based platform (also
> > custom-made, x86, 1.67 GHz dual-core). As we want to keep both systems
> > in production simultaneous for at least some time, we want to keep the
> > systems as similar as possible. For the new board, we have currently
> > settled on a 3.2.9 kernel with the rt16 patch (I can see that an rt17
> > patch has been released since we started though).
> >
> > Our own system consists of a user-space application, communicating
> > with/over:
> > - Ethernet (for our GUI, which runs on a separate machine)
> > - Serial ports (various hardware)
> > - A set of custom kernel modules (implementing device drivers for
> > some custom I/O hardware)
> >
> > For the kernel modules we have a utility timer module, that allows
> > other modules to register a "poll" function, which is then run at a 10
> > ms cycle rate. We want this to happen in real-time, so the timer
> > module is made as an rt-thread using hrtimers (the implementation is
> > new, as the existing code from our old board used the ARM
> > hardware-timer). The following code is used:
> >
> > // Timer callback for 10ms polling of rackbus devices
> > static enum hrtimer_restart bus_10ms_callback(struct hrtimer *val)
> > {
> >         struct custombus_device_driver *cbdrv, *next_cbdrv;
> >         ktime_t now = ktime_get();
> >
> >         rt_mutex_lock(&list_10ms_mutex);
> >         list_for_each_entry_safe(cbdrv, next_cbdrv, &polling_10ms_list, poll_list) {
> >                 driver_for_each_device(&cbdrv->driver, NULL, NULL,
> >                                        cbdrv->poll_function);
> >         }
> >         rt_mutex_unlock(&list_10ms_mutex);
> >
> >         hrtimer_forward(&timer, now, kt);
> >         if (cancelCallback == 0) {
> >                 return HRTIMER_RESTART;
> >         } else {
> >                 return HRTIMER_NORESTART;
> >         }
> > }
> >
> > // Thread to start 10ms timer
> > static int bus_rt_timer_init(void *arg)
> > {
> >         kt = ktime_set(0, 10 * 1000 * 1000);    // 10 ms = 10 * 1000 * 1000 ns
> >         cancelCallback = 0;
> >         hrtimer_init(&timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> >         timer.function = bus_10ms_callback;
> >         hrtimer_start(&timer, kt, HRTIMER_MODE_REL);
> >
> >         return 0;
> > }
> >
> > // Module initialization
> > int __init bus_timer_interrupt_init(void)
> > {
> >         struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
> >
> >         thread_10ms = kthread_create(bus_rt_timer_init, NULL, "bus_10ms");
> >         if (IS_ERR(thread_10ms)) {
> >                 printk(KERN_ERR "Failed to create RT thread\n");
> >                 return -ESRCH;
> >         }
> >
> >         sched_setscheduler(thread_10ms, SCHED_FIFO, &param);
> >
> >         wake_up_process(thread_10ms);
> >
> >         printk(KERN_INFO "RT timer thread installed with priority %d.\n",
> >                param.sched_priority);
> >         return 0;
> > }
>
> I don't understand why you create a kernel thread to execute
> bus_rt_timer_init(). That thread sets up your timer and then immediately
> exits. Is there a reason you can't just move the contents of
> bus_rt_timer_init() into
> bus_timer_interrupt_init() and avoid creating the thread?
My original idea was that starting the timer from a high-priority thread
would also cause the timer to run at a higher priority (which I'm pretty
sure I read somewhere at the time I wrote the code) - although that may of
course have been a misunderstanding on my side, or a feature that has been
changed since.
Commenting out the sched_setscheduler() call in bus_timer_interrupt_init()
also seemed to give me worse performance, although not drastically.
>
> >
> >
> > I currently have a single module registered for polling. The poll
> > function
> > is:
> >
> > static inline void read_input(struct Io1000 *b) {
> > u16 *input = &b->ibuf[b->in];
> >
> > *input = le16_to_cpu((inb(REG_INPUT_1) << 8));
> >
> > process();
> > }
> >
> >
> > The "inb" function reads a register on an FPGA, attached over the LPC
bus.
> > The pseudocode "process" function is a placeholder for some filtering
> > of the read inputs, performing mostly memory access (some of this
> > protected by a spin lock, although the lock should never be locked
> > during the tests, as there isn't anything else accessing it), and
> > calling the kernel "wake_up" function on the wait_queue containing our
> data.
> > To measure performance of the system, I've implemented a simple
> > ChipScope core in the FPGA, allowing me to count the number of cycles
> > where the period deviates above or below the desired 10 ms, and to
> > store the maximum period seen.
> >
> > All this works just fine on an unloaded system. I'm consistently
> > getting cycle times very close to the 10 ms, with a range of 9.7 ms - 10.3 ms.
> >
> > Once I start loading the system with various stress tests, I am
> > getting ranges of about 9.0 ms - 18.0 ms. I have however also seen
> > rare 50-70 ms spikes, typically when starting the stress loads, but
> > they don't seem to be repeatable.
> > My stress loads are (inspired from Ingo Molnars dohell script
> > (https://lkml.org/lkml/2005/6/22/347)):
> >
> > while true; do killall hackbench; sleep 5; done &
> > while true; do ./hackbench 20; done &
> > du / &
> > ./dortc &
> > ./serialspammer &
> >
> > In addition to this, I'm also doing an external ping flood. The
> > serialspammer application basically just spams both our serial ports
> > with data (I've hardwired a physical loop-back to them), not because
> > it's a lot of data (at 115000kbps), but mostly as the serial chip is
> > on the same LPC bus as the FPGA. As our userspace application runs
> > just fine on a 180 MHz ARM, it only presents a very light load to our
> > new platform. The used stress loads should thus represent a very heavy
> > load compared to what we expect to see during normal operation.
> >
> >
> > Question 1:
> > - I'm rather content with the current performance, but I'd still like
> > to know if there is anything obvious (or anything obvious missing) in
> > the posted code that could be improved for better performance? I can
> > see that it is recommended to prefault and lock the used memory, but I
> > haven't been able to find anything about how to do this in a kernel thread?
>
> Kernel memory does not get swapped, so access to it does not result in a
> "major fault". Access to kernel memory can result in a "minor fault"
> (tlb miss), but that is not prevented by locking memory. So you do not
> need to worry about prefaulting and locking memory in a kernel thread.
>
> The memory issue that a kernel thread may need to worry about is
> allocating kernel memory. The short answer is to not allocate kernel
> memory from your kernel thread while it is in the real time domain.
> Also, do not call other parts of the kernel that allocate kernel memory.
Sounds good, it seems I'm in the clear with regard to that then. Thanks
for the info!
>
> > Question 2:
> > - Are latency spikes to be expected when starting the above stress loads?
>
> Yes, latency spikes are possible when starting processes (if I remember
> correctly, this is related to locking).
>
Sounds reasonable - we aren't starting any processes under normal runtime
circumstances, so this shouldn't be much of a problem. And it doesn't seem
to be an issue under the new implementation in any case.
> > Question 3:
> > - As far as I can see spinlocks use priority inheritance - so I
> > presume that our spinlock calls from within our RT-thread should not
> > pose a potential major problem? According to
> > https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application
> > though, it seems that both spinlocks and "wake_up" are no-go's when
> > called in interrupt contexts - does the same apply to our timer
> > context? (I've had the "process" call commented out, without any
> > seemingly noticeable change in performance.)
>
> Your hrtimer function bus_10ms_callback() is called from the hrtimer
> softirq, so it needs to follow softirq rules. (At least in 3.6.7-rt18,
> which is not the version you are using...)
>
Again, thanks for the info.
> > Bonus-question:
> > - Additionally, I've tried running cyclictest alongside with all the
> > above, and it actually performs rather well, without any substantial
> > spikes. A strange thing is though, that the results are actually
> > better with load than without? (running with -t1 -p 80 -n -i 10000 -l
> > 10000)
> > - Loaded: Min: 16, Avg: 41, Max: 177
> > - No load: Min: 16, Avg: 97, Max: 263
>
> If the system is less loaded, then the idle thread might be able to
> enter deeper levels of sleep. Deeper levels of sleep have larger
> latencies to exit. You would have to look at your processor specific
> values for exiting sleep states to see if this is sufficient to explain
> the difference.
>
This was my initial suspicion also. Our current board is a bit dodgy with
regard to the processor P states though (apparently they aren't yet fully
implemented in the BIOS we're using), so I'm not really inclined to spend
too much effort investigating this before I'm certain that everything is
as it should be.
> > Once I get this finished up, I'll be happy to do a complete write-up
> > of the timer-thread code, if anyone is interested. I remember looking
> > for something similar (but without success), when I wrote the code
> > earlier this year.
>
> It would be very useful to add your results to the wiki.
>
> -Frank
Cool - is there any particular place it should go? A how-to, FAQ entry,
etc? Just so I know how to do the write-up...
All in all, thanks for a very useful reply! Any further comments or
similar are of course welcome.
Best regards,
Simon
* RE: Real-time kernel thread performance and optimization
2012-12-03 14:15 ` Carsten Emde
@ 2012-12-11 14:43 ` Simon Falsig
2012-12-19 8:10 ` Carsten Emde
2012-12-19 14:59 ` John Kacur
1 sibling, 1 reply; 16+ messages in thread
From: Simon Falsig @ 2012-12-11 14:43 UTC (permalink / raw)
To: Carsten Emde, frank.rowand; +Cc: linux-rt-users
> -----Original Message-----
> From: Carsten Emde [mailto:C.Emde@osadl.org]
> Sent: 3. december 2012 15:16
> To: frank.rowand@am.sony.com
> Cc: Simon Falsig; linux-rt-users@vger.kernel.org
> Subject: Re: Real-time kernel thread performance and optimization
>
> On 11/30/2012 11:31 PM, Frank Rowand wrote:
> > On 11/30/12 07:46, Simon Falsig wrote:
> >> [..]
> >> Bonus-question:
> >> - Additionally, I've tried running cyclictest alongside with all
> >> the above, and it actually performs rather well, without any
> >> substantial spikes. A strange thing is though, that the results are
> >> actually better with load than without? (running with -t1 -p 80 -n -i 10000 -l 10000)
> >> - Loaded: Min: 16, Avg: 41, Max: 177
> >> - No load: Min: 16, Avg: 97, Max: 263
> >
> > If the system is less loaded, then the idle thread might be able to
> > enter deeper levels of sleep. Deeper levels of sleep have larger
> > latencies to exit. You would have to look at your processor specific
> > values for exiting sleep states to see if this is sufficient to
> > explain the difference.
> If running a half-decent version of cyclictest, sleep states are
> generally disabled while cyclictest is running. Please watch the line
> # /dev/cpu_dma_latency set to 0us
> which essentially documents this mechanism. Yes, the name of the variable
> "cpu_dma_latency" is not obvious and cyclictest could do a better job by
> writing
> Wrote 0 to /dev/cpu_dma_latency and keeping the path open to prevent
> all cores from entering any sleep state
> but this is another story.
>
> A patch that was merged to 3.7 allows to individually enable or disable
> sleep states of the ladder governor
> (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=62d6ae880e3e76098d5e345decd2dce443975889).
> It smoothly applies to 3.6-RT as well. This allows to fine-tune the sleep
> states by state and core, while the /dev/cpu_dma_latency mechanism acts
> on all states and cores, e.g. to disable sleep state 2 and all deeper
> states of the ladder governor on core #0, use:
> echo 1 >/sys/devices/system/cpu/cpu0/cpuidle/state2/disable
>
> BTW: To analyze how much time a core spent in a specific sleep state,
> simply look repeatedly at the "time" variable of a core's sleep state,
> e.g. for core #0:
> # for i in /sys/devices/system/cpu/cpu0/cpuidle/state[0-4]
> > do
> > echo -e "`cat $i/name`:\t`cat $i/time`"
> > done
> POLL: 1342984105
> C1-IVB: 737109
> C3-IVB: 3852451
> C6-IVB: 1702683112
> C7-IVB: 4366946606
> While cyclictest is running with /dev/cpu_dma_latency set to 0, only the
> POLL state times are increasing.
>
> -Carsten.
Thanks for the reply! As I wrote in my reply to Frank, I'm not completely
sure if P states are correctly implemented in our system. We're using a
custom BIOS for our custom board, and while P states do show up and are
modifiable (I've currently installed the userspace-governor, and am
manually setting the clock-frequency to the lowest possible at startup),
our board guy is not sure that changing it actually has any effect on the
processor. Yay...:/
So until I get a go-ahead on this, I'll hold off on further investigation.
Best regards,
Simon
* RE: Real-time kernel thread performance and optimization
2012-11-30 22:31 ` Frank Rowand
` (2 preceding siblings ...)
2012-12-11 14:30 ` Simon Falsig
@ 2012-12-12 15:39 ` Simon Falsig
3 siblings, 0 replies; 16+ messages in thread
From: Simon Falsig @ 2012-12-12 15:39 UTC (permalink / raw)
To: linux-rt-users
And another tweak. Having looked at the actual code for usleep_range(),
I'm now basically doing the same thing (calling schedule_hrtimeout_range()),
but in absolute mode. This makes everything simpler, as I can then just
keep forwarding my original timeout by 10 ms instead of having to keep
track of how long my work took - and it also minimizes drift.
In addition to the changes below, I've also switched to a tickless kernel,
btw.
static int bus_rt_timer_thread(void *arg)
{
        struct custombus_device_driver *cbdrv, *next_cbdrv;

        printk(KERN_INFO "RT Thread entering loop\n");

        ktime_t timeout = ktime_get();
        while (!kthread_should_stop()) {
                rt_mutex_lock(&list_10ms_mutex);
                list_for_each_entry_safe(cbdrv, next_cbdrv, &polling_10ms_list, poll_list) {
                        driver_for_each_device(&cbdrv->driver, NULL, NULL,
                                               cbdrv->poll_function);
                }
                rt_mutex_unlock(&list_10ms_mutex);

                timeout = ktime_add_us(timeout, 10000);
                __set_current_state(TASK_UNINTERRUPTIBLE);
                schedule_hrtimeout_range(&timeout, 100, HRTIMER_MODE_ABS);
        }
        printk(KERN_INFO "RT Thread exited\n");
        return 0;
}

int __init bus_timer_interrupt_init(void)
{
        struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };

        thread_10ms = kthread_create(bus_rt_timer_thread, NULL, "bus_10ms");
        if (IS_ERR(thread_10ms)) {
                printk(KERN_ERR "RT Failed to create RT thread\n");
                return -ESRCH;
        }

        sched_setscheduler(thread_10ms, SCHED_FIFO, &param);
        wake_up_process(thread_10ms);

        printk(KERN_INFO "RT Timer thread installed with standard priority %d.\n",
               param.sched_priority);
        return 0;
}
This implementation is again even simpler, and also performs better.
30-minute stress test results:
Old implementation (original hr_timer):
Cycles over 10.3 ms: 3144
Cycles under 9.7 ms: 3852
Max. cycletime: ~56.5 ms
New implementation (usleep_range):
Cycles over 10.3 ms: 26
Cycles under 9.7 ms: 0
Max. cycletime: ~10.4 ms
New new implementation (schedule_hrtimeout_range):
Cycles over 10.3 ms: 0
Cycles under 9.7 ms: 0
Max. cycletime: ~10.2 ms
Again, comments, questions, or anything else are welcome.
Best regards,
Simon Falsig
* Re: Real-time kernel thread performance and optimization
2012-12-11 14:30 ` Simon Falsig
@ 2012-12-17 22:18 ` Frank Rowand
2012-12-20 0:11 ` Darren Hart
0 siblings, 1 reply; 16+ messages in thread
From: Frank Rowand @ 2012-12-17 22:18 UTC (permalink / raw)
To: Simon Falsig; +Cc: linux-rt-users@vger.kernel.org, jkacur, dvhart
On 12/11/12 06:30, Simon Falsig wrote:
< snip >
>>> Once I get this finished up, I'll be happy to do a complete write-up
>>> of the timer-thread code, if anyone is interested. I remember looking
>>> for something similar (but without success), when I wrote the code
>>> earlier this year.
>>
>> It would be very useful to add your results to the wiki.
>>
>> -Frank
>
> Cool - is there any particular place it should go? A how-to, FAQ entry,
> etc? Just so I know how to do the write-up...
https://rt.wiki.kernel.org/index.php/Main_Page would be my default
suggestion. I'm not quite sure where on the wiki would be good
though. Maybe under "Tips and Techniques"?
I added the rtwiki maintainers to the cc: list.
-Frank
* Re: Real-time kernel thread performance and optimization
2012-12-11 14:43 ` Simon Falsig
@ 2012-12-19 8:10 ` Carsten Emde
2012-12-20 8:09 ` Simon Falsig
0 siblings, 1 reply; 16+ messages in thread
From: Carsten Emde @ 2012-12-19 8:10 UTC (permalink / raw)
To: Simon Falsig; +Cc: frank.rowand, linux-rt-users
Simon,
>>>> [..]
>>>> Bonus-question:
>>>> - Additionally, I've tried running cyclictest alongside with all
>>>> the above, and it actually performs rather well, without any
>>>> substantial spikes. A strange thing is though, that the results are
>>>> actually better with load than without? (running with -t1 -p 80 -n -i 10000 -l 10000)
>>>> - Loaded: Min: 16, Avg: 41, Max: 177
>>>> - No load: Min: 16, Avg: 97, Max: 263
>>>
>>> If the system is less loaded, then the idle thread might be able to
>>> enter deeper levels of sleep. Deeper levels of sleep have larger
>>> latencies to exit. You would have to look at your processor specific
>>> values for exiting sleep states to see if this is sufficient to
>>> explain the difference.
>> If running a half-decent version of cyclictest, sleep states are generally
>> disabled while cyclictest is running. Please watch the line
>> # /dev/cpu_dma_latency set to 0us
>> which essentially documents this mechanism. Yes, the name of the variable
>> "cpu_dma_latency" is not obvious and cyclictest could do a better job by
>> writing
>> Wrote 0 to /dev/cpu_dma_latency and keeping the path open to prevent
>> all cores from entering any sleep state but this is another story.
>>
>> A patch that was merged to 3.7 allows to individually enable or disable sleep
>> states of the ladder governor
>> (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=62d6ae880e3e76098d5e345decd2dce443975889).
>> It smoothly applies to 3.6-RT as well. This allows to fine-tune the sleep states
>> by state and core, while the /dev/cpu_dma_latency mechanism acts on all
>> states and cores, e.g. to disable sleep state 2 and all deeper states of the
>> ladder governor on core #0, use:
>> echo 1>/sys/devices/system/cpu/cpu0/cpuidle/state2/disable
>>
>> BTW: To analyze how much time a core spent in a specific sleep state, simply
>> look repeatedly at the "time" variable of a core's sleep state, e.g. for core #0:
>> # for i in /sys/devices/system/cpu/cpu0/cpuidle/state[0-4]
>> > do
>> > echo -e "`cat $i/name`:\t`cat $i/time`"
>> > done
>> POLL: 1342984105
>> C1-IVB: 737109
>> C3-IVB: 3852451
>> C6-IVB: 1702683112
>> C7-IVB: 4366946606
>> While cyclictest is running with /dev/cpu_dma_latency set to 0, only the POLL
>> state times are increasing.
> Thanks for the reply! As I wrote in my reply to Frank, I'm not completely
> sure if P states are correctly implemented in our system. We're using a
> custom BIOS for our custom board, and while P states do show up and are
> modifiable (I've currently installed the userspace-governor, and am
> manually setting the clock-frequency to the lowest possible at startup),
> our board guy is not sure that changing it actually has any effect on the
> processor. Yay...:/
Sorry, but this is a complete misunderstanding. C states and P states
are very different
(http://software.intel.com/en-us/blogs/2008/03/12/c-states-and-p-states-are-very-different).
The point made by Frank and my answer relate to the C states (aka sleep
states) that a processor may enter when idle. The Linux C state interface is
called cpuidle. The P states you are referring to are related to the
processor's clock frequency that may be lowered at any time irrespective
of idle state. The Linux P state interface is called cpufreq. P states
generally affect the real-time capabilities in a linear and proportional
way, e.g. a CPU board with a worst-case latency of 100 microseconds at 1
GHz will have a latency of approximately 200 microseconds at 500 MHz.
When idle and in deep C state, however, the processor may take several
milliseconds to wake up and answer an asynchronous external event. This
is why deep C states should be disabled in a real-time system that may
become idle. And this is why I mentioned the new interface that makes it
possible to individually disable a particular sleep state of a particular
processor core to ensure its deterministic behavior while the other cores
may still run in energy-saving mode.
Hope this helps,
Carsten.
* Re: Real-time kernel thread performance and optimization
2012-12-03 14:15 ` Carsten Emde
2012-12-11 14:43 ` Simon Falsig
@ 2012-12-19 14:59 ` John Kacur
2012-12-19 15:20 ` Carsten Emde
1 sibling, 1 reply; 16+ messages in thread
From: John Kacur @ 2012-12-19 14:59 UTC (permalink / raw)
To: Carsten Emde; +Cc: frank.rowand, Simon Falsig, linux-rt-users@vger.kernel.org
On Mon, 3 Dec 2012, Carsten Emde wrote:
> On 11/30/2012 11:31 PM, Frank Rowand wrote:
> > On 11/30/12 07:46, Simon Falsig wrote:
> > > [..]
> > > Bonus-question:
> > > - Additionally, I've tried running cyclictest alongside with all the
> > > above, and it actually performs rather well, without any substantial
> > > spikes. A strange thing is though, that the results are actually better
> > > with load than without? (running with -t1 -p 80 -n -i 10000 -l 10000)
> > > - Loaded: Min: 16, Avg: 41, Max: 177
> > > - No load: Min: 16, Avg: 97, Max: 263
> >
> > If the system is less loaded, then the idle thread might be able to
> > enter deeper levels of sleep. Deeper levels of sleep have larger
> > latencies to exit. You would have to look at your processor specific
> > values for exiting sleep states to see if this is sufficient to explain
> > the difference.
> If running a half-decent version of cyclictest, sleep states are generally
> disabled while cyclictest is running. Please watch the line
> # /dev/cpu_dma_latency set to 0us
> which essentially documents this mechanism. Yes, the name of the variable
> "cpu_dma_latency" is not obvious and cyclictest could do a better job by
> writing
> Wrote 0 to /dev/cpu_dma_latency and keeping the path open to prevent
> all cores from entering any sleep state
> but this is another story.
Not sure what you mean here, doesn't it keep the path open?
John
* Re: Real-time kernel thread performance and optimization
2012-12-19 14:59 ` John Kacur
@ 2012-12-19 15:20 ` Carsten Emde
0 siblings, 0 replies; 16+ messages in thread
From: Carsten Emde @ 2012-12-19 15:20 UTC (permalink / raw)
To: John Kacur; +Cc: frank.rowand, Simon Falsig, linux-rt-users@vger.kernel.org
Hi John,
>> [..]
>> If running a half-decent version of cyclictest, sleep states are generally
>> disabled while cyclictest is running. Please watch the line
>> # /dev/cpu_dma_latency set to 0us
>> which essentially documents this mechanism. Yes, the name of the variable
>> "cpu_dma_latency" is not obvious and cyclictest could do a better job by
>> writing
>> Wrote 0 to /dev/cpu_dma_latency and keeping the path open to prevent
>> all cores from entering any sleep state
>> but this is another story.
> Not sure what you mean here, doesn't it keep the path open?
No, no, cyclictest does the right thing and keeps the path open as
required. This is all good.
The message "/dev/cpu_dma_latency set to 0us" apparently is not clear
enough. It has nothing to do with DMA but the purpose of setting the
kernel variable cpu_dma_latency to 0 is to generally disable sleep
states. But I better should have submitted a patch.
-Carsten.
* Re: Real-time kernel thread performance and optimization
2012-12-17 22:18 ` Frank Rowand
@ 2012-12-20 0:11 ` Darren Hart
2012-12-20 8:21 ` Simon Falsig
0 siblings, 1 reply; 16+ messages in thread
From: Darren Hart @ 2012-12-20 0:11 UTC (permalink / raw)
To: frank.rowand; +Cc: Simon Falsig, linux-rt-users@vger.kernel.org, jkacur
On 12/17/2012 02:18 PM, Frank Rowand wrote:
> On 12/11/12 06:30, Simon Falsig wrote:
>
> < snip >
>
>>>> Once I get this finished up, I'll be happy to do a complete write-up
>>>> of the timer-thread code, if anyone is interested. I remember looking
>>>> for something similar (but without success), when I wrote the code
>>>> earlier this year.
>>>
>>> It would be very useful to add your results to the wiki.
>>>
>>> -Frank
>>
>> Cool - is there any particular place it should go? A how-to, FAQ entry,
>> etc? Just so I know how to do the write-up...
>
> https://rt.wiki.kernel.org/index.php/Main_Page would be my default
> suggestion. I'm not quite sure where on the wiki would be good
> though. Maybe under "Tips and Techniques"?
>
> I added the rtwiki maintainers to the cc: list.
I don't have all the context, but this sounds a bit more like something
for linux/Documentation (possibly for the preempt-rt patch set). If not,
the Documentation section on the wiki is a possibility.
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Technical Lead - Linux Kernel
* RE: Real-time kernel thread performance and optimization
2012-12-19 8:10 ` Carsten Emde
@ 2012-12-20 8:09 ` Simon Falsig
0 siblings, 0 replies; 16+ messages in thread
From: Simon Falsig @ 2012-12-20 8:09 UTC (permalink / raw)
To: Carsten Emde; +Cc: frank.rowand, linux-rt-users
>
> Simon,
>
> >>>> [..]
> >>>> Bonus-question:
> >>>> - Additionally, I've tried running cyclictest alongside with all
> >>>> the above, and it actually performs rather well, without any
> >>>> substantial spikes. A strange thing is though, that the results are
> >>>> actually better with load than without? (running with -t1 -p 80 -n -i 10000 -l 10000)
> >>>> - Loaded: Min: 16, Avg: 41, Max: 177
> >>>> - No load: Min: 16, Avg: 97, Max: 263
> >>>
> >>> If the system is less loaded, then the idle thread might be able to
> >>> enter deeper levels of sleep. Deeper levels of sleep have larger
> >>> latencies to exit. You would have to look at your processor
> >>> specific values for exiting sleep states to see if this is
> >>> sufficient to explain the difference.
> >> If running a half-decent version of cyclictest, sleep states are
> >> generally disabled while cyclictest is running. Please watch the line
> >> # /dev/cpu_dma_latency set to 0us
> >> which essentially documents this mechanism. Yes, the name of the
> >> variable "cpu_dma_latency" is not obvious and cyclictest could do a
> >> better job by writing
> >> Wrote 0 to /dev/cpu_dma_latency and keeping the path open to prevent
> >> all cores from entering any sleep state but this is another story.
> >>
> >> A patch that was merged to 3.7 allows to individually enable or
> >> disable sleep states of the ladder governor
> >> (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=62d6ae880e3e76098d5e345decd2dce443975889).
> >> It smoothly applies to 3.6-RT as well. This allows to fine-tune the
> >> sleep states by state and core, while the /dev/cpu_dma_latency
> >> mechanism acts on all states and cores, e.g. to disable sleep state 2
> >> and all deeper states of the ladder governor on core #0, use:
> >> echo 1>/sys/devices/system/cpu/cpu0/cpuidle/state2/disable
> >>
> >> BTW: To analyze how much time a core spent in a specific sleep state,
> >> simply look repeatedly at the "time" variable of a core's sleep
state, e.g.
> for core #0:
> >> # for i in /sys/devices/system/cpu/cpu0/cpuidle/state[0-4]
> >> > do
> >> > echo -e "`cat $i/name`:\t`cat $i/time`"
> >> > done
> >> POLL: 1342984105
> >> C1-IVB: 737109
> >> C3-IVB: 3852451
> >> C6-IVB: 1702683112
> >> C7-IVB: 4366946606
> >> While cyclictest is running with /dev/cpu_dma_latency set to 0, only
> >> the POLL state times are increasing.
> > Thanks for the reply! As I wrote in my reply to Frank, I'm not
> > completely sure if P states are correctly implemented in our system.
> > We're using a custom BIOS for our custom board, and while P states do
> > show up and are modifiable (I've currently installed the
> > userspace-governor, and am manually setting the clock-frequency to the
> > lowest possible at startup), our board guy is not sure that changing
> > it actually has any effect on the processor. Yay...:/
> Sorry, but this is a complete misunderstanding. C states and P states
> are very different
> (http://software.intel.com/en-us/blogs/2008/03/12/c-states-and-p-states-are-very-different).
> The point made by Frank and my answer related to C states (aka sleep
> states) a processor may enter when idle. The Linux C state interface is
> called cpuidle. The P states you are referring to are related to the
> processor's clock frequency that may be lowered at any time irrespective
> of idle state. The Linux P state interface is called cpufreq. P states
> generally affect the real-time capabilities in a linear and proportional
> way, e.g. a CPU board with a worst-case latency of 100 microseconds at
> 1 GHz will have a latency of approximately 200 microseconds at 500 MHz.
> When idle and in deep C state, however, the processor may take several
> milliseconds to wake up and answer an asynchronous external event. This
> is why deep C states should be disabled in a real-time system that may
> become idle. And this is why I mentioned the new interface that allows
> to individually disable a particular sleep state of a particular
> processor core to ensure its deterministic behavior while the other
> cores still may run in energy-saving mode.
>
> Hope this helps,
> Carsten.
Ah, I actually came to suspect as much after I posted the above. But I
presume C-states also need to be supported in the BIOS? We have a new
revision of our board (including an updated BIOS) coming along soon(ish),
so I'll try having a further look at things once I get my hands on it.
In any case, thank you for the nice explanation - much appreciated!
Best regards,
Simon Falsig
* RE: Real-time kernel thread performance and optimization
2012-12-20 0:11 ` Darren Hart
@ 2012-12-20 8:21 ` Simon Falsig
2013-01-02 17:21 ` Darren Hart
0 siblings, 1 reply; 16+ messages in thread
From: Simon Falsig @ 2012-12-20 8:21 UTC (permalink / raw)
To: Darren Hart, frank.rowand; +Cc: linux-rt-users, jkacur
On 12/20/2012 01:12 AM, Darren Hart wrote:
> On 12/17/2012 02:18 PM, Frank Rowand wrote:
> > On 12/11/12 06:30, Simon Falsig wrote:
> >
> > < snip >
> >
> >>>> Once I get this finished up, I'll be happy to do a complete
> >>>> write-up of the timer-thread code, if anyone is interested. I
> >>>> remember looking for something similar (but without success), when
> >>>> I wrote the code earlier this year.
> >>>
> >>> It would be very useful to add your results to the wiki.
> >>>
> >>> -Frank
> >>
> >> Cool - is there any particular place it should go? A how-to, FAQ
> >> entry, etc? Just so I know how to do the write-up...
> >
> > https://rt.wiki.kernel.org/index.php/Main_Page would be my default
> > suggestion. I'm not quite sure where on the wiki would be good
> > though. Maybe under "Tips and Techniques"?
> >
> > I added the rtwiki maintainers to the cc: list.
>
> I don't have all the context, but this sounds a bit more like something for
> linux/Documentation (possibly for the preempt-rt patch set). If not, the
> Documentation section on the wiki is a possibility.
> --
> Darren Hart
> Intel Open Source Technology Center
> Yocto Project - Technical Lead - Linux Kernel
As I see it, the write-up could be done in two ways - 1) as a simple code
example of a real-time loop in a kernel module, 2) as a blog-like post of
the process I went through, investigating the performance, and optimizing
my code.
In the case of 1), I guess it could be added to
https://rt.wiki.kernel.org/index.php/RT_PREEMPT_HOWTO, as a kernel version
of the realtime example, or possibly to
https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application under
"Building Device Drivers"? In the case of 2) though, it could maybe be on
its own page under "Tips and techniques"?
Best regards,
Simon Falsig
* Re: Real-time kernel thread performance and optimization
2012-12-20 8:21 ` Simon Falsig
@ 2013-01-02 17:21 ` Darren Hart
0 siblings, 0 replies; 16+ messages in thread
From: Darren Hart @ 2013-01-02 17:21 UTC (permalink / raw)
To: Simon Falsig; +Cc: frank.rowand, linux-rt-users, jkacur
On 12/20/2012 12:21 AM, Simon Falsig wrote:
> On 12/20/2012 01:12 AM, Darren Hart wrote:
>> On 12/17/2012 02:18 PM, Frank Rowand wrote:
>>> On 12/11/12 06:30, Simon Falsig wrote:
>>>
>>> < snip >
>>>
>>>>>> Once I get this finished up, I'll be happy to do a complete
>>>>>> write-up of the timer-thread code, if anyone is interested. I
>>>>>> remember looking for something similar (but without success), when
>>>>>> I wrote the code earlier this year.
>>>>>
>>>>> It would be very useful to add your results to the wiki.
>>>>>
>>>>> -Frank
>>>>
>>>> Cool - is there any particular place it should go? A how-to, FAQ
>>>> entry, etc? Just so I know how to do the write-up...
>>>
>>> https://rt.wiki.kernel.org/index.php/Main_Page would be my default
>>> suggestion. I'm not quite sure where on the wiki would be good
>>> though. Maybe under "Tips and Techniques"?
>>>
>>> I added the rtwiki maintainers to the cc: list.
>>
>> I don't have all the context, but this sounds a bit more like something for
>> linux/Documentation (possibly for the preempt-rt patch set). If not, the
>> Documentation section on the wiki is a possibility.
>> --
>> Darren Hart
>> Intel Open Source Technology Center
>> Yocto Project - Technical Lead - Linux Kernel
>
> As I see it, the write-up could be done in two ways - 1) as a simple code
> example of a real-time loop in a kernel module, 2) as a blog-like post of
> the process I went through, investigating the performance, and optimizing
> my code.
>
> In the case of 1), I guess it could be added to
> https://rt.wiki.kernel.org/index.php/RT_PREEMPT_HOWTO, as a kernel version
> of the realtime example, or possibly to
> https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application under
> "Building Device Drivers"? In the case of 2) though, it could maybe be on
> its own page under "Tips and techniques"?
I'd leave the exploration type write-up to your blog and we can link to
it. An explicit example in one of the locations above also sounds
appropriate.
Thanks,
Darren
* Re: Real-time kernel thread performance and optimization
@ 2013-07-11 6:32 Simon Falsig
0 siblings, 0 replies; 16+ messages in thread
From: Simon Falsig @ 2013-07-11 6:32 UTC (permalink / raw)
To: linux-rt-users; +Cc: dvhart, frank.rowand, jkacur
>On 12/20/2012 12:21 AM, Simon Falsig wrote:
>> On 12/20/2012 01:12 AM, Darren Hart wrote:
>>> On 12/17/2012 02:18 PM, Frank Rowand wrote:
>>>> On 12/11/12 06:30, Simon Falsig wrote:
>>>>
>>>> < snip >
>>>>
>>>>>>> Once I get this finished up, I'll be happy to do a complete
>>>>>>> write-up of the timer-thread code, if anyone is interested. I
>>>>>>> remember looking for something similar (but without success), when
>>>>>>> I wrote the code earlier this year.
>>>>>>
>>>>>> It would be very useful to add your results to the wiki.
>>>>>>
>>>>>> -Frank
>>>>>
>>>>> Cool - is there any particular place it should go? A how-to, FAQ
>>>>> entry, etc? Just so I know how to do the write-up...
>>>>
>>>> https://rt.wiki.kernel.org/index.php/Main_Page would be my default
>>>> suggestion. I'm not quite sure where on the wiki would be good
>>>> though. Maybe under "Tips and Techniques"?
>>>>
>>>> I added the rtwiki maintainers to the cc: list.
>>>
>>> I don't have all the context, but this sounds a bit more like something for
>>> linux/Documentation (possibly for the preempt-rt patch set). If not, the
>>> Documentation section on the wiki is a possibility.
>>> --
>>> Darren Hart
>>> Intel Open Source Technology Center
>>> Yocto Project - Technical Lead - Linux Kernel
>>
>> As I see it, the write-up could be done in two ways - 1) as a simple code
>> example of a real-time loop in a kernel module, 2) as a blog-like post of
>> the process I went through, investigating the performance, and optimizing
>> my code.
>>
>> In the case of 1), I guess it could be added to
>> https://rt.wiki.kernel.org/index.php/RT_PREEMPT_HOWTO, as a kernel version
>> of the realtime example, or possibly to
>> https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application under
>> "Building Device Drivers"? In the case of 2) though, it could maybe be on
>> its own page under "Tips and techniques"?
>
> I'd leave the exploration type write-up to your blog and we can link to
> it. An explicit example in one of the locations above also sounds
> appropriate.
>
> Thanks,
>
> Darren
So, this has been long overdue, but I finally got around to finishing the
write-up on my own blog. If anyone is still interested, it can be found as
a 3-part story here:
http://www.falsig.org/simon/blog/2013/03/30/real-time-linux-kernel-drivers-part-1-the-setup/
http://www.falsig.org/simon/blog/2013/07/10/real-time-linux-kernel-drivers-part-2-testing-and-first-implementation/
http://www.falsig.org/simon/blog/2013/07/10/real-time-linux-kernel-drivers-part-3-the-better-implementation/
Any comments are more than welcome - and feel free to adapt any of the
code examples and put them on the wiki, if that could be of any help.
Best regards,
Simon