From: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
To: Xenomai <Xenomai@xenomai.org>
Subject: Re: [Xenomai] RFC: slow tsc optimization
Date: Fri, 07 Sep 2012 13:29:26 +0200
Message-ID: <5049DA96.90200@xenomai.org>
In-Reply-To: <50484FCA.50909@xenomai.org>

On 09/06/2012 09:24 AM, Gilles Chanteperdrix wrote:

> 
> Hi,
> 
> The last few days, I have been working on getting the "rdtsc"
> instruction replaced with a call to a tsc emulation function dynamically
> at run time. It turned out to be easy with the Linux "alternative"
> mechanism, since it implements replacements based on CPU capabilities,
> and the TSC is such a capability. This modification allows to compile a
> kernel with Xenomai that will run on any x86_32 platform.
> 
> Now, when running kernels without tsc using the PIT based tsc emulation,
> I found out something pretty obvious, this PIT based tsc emulation is
> slow, it takes 4us every time we call it. And the nucleus reads the tsc
> a number of times when a timer interrupt happens up to the wake up of
> the latency user-space task:
> - at the very beginning of the timer interrupt
> - after the execution of the latency thread timer
> - in the timer programming function, to compute the timer delay
> - in the middle of the context switch, if the "statistics collection"
> feature is enabled
> - in the xnpod_wait_thread_periodic function, after the context switch,
> in order to compute the number of timer overruns.
> 
> That is 20us, and the thread is not yet running in user-space.
> 
> So, I have been thinking about reducing the number of calls to the PIT,
> unfortunately keeping the last tsc value around and reusing it is a bit
> heavy, and implies modifications which are completely useless for the
> non PIT case (which should be the vast majority), and in fact, the tsc
> emulation code has to keep the last read value, since it is required to
> convert clocksources with less than 64 bits to a 64 bits value. So, I
> propose the following approach:
> 
> The I-pipe core will provide two tsc reading functions:
> ipipe_read_tsc which reads the counter
> ipipe_read_tsc_fast which will read the tsc if the cpu has a tsc, or
> return the last value read if the tsc is emulated.
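
The 64-bit conversion mentioned in the quoted text (keeping the last value
read in order to widen a narrow clocksource) can be sketched as follows.
This is only a user-space illustration under stated assumptions: the
counter counts up, it is read at least once per wrap period, and the names
struct tsc_ext and extend_counter are hypothetical, not I-pipe code.

```c
#include <stdint.h>

/* Hypothetical state: the last 64-bit value handed out. */
struct tsc_ext {
    uint64_t last;
};

/*
 * Widen a raw reading of an up-counting counter that is 'bits' wide
 * (1..32) to 64 bits. Correct only if the counter wrapped at most once
 * since the previous call.
 */
static uint64_t extend_counter(struct tsc_ext *e, uint32_t raw, unsigned bits)
{
    uint64_t mask = (1ULL << bits) - 1;
    /* Ticks elapsed since the last read, modulo the counter width. */
    uint64_t delta = ((uint64_t)raw - (e->last & mask)) & mask;

    e->last += delta;
    return e->last;
}
```

Because the state has to be kept anyway, returning e->last without touching
the hardware comes for free, which is what makes a "fast" read cheap.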


An update on this work. In practice, the fast read should be the most
common operation, and actually reading the PIT counter the exception. The
slow tsc then has to be read at a few critical points so that the fast
value stays reasonably accurate. So, compared to the proposal above, I
ended up implementing:
- ipipe_read_tsc, which returns the last tsc value read, and is replaced
with rdtsc when available;
- ipipe_read_slow_tsc, which reads the emulated tsc, and is also replaced
with rdtsc when available;
- ipipe_touch_tsc, which reads the emulated tsc, but is replaced with a
nop when rdtsc is available.
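
To make the semantics of the three accessors concrete, here is a small
user-space model. It is a sketch, not the I-pipe implementation: the
has_tsc flag stands in for the run-time patching done by the "alternative"
mechanism, hw_counter stands in for the time source, and all model_* names
are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; none of these names exist in the I-pipe. */
struct tsc_model {
    bool has_tsc;         /* true: real rdtsc, false: PIT emulation */
    uint64_t hw_counter;  /* stands in for the hardware time source */
    uint64_t last_tsc;    /* value cached by the last emulated read */
    unsigned slow_reads;  /* number of expensive emulated reads */
};

/* Models ipipe_read_slow_tsc: always returns a fresh value. */
static uint64_t model_read_slow_tsc(struct tsc_model *m)
{
    if (m->has_tsc)
        return m->hw_counter;   /* would be a plain rdtsc */
    m->slow_reads++;            /* the costly PIT access */
    m->last_tsc = m->hw_counter;
    return m->last_tsc;
}

/* Models ipipe_read_tsc: cached value when emulated, rdtsc otherwise. */
static uint64_t model_read_tsc(struct tsc_model *m)
{
    return m->has_tsc ? m->hw_counter : m->last_tsc;
}

/* Models ipipe_touch_tsc: refreshes the cache, no-op with a real tsc. */
static void model_touch_tsc(struct tsc_model *m)
{
    if (!m->has_tsc)
        (void)model_read_slow_tsc(m);
}
```

With has_tsc set, none of the three functions ever takes the slow path,
which matches the goal of leaving the non-PIT case unaffected.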

Now, the real remaining problem is where to call
ipipe_touch_tsc/ipipe_read_slow_tsc, in order to get reasonable accuracy
without reading the emulated tsc too often.

I implemented the following approach: the tsc is read at every entry
point of the nucleus, that is: interrupts, xenomai syscalls, events for
xenomai tasks. We also need to reread the tsc before programming the
next shot, in order to avoid programming delays that are too long (with a
restart of xntimer_tick_aperiodic if we find out that the delay is too
short, instead of going through another irq). All in all, these are fairly
lightweight modifications, and the latency test results seem reasonable,
even on a kernel with statistics collection enabled. I suspect the
statistics are a bit off, but at least they are there.
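
The reread before programming the next shot can be sketched as below.
This is a user-space illustration only: next_shot_delay, RESTART_TICK and
the min_delay threshold are assumptions made for the example, not nucleus
code.

```c
#include <stdint.h>

/* Hypothetical sentinel: the caller should rerun the tick handler. */
#define RESTART_TICK ((int64_t)-1)

/*
 * Compute the delay to program into the timer from a freshly read tsc.
 * If the next date is already past, or closer than min_delay, tell the
 * caller to restart the tick handler instead of taking another irq.
 */
static int64_t next_shot_delay(uint64_t fresh_tsc, uint64_t next_date,
                               uint64_t min_delay)
{
    if (next_date <= fresh_tsc || next_date - fresh_tsc < min_delay)
        return RESTART_TICK;
    return (int64_t)(next_date - fresh_tsc);
}
```

For example, with a value cached at 1000 and a next date of 1100, using the
cached value would program a delay of 100 ticks; if the fresh read returns
1040, the delay actually programmed is 60.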

Since we read the tsc twice per interrupt, and each read takes 4us, the
minimum latency is around 8us. I thought about including the tsc read
latency (twice) in the nktimerlat latency, but this results in negative
latencies. In any case, we should leave that choice to the user, who can
make the adjustment through /proc/xenomai/latency.

Now the remaining issues are:
- kernel-space code. We can trap insmod/rmmod in losyscall, but if an
RTDM driver ioctl method takes a long time to execute, or if a
kernel-space thread runs for a long time before calling xenomai services,
it may use stale clock data.
- a syscall always takes at least 4us. That is a bit stupid when, for
instance, you want to lock a mutex: we read the tsc, lock the mutex, then
return to user space. Working around this seems complicated. We could for
instance add a "NOTSC" syscall flag to indicate that the tsc should not be
read before calling the syscall handler, but correctly modifying the
syscall tables to add this flag to the proper syscalls is probably not so
easy. For instance, when statistics collection is enabled, we want to read
the tsc before locking the mutex, since if there is a context switch, we
will need the value for updating the statistics.
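
To illustrate the "NOTSC" idea, here is a rough user-space sketch of a
dispatcher consulting such a flag. Everything in it is hypothetical: the
SC_NOTSC value, the sc_entry layout, the stub handlers, and in particular
the rule that enabling statistics overrides the flag, which is just one
possible way to handle the mutex case described above.

```c
#include <stdbool.h>

#define SC_NOTSC 0x1u   /* hypothetical: skip the tsc read for this call */

struct sc_entry {
    int (*handler)(void);
    unsigned flags;
};

static unsigned touch_count;            /* counts simulated slow reads */

static void touch_tsc(void)
{
    touch_count++;                      /* stands in for ipipe_touch_tsc */
}

static int mutex_lock_stub(void)  { return 0; }
static int timer_start_stub(void) { return 0; }

/* Refresh the tsc before the handler unless the entry opts out. */
static int dispatch(const struct sc_entry *e, bool stats_enabled)
{
    if (!(e->flags & SC_NOTSC) || stats_enabled)
        touch_tsc();
    return e->handler();
}
```

The hard part remains deciding, for each syscall, whether it may safely
carry the flag; the sketch does nothing to make that audit easier.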

-- 
                                                                Gilles.

