From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <5049DA96.90200@xenomai.org>
Date: Fri, 07 Sep 2012 13:29:26 +0200
From: Gilles Chanteperdrix
MIME-Version: 1.0
References: <50484FCA.50909@xenomai.org>
In-Reply-To: <50484FCA.50909@xenomai.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [Xenomai] RFC: slow tsc optimization
List-Id: Discussions about the Xenomai project
To: Xenomai

On 09/06/2012 09:24 AM, Gilles Chanteperdrix wrote:
>
> Hi,
>
> The last few days, I have been working on getting the "rdtsc"
> instruction replaced with a call to a tsc emulation function dynamically
> at run time. It turned out to be easy with the Linux "alternative"
> mechanism, since it implements replacements based on CPU capabilities,
> and the TSC is such a capability. This modification allows to compile a
> kernel with Xenomai that will run on any x86_32 platform.
>
> Now, when running kernels without tsc using the PIT based tsc emulation,
> I found out something pretty obvious, this PIT based tsc emulation is
> slow, it takes 4us every time we call it. And the nucleus reads the tsc
> a number of times when a timer interrupt happens up to the wake up of
> the latency user-space task:
> - at the very beginning of the timer interrupt
> - after the execution of the latency thread timer
> - in the timer programming function, to compute the timer delay
> - in the middle of the context switch, if the "statistics collection"
> feature is enabled
> - in the xnpod_wait_thread_periodic function, after the context switch,
> in order to compute the number of timer overruns.
>
> That is 20us, and the thread is not yet running in user-space.
>
> So, I have been thinking about reducing the number of calls to the PIT,
> unfortunately keeping the last tsc value around and reusing it is a bit
> heavy, and implies modifications which are completely useless for the
> non PIT case (which should be the vast majority), and in fact, the tsc
> emulation code has to keep the last read value, since it is required to
> convert clocksources with less than 64 bits to a 64 bits value. So, I
> propose the following approach:
>
> The I-pipe core will provide two tsc reading functions:
> ipipe_read_tsc which reads the counter
> ipipe_read_tsc_fast which will read the tsc if the cpu has a tsc, or
> return the last value read if the tsc is emulated.

An update on this work. In fact, "read_tsc_fast" should be the most
common operation, and actually reading the PIT counter should not. The
slow tsc should be read at a few critical points, so that the fast tsc
stays reasonably accurate. So, what I implemented is:

- ipipe_read_tsc, which returns the last tsc value read, and is
replaced with rdtsc when available;
- ipipe_read_slow_tsc, which reads the emulated tsc, and is also
replaced with rdtsc when available;
- ipipe_touch_tsc, which reads the emulated tsc, but is replaced with a
nop when rdtsc is available.

Now, the real remaining problem is where to use
ipipe_touch_tsc/ipipe_read_slow_tsc, to get reasonable accuracy without
reading the hardware tsc too often. I implemented the following
approach: the tsc is read at every entry point of the nucleus, that is:
interrupts, xenomai syscalls, events for xenomai tasks. We also need to
reread the tsc before programming the next shot, in order to avoid
programming too long delays (with a restart of xntimer_tick_aperiodic if
we find out that the delay is too short, instead of going through
another irq).

All in all, these are fairly lightweight modifications, and the latency
test results seem reasonable, even on a kernel with statistics
collection enabled.
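
To make the split between the three accessors concrete, here is a
minimal user-space sketch of the scheme described above. It is not the
I-pipe code: the run-time flag cpu_has_tsc stands in for the boot-time
"alternative" patching, and sim_counter, emulated_tsc_read, hw_rdtsc,
slow_reads and simulate_timer_irq are hypothetical names that only
simulate a time source and count the expensive reads.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins: a real kernel would patch these code paths at
 * boot with the "alternative" mechanism instead of testing a flag. */
static int cpu_has_tsc;        /* 0: tsc is emulated (e.g. PIT based) */
static uint64_t sim_counter;   /* simulated time source */
static uint64_t last_tsc;      /* value cached by the last slow read */
static unsigned slow_reads;    /* number of expensive reads performed */

/* Simulated expensive read of the emulated tsc; refreshes the cache. */
static uint64_t emulated_tsc_read(void)
{
    slow_reads++;
    sim_counter += 100;        /* pretend some time has elapsed */
    last_tsc = sim_counter;
    return last_tsc;
}

/* Simulated rdtsc: cheap, and advances on every call. */
static uint64_t hw_rdtsc(void)
{
    return ++sim_counter;
}

/* Cheap read: cached value when emulated, real tsc otherwise. */
static uint64_t ipipe_read_tsc(void)
{
    return cpu_has_tsc ? hw_rdtsc() : last_tsc;
}

/* Accurate read: pays for the emulated read when there is no tsc. */
static uint64_t ipipe_read_slow_tsc(void)
{
    return cpu_has_tsc ? hw_rdtsc() : emulated_tsc_read();
}

/* Refresh the cached value; does nothing when rdtsc is available. */
static void ipipe_touch_tsc(void)
{
    if (!cpu_has_tsc)
        (void)emulated_tsc_read();
}

/* One simulated timer interrupt: touch at the nucleus entry point, cheap
 * reads in between, one accurate read before programming the next shot.
 * Returns the number of expensive reads this interrupt performed. */
static unsigned simulate_timer_irq(void)
{
    unsigned before = slow_reads;

    ipipe_touch_tsc();              /* entry point of the nucleus */
    (void)ipipe_read_tsc();         /* e.g. after running the timer handler */
    (void)ipipe_read_tsc();         /* e.g. statistics collection */
    (void)ipipe_read_slow_tsc();    /* before programming the next shot */
    return slow_reads - before;
}
```

With cpu_has_tsc left at 0, one simulated interrupt performs two
expensive reads however many times ipipe_read_tsc is called in between;
with cpu_has_tsc set to 1 it performs none.
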
I suspect the statistics are a bit off, but at least they are there.

Since we read the tsc twice per interrupt, and reading it takes 4us, the
minimum latency is around 8us. I thought about including the tsc latency
(twice) into the nktimerlat latency, but this results in negative
latencies, and anyway, we should leave the choice to the user to do that
with /proc/xenomai/latency if he wants.

Now the remaining issues are:
- kernel-space code. We can trap insmod/rmmod in losyscall, but if an
RTDM driver ioctl method takes a long time to execute, or when a
kernel-space thread runs long tasks before calling xenomai services, it
may use old clock data
- the time of a syscall is always at least 4us. That is a bit stupid
when, say, you want to lock a mutex: read the tsc, lock the mutex, then
return to user space. Working around this seems complicated. We could
for instance add a "NOTSC" syscall flag to indicate that the tsc should
not be read before a syscall callback, but correctly modifying the
syscall tables to add this flag to the proper syscalls is probably not
so easy. For instance, when statistics collection is enabled, we want to
read the tsc before locking the mutex, since if there is a context
switch, we will need the value for updating the statistics.

-- 
Gilles.