* 2.6.14-rt13
From: Ingo Molnar @ 2005-11-15 9:08 UTC
To: linux-kernel
Cc: Paul E. McKenney, K.R. Foley, Steven Rostedt, Thomas Gleixner,
pluto, john cooper, Benedikt Spranger, Daniel Walker, Tom Rini,
George Anzinger
i have released the 2.6.14-rt13 tree, which can be downloaded from the
usual place:
http://redhat.com/~mingo/realtime-preempt/
lots of fixes in this release affecting all supported architectures, all
across the board. Big MIPS update from John Cooper.
Changes since 2.6.14-rt1:
- lots of RCU fixes and updates in signal handling and related areas
(Paul E. McKenney)
- big RCU torture-test update (Paul E. McKenney)
- fix netfilter/conntrack crash reported by Paweł Sikora
- big MIPS update (John Cooper)
- ARM updates (Daniel Walker)
- PPC updates (Benedikt Spranger)
- ktimers rounding fix (Thomas Gleixner)
- off by one fix in timespec normalization (George Anzinger)
- lpptest Kconfig dependency fix (Tom Rini)
- clean up get_cpu_tick() -> get_cycles() in blocker, lpptest and
latency.c. (Tom Rini)
- fix ppc32 bootwrapper code for new zlib (Tom Rini)
- rtc histogram fixes merged for real :-) (K.R. Foley)
- fix NMI watchdog false positive (Steven Rostedt, me)
- added the nsleep() kernel API, which uses high-resolution sleeps
- build fix on !PREEMPT_RT
- cleanup of the PER_CPU_LOCKED infrastructure
- fix softlockup false positives triggered by the RCU torture-test.
- do not send a false -ERESTART_RESTARTBLOCK to userspace if the
HRT timer hardware wakes us up early.
to build a 2.6.14-rt13 tree, the following patches should be applied:
http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.14.tar.bz2
http://redhat.com/~mingo/realtime-preempt/patch-2.6.14-rt13
Ingo

* Re: 2.6.14-rt13
From: Mark Knecht @ 2005-11-15 16:36 UTC
To: Ingo Molnar
Cc: linux-kernel

On 11/15/05, Ingo Molnar <mingo@elte.hu> wrote:
> i have released the 2.6.14-rt13 tree, which can be downloaded from the
> usual place:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> lots of fixes in this release affecting all supported architectures, all
> across the board. Big MIPS update from John Cooper.
<SNIP>

2.6.14-rt13 is up and running here. Everything looks fine in the first
couple of hours. Nothing negative to report.

Please let me know if there are any particular features that you'd
like me to look at on an AMD64 machine.

Cheers,
Mark

* Re: 2.6.14-rt13
From: Paul E. McKenney @ 2005-11-15 19:57 UTC
To: Mark Knecht
Cc: Ingo Molnar, linux-kernel

On Tue, Nov 15, 2005 at 08:36:40AM -0800, Mark Knecht wrote:
> On 11/15/05, Ingo Molnar <mingo@elte.hu> wrote:
> > i have released the 2.6.14-rt13 tree, which can be downloaded from the
> > usual place:
> >
> > http://redhat.com/~mingo/realtime-preempt/
> >
> > lots of fixes in this release affecting all supported architectures, all
> > across the board. Big MIPS update from John Cooper.
> <SNIP>
>
> 2.6.14-rt13 is up and running here. Everything looks fine in the first
> couple of hours. Nothing negative to report.

Ditto on an old x86 Netfinity box.

Thanx, Paul

* Re: 2.6.14-rt13
From: K.R. Foley @ 2005-11-16 3:48 UTC
To: Ingo Molnar
Cc: linux-kernel, Paul E. McKenney, Steven Rostedt, Thomas Gleixner,
pluto, john cooper, Benedikt Spranger, Daniel Walker, Tom Rini,
George Anzinger

Ingo Molnar wrote:
> i have released the 2.6.14-rt13 tree, which can be downloaded from the
> usual place:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> lots of fixes in this release affecting all supported architectures, all
> across the board. Big MIPS update from John Cooper.
>
> Changes since 2.6.14-rt1:
>
> - lots of RCU fixes and updates in signal handling and related areas
> (Paul E. McKenney)
>
> - big RCU torture-test update (Paul E. McKenney)
>

In case anyone else makes the same mistake I did. If you are using the
same config from a previous build, you may have RCU_TORTURE_TEST=Y (not
module) and not even know it when running RT patches. You will however
definitely notice it if you use the config to build a non RT kernel like
2.6.15-rc1. The previous RT patch defaulted RCU_TORTURE_TEST=y. By the
way, the fact that I didn't even notice that the torture test was
running with the RT kernel is a true measure of how well things have
progressed. :-)

--
kr

* Re: 2.6.14-rt13
From: Ingo Molnar @ 2005-11-16 8:40 UTC
To: K.R. Foley
Cc: linux-kernel, Paul E. McKenney, Steven Rostedt, Thomas Gleixner,
pluto, john cooper, Benedikt Spranger, Daniel Walker, Tom Rini,
George Anzinger

* K.R. Foley <kr@cybsft.com> wrote:

> > - big RCU torture-test update (Paul E. McKenney)
>
> In case anyone else makes the same mistake I did. If you are using the
> same config from a previous build, you may have RCU_TORTURE_TEST=Y
> (not module) and not even know it when running RT patches. You will
> however definitely notice it if you use the config to build a non RT
> kernel like 2.6.15-rc1. The previous RT patch defaulted
> RCU_TORTURE_TEST=y. By the way, the fact that I didn't even notice
> that the torture test was running with the RT kernel is a true measure
> of how well things have progressed. :-)

yeah - i left it on by default, i usually do that with new debugging
features, to give new code more exposure. In other words, mass
distributed RCU stress-testing by stealth ;-)

I'll make it default-off once the RCU related changes have calmed down.
The rcutorture kernel threads run at nice +19 so they should be barely
noticeable. (except for a sudden and unexplained spike in the world's
power consumption, and the resulting energy crisis ;-)

Ingo

* Re: 2.6.14-rt13
From: Paul E. McKenney @ 2005-11-16 17:02 UTC
To: Ingo Molnar
Cc: K.R. Foley, linux-kernel, Steven Rostedt, Thomas Gleixner, pluto,
john cooper, Benedikt Spranger, Daniel Walker, Tom Rini,
George Anzinger

On Wed, Nov 16, 2005 at 09:40:37AM +0100, Ingo Molnar wrote:
>
> * K.R. Foley <kr@cybsft.com> wrote:
>
> > > - big RCU torture-test update (Paul E. McKenney)
> >
> > In case anyone else makes the same mistake I did. If you are using the
> > same config from a previous build, you may have RCU_TORTURE_TEST=Y
> > (not module) and not even know it when running RT patches. You will
> > however definitely notice it if you use the config to build a non RT
> > kernel like 2.6.15-rc1. The previous RT patch defaulted
> > RCU_TORTURE_TEST=y. By the way, the fact that I didn't even notice
> > that the torture test was running with the RT kernel is a true measure
> > of how well things have progressed. :-)
>
> yeah - i left it on by default, i usually do that with new debugging
> features, to give new code more exposure. In other words, mass
> distributed RCU stress-testing by stealth ;-)

Cool!!!

If anyone sees a printk line starting with "rcutorture:" that includes
the string "!!!", please pass it along accompanied by your config and
what your workload was doing at the time.

Thanx, Paul

> I'll make it default-off once the RCU related changes have calmed down.
> The rcutorture kernel threads run at nice +19 so they should be barely
> noticeable. (except for a sudden and unexplained spike in the world's
> power consumption, and the resulting energy crisis ;-)
>
> Ingo
>
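
Editorial sketch: the pattern Paul describes above (a line that starts with "rcutorture:" and contains "!!!") can be checked mechanically. The helper below is hypothetical and is not part of the kernel or of any patch in this thread. It assumes the caller has already stripped any log-level or timestamp prefix that a log reader may put in front of the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not taken from the kernel or from this thread.
 * Returns true when the line starts with "rcutorture:" and also
 * contains "!!!", the combination Paul asks people to report. */
bool is_rcutorture_failure(const char *line)
{
    static const char prefix[] = "rcutorture:";

    assert(line != NULL);

    /* The prefix must be at the very start of the line. */
    if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
        return false;

    /* "!!!" may appear anywhere after (or within) the rest of the line. */
    return strstr(line, "!!!") != NULL;
}
```

A caller would feed it one kernel log line at a time and collect the lines for which it returns true, together with the config and workload details Paul asks for.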

* Re: 2.6.14-rt13
From: Fernando Lopez-Lezcano @ 2005-11-18 18:02 UTC
To: Ingo Molnar
Cc: nando, linux-kernel, Paul E. McKenney, K.R. Foley, Steven Rostedt,
Thomas Gleixner, pluto, john cooper, Benedikt Spranger, Daniel Walker,
Tom Rini, George Anzinger

On Tue, 2005-11-15 at 10:08 +0100, Ingo Molnar wrote:
> i have released the 2.6.14-rt13 tree, which can be downloaded from the
> usual place:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> lots of fixes in this release affecting all supported architectures, all
> across the board. Big MIPS update from John Cooper.

Hi Ingo, I'm back from the trip and built -rt13 to test on my dual core
Athlons. As I emailed you yesterday off the list it looked good, but I
guess it took longer than usual for things to degrade. This morning I'm
seeing the usual warnings from Jack. And, for the first time in a while,
actual xruns. I'll try your suggestion of booting with idle=poll.

[begin speculation]
You mentioned before that the TSC's from both cpus could drift from each
other over time. Assuming that is the source of timing (I have no idea)
that could explain the behavior of Jack, it gets a reference time from
one of the cpus and then compares that with what it gets from either cpu
depending on where it is running at a given time. If it is the same cpu
all is fine, if it is the other and it has drifted then the warning is
printed.

-- Fernando

* Re: 2.6.14-rt13
From: Lee Revell @ 2005-11-18 21:54 UTC
To: Fernando Lopez-Lezcano
Cc: Ingo Molnar, linux-kernel, Paul E. McKenney, K.R. Foley,
Steven Rostedt, Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

On Fri, 2005-11-18 at 10:02 -0800, Fernando Lopez-Lezcano wrote:
> You mentioned before that the TSC's from both cpus could drift from
> each other over time. Assuming that is the source of timing (I have no
> idea) that could explain the behavior of Jack, it gets a reference
> time from one of the cpus and then compares that with what it gets
> from either cpu depending on where it is running at a given time. If
> it is the same cpu all is fine, if it is the other and it has drifted
> then the warning is printed.

Yes, JACK uses rdtsc() for microsecond resolution timing and assumes
that the TSCs are in sync.

I've asked on this list what a better time source could be and didn't
get any useful responses, people just told me "use gettimeofday()" which
is WAY too slow.

Lee
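
Editorial sketch: the failure mode Fernando and Lee describe above can be illustrated with a toy model. Nothing below reads a real TSC and nothing is taken from JACK. The struct, the function names, the 1000 cycles/us rate and the drift value used in the usage note are all assumptions chosen for illustration. Each CPU's counter is modelled as a fixed rate plus a per-CPU offset standing in for accumulated drift.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one CPU's cycle counter (illustrative only). */
struct cpu_counter {
    int64_t offset_cycles;  /* accumulated drift relative to an ideal counter */
    int64_t cycles_per_us;  /* nominal counter rate */
};

/* Value a raw counter read would return on this CPU at true time t_us. */
int64_t read_counter(const struct cpu_counter *c, int64_t t_us)
{
    return c->offset_cycles + c->cycles_per_us * t_us;
}

/* Elapsed microseconds computed from two raw reads, under the
 * assumption that every CPU's counter agrees with every other. */
int64_t measured_us(int64_t start, int64_t end, int64_t cycles_per_us)
{
    assert(cycles_per_us > 0);
    return (end - start) / cycles_per_us;
}
```

With one counter at offset 0 and another at offset 5000, both at 1000 cycles/us, a true 10 us interval measures as 10 us when both reads happen on the same CPU, but as 15 us or 5 us when the task migrates between the reads, depending on the direction of the migration.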

* Re: 2.6.14-rt13
From: Fernando Lopez-Lezcano @ 2005-11-18 22:05 UTC
To: Lee Revell
Cc: nando, Ingo Molnar, linux-kernel, Paul E. McKenney, K.R. Foley,
Steven Rostedt, Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

On Fri, 2005-11-18 at 16:54 -0500, Lee Revell wrote:
> On Fri, 2005-11-18 at 10:02 -0800, Fernando Lopez-Lezcano wrote:
> > You mentioned before that the TSC's from both cpus could drift from
> > each other over time. Assuming that is the source of timing (I have no
> > idea) that could explain the behavior of Jack, it gets a reference
> > time from one of the cpus and then compares that with what it gets
> > from either cpu depending on where it is running at a given time. If
> > it is the same cpu all is fine, if it is the other and it has drifted
> > then the warning is printed.
>
> Yes, JACK uses rdtsc() for microsecond resolution timing and assumes
> that the TSCs are in sync.
>
> I've asked on this list what a better time source could be and didn't
> get any useful responses, people just told me "use gettimeofday()" which
> is WAY too slow.

Arghhh, at least I take this as a confirmation that the TSCs do drift
and there is no workaround. It currently makes the -rt/Jack combination
not very useful, at least in my tests.

Is there a way to resync the TSCs?

-- Fernando

* Re: 2.6.14-rt13
From: Ingo Molnar @ 2005-11-18 22:07 UTC
To: Fernando Lopez-Lezcano
Cc: Lee Revell, linux-kernel, Paul E. McKenney, K.R. Foley,
Steven Rostedt, Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:

> Arghhh, at least I take this as a confirmation that the TSCs do drift
> and there is no workaround. It currently makes the -rt/Jack
> combination not very useful, at least in my tests.
>
> Is there a way to resync the TSCs?

no reasonable way. Does idle=poll make any difference?

Ingo

* Re: 2.6.14-rt13
From: Lee Revell @ 2005-11-18 22:15 UTC
To: Ingo Molnar
Cc: Fernando Lopez-Lezcano, linux-kernel, Paul E. McKenney, K.R. Foley,
Steven Rostedt, Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

On Fri, 2005-11-18 at 23:07 +0100, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > Arghhh, at least I take this as a confirmation that the TSCs do drift
> > and there is no workaround. It currently makes the -rt/Jack
> > combination not very useful, at least in my tests.
> >
> > Is there a way to resync the TSCs?
>
> no reasonable way. Does idle=poll make any difference?

But JACK itself uses rdtsc() for timing calculations so TSC drift is
invariably fatal.

Lee

* Re: 2.6.14-rt13
From: Steven Rostedt @ 2005-11-18 22:25 UTC
To: Lee Revell
Cc: Ingo Molnar, Fernando Lopez-Lezcano, linux-kernel, Paul E. McKenney,
K.R. Foley, Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

On Fri, 18 Nov 2005, Lee Revell wrote:

> On Fri, 2005-11-18 at 23:07 +0100, Ingo Molnar wrote:
> > * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> >
> > > Arghhh, at least I take this as a confirmation that the TSCs do drift
> > > and there is no workaround. It currently makes the -rt/Jack
> > > combination not very useful, at least in my tests.
> > >
> > > Is there a way to resync the TSCs?
> >
> > no reasonable way. Does idle=poll make any difference?
>
> But JACK itself uses rdtsc() for timing calculations so TSC drift is
> invariably fatal.

Can it simply be pinned to a cpu?

-- Steve

* Re: 2.6.14-rt13
From: Fernando Lopez-Lezcano @ 2005-11-18 23:36 UTC
To: Steven Rostedt
Cc: nando, Lee Revell, Ingo Molnar, linux-kernel, Paul E. McKenney,
K.R. Foley, Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

On Fri, 2005-11-18 at 17:25 -0500, Steven Rostedt wrote:
> On Fri, 18 Nov 2005, Lee Revell wrote:
>
> > On Fri, 2005-11-18 at 23:07 +0100, Ingo Molnar wrote:
> > > * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> > >
> > > > Arghhh, at least I take this as a confirmation that the TSCs do drift
> > > > and there is no workaround. It currently makes the -rt/Jack
> > > > combination not very useful, at least in my tests.
> > > >
> > > > Is there a way to resync the TSCs?
> > >
> > > no reasonable way. Does idle=poll make any difference?
> >
> > But JACK itself uses rdtsc() for timing calculations so TSC drift is
> > invariably fatal.
>
> Can it simply be pinned to a cpu?

Is there a way to know in which cpu a process is running? At least Jack
could ignore timing issues if the measurement is going to happen in a
different cpu than the one where the original timestamp was collected.

-- Fernando

* Re: 2.6.14-rt13
From: Steven Rostedt @ 2005-11-18 23:57 UTC
To: Fernando Lopez-Lezcano
Cc: Lee Revell, Ingo Molnar, linux-kernel, Paul E. McKenney, K.R. Foley,
Thomas Gleixner, pluto, john cooper, Benedikt Spranger, Daniel Walker,
Tom Rini, George Anzinger

On Fri, 18 Nov 2005, Fernando Lopez-Lezcano wrote:

> > Can it simply be pinned to a cpu?
>
> Is there a way to know in which cpu a process is running? At least Jack
> could ignore timing issues if the measurement is going to happen in a
> different cpu than the one where the original timestamp was collected.
>

Simple answer? No. At least not meaningfully. If you do:

	cpu = fictitious_get_my_cpu();
	if (cpu == last_cpu()) {
		rdtsc(oldtime);
		...
	}

There's no guarantee that jack doesn't switch cpu's from when it found
out what CPU it was on to doing the calculation.

So it would be easier to pin it. (apt-get schedutils)

	man 1 taskset

or if you modify the code:

	man 2 sched_setaffinity

-- Steve
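
Editorial sketch: the "modify the code" route Steven points to above could look roughly like the following on Linux with glibc. The function name pin_to_cpu is hypothetical. The calls it uses (sched_setaffinity and the CPU_ZERO/CPU_SET macros) are the ones documented in the sched_setaffinity man page he mentions. This is a minimal illustration, not what JACK does.

```c
#define _GNU_SOURCE  /* must come before any include to expose CPU_SET and friends */
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: pin the calling thread to a single CPU so that
 * successive cycle-counter reads all come from the same counter.
 * Returns 0 on success, -1 on error (errno is set by sched_setaffinity). */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Calling pin_to_cpu() once at startup, before the first timestamp is taken, avoids the race Steven describes, because the scheduler can no longer migrate the thread between the CPU check and the counter read. The same effect is available without code changes through taskset.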

* Re: 2.6.14-rt13
From: Fernando Lopez-Lezcano @ 2005-11-18 22:41 UTC
To: Ingo Molnar
Cc: Lee Revell, linux-kernel, Paul E. McKenney, K.R. Foley,
Steven Rostedt, Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

On Fri, 2005-11-18 at 23:07 +0100, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > Arghhh, at least I take this as a confirmation that the TSCs do drift
> > and there is no workaround. It currently makes the -rt/Jack
> > combination not very useful, at least in my tests.
> >
> > Is there a way to resync the TSCs?
>
> no reasonable way. Does idle=poll make any difference?

I don't know yet, and I may never know :-) I've been running it for a
while and so far works but that's what I thought yesterday of -rt13. It
is not practical for normal use, it just heats the cpu unnecessarily and
there's no way to control it other than a reboot.

I'll keep my machine running like this till I go home later.

-- Fernando

* Re: 2.6.14-rt13
From: Steven Rostedt @ 2005-11-19 2:39 UTC
To: Fernando Lopez-Lezcano
Cc: Ingo Molnar, Lee Revell, linux-kernel, Paul E. McKenney, K.R. Foley,
Thomas Gleixner, pluto, john cooper, Benedikt Spranger, Daniel Walker,
Tom Rini, George Anzinger

On Fri, 2005-11-18 at 14:41 -0800, Fernando Lopez-Lezcano wrote:
> On Fri, 2005-11-18 at 23:07 +0100, Ingo Molnar wrote:
> > * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> >
> > > Arghhh, at least I take this as a confirmation that the TSCs do drift
> > > and there is no workaround. It currently makes the -rt/Jack
> > > combination not very useful, at least in my tests.
> > >
> > > Is there a way to resync the TSCs?
> >
> > no reasonable way. Does idle=poll make any difference?
>
> I don't know yet, and I may never know :-) I've been running it for a
> while and so far works but that's what I thought yesterday of -rt13. It
> is not practical for normal use, it just heats the cpu unnecessarily and
> there's no way to control it other than a reboot.

Not anymore!

OK, I used this as an exercise to learn how kobject and sysfs work (I've
been putting this off for too long). So if this isn't exactly proper,
let me know :-)

Ingo,

This could be a temporary patch until we come up with a better solution.
This adds /sys/kernel/idle/idle_poll, which if idle=poll is _not_ set,
it still lets you switch the machine to idle=poll on the fly, as well as
turn it off. If you have idle=poll, this doesn't even show up.

So for example (I'm currently running it):

# cat /sys/kernel/idle/idle_poll
off
# echo 1 > /sys/kernel/idle/idle_poll
# cat /sys/kernel/idle/idle_poll
on
# echo 0 > /sys/kernel/idle/idle_poll
# cat /sys/kernel/idle/idle_poll
off

# echo on > /sys/kernel/idle/idle_poll
and
# echo off > /sys/kernel/idle/idle_poll
also work.

So like I said. This could be used for just those that need to have
idle=poll for running benchmarks but don't want to reboot when they are
done.

-- Steve

PS. I haven't tested to see if the idle actually changes, but it looks
pretty obvious in the code in cpu_idle:

		idle = pm_idle;

		if (!idle)
			idle = default_idle;

		if (cpu_is_offline(smp_processor_id()))
			play_dead();

		stop_critical_timing();
		propagate_preempt_locks_value();
		idle();

Index: linux-2.6.14-rt13/arch/x86_64/kernel/process.c
===================================================================
--- linux-2.6.14-rt13.orig/arch/x86_64/kernel/process.c	2005-11-15 11:12:37.000000000 -0500
+++ linux-2.6.14-rt13/arch/x86_64/kernel/process.c	2005-11-18 21:12:53.000000000 -0500
@@ -822,3 +822,104 @@
 		sp -= get_random_int() % 8192;
 	return sp & ~0xf;
 }
+
+#ifdef CONFIG_SYSFS
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/spinlock.h>
+
+#define KERNEL_ATTR_RW(_name) \
+static struct subsys_attribute _name##_attr = \
+	__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static spinlock_t idle_switch_lock = SPIN_LOCK_UNLOCKED(idle_switch_lock);
+
+static struct idlep_kobject
+{
+	struct kobject kobj;
+	int is_poll;
+	void (*idle)(void);
+} idle_kobj;
+
+static ssize_t idle_poll_show(struct subsystem *subsys, char *page)
+{
+	return sprintf(page, "%s\n", (idle_kobj.is_poll ? "on" : "off"));
+}
+
+static ssize_t idle_poll_store(struct subsystem *subsys,
+			       const char *buf, size_t len)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&idle_switch_lock, flags);
+
+	if (strncmp(buf,"1",1)==0 ||
+	    (len >=2 && strncmp(buf,"on",2)==0)) {
+		if (idle_kobj.is_poll != 1) {
+			idle_kobj.is_poll = 1;
+			pm_idle = poll_idle;
+		}
+	} else if (strncmp(buf,"0",1)==0 ||
+		   (len >= 3 && strncmp(buf,"off",3)==0)) {
+		if (idle_kobj.is_poll != 0) {
+			idle_kobj.is_poll = 0;
+			pm_idle = idle_kobj.idle;
+		}
+	}
+
+	spin_unlock_irqrestore(&idle_switch_lock, flags);
+
+	return len;
+}
+
+
+KERNEL_ATTR_RW(idle_poll);
+
+static struct attribute * idle_attrs[] = {
+	&idle_poll_attr.attr,
+	NULL
+};
+
+static struct attribute_group idle_attr_group = {
+	.attrs = idle_attrs,
+};
+
+static int __init idle_poll_set_init(void)
+{
+	int err;
+
+	/*
+	 * If the default is already poll_idle then
+	 * don't even bother with this.
+	 */
+	if (pm_idle == poll_idle)
+		return 0;
+
+	memset(&idle_kobj, 0, sizeof(idle_kobj));
+
+	idle_kobj.is_poll = 0;
+	idle_kobj.idle = pm_idle;
+
+	err = kobject_set_name(&idle_kobj.kobj, "%s", "idle");
+	if (err)
+		goto out;
+
+	idle_kobj.kobj.parent = &kernel_subsys.kset.kobj;
+	err = kobject_register(&idle_kobj.kobj);
+	if (err)
+		goto out;
+
+	err = sysfs_create_group(&idle_kobj.kobj,
+				 &idle_attr_group);
+	if (err)
+		goto out;
+
+	return 0;
+out:
+	printk(KERN_INFO "Problem setting up sysfs idle_poll\n");
+	return 0;
+}
+
+late_initcall(idle_poll_set_init);
+#endif /* CONFIG_SYSFS */
+

* Re: 2.6.14-rt13
From: Ingo Molnar @ 2005-11-24 15:07 UTC
To: Steven Rostedt
Cc: Fernando Lopez-Lezcano, Lee Revell, linux-kernel, Paul E. McKenney,
K.R. Foley, Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

* Steven Rostedt <rostedt@goodmis.org> wrote:

> OK, I used this as an exercise to learn how kobject and sysfs work
> (I've been putting this off for too long). So if this isn't exactly
> proper, let me know :-)
>
> Ingo, This could be a temporary patch until we come up with a better
> solution. This adds /sys/kernel/idle/idle_poll, which if idle=poll is
> _not_ set, it still lets you switch the machine to idle=poll on the
> fly, as well as turn it off. If you have idle=poll, this doesn't even
> show up.

ok, i've applied this one too. Could you also submit it upstream (and
implement it for x86)? It makes sense to enable/disable the
polling-based idle routine runtime.

Ingo
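
Editorial sketch: the input handling in the idle_poll_store() handler from Steven's patch (the message Ingo is replying to here) can be mirrored in user space to see which strings it accepts. The function parse_idle_poll is hypothetical and is not kernel code. It follows the same prefix comparisons as the patch, with one assumed difference: it also checks len before looking at the first byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* User-space sketch of the comparisons made in idle_poll_store().
 * Returns 1 for "switch to polling idle", 0 for "switch back", and -1
 * when the input matches neither, in which case the kernel handler
 * leaves the current state unchanged. Only a prefix is compared, so the
 * trailing newline that echo appends is accepted, and so is any other
 * trailing text. */
int parse_idle_poll(const char *buf, size_t len)
{
    assert(buf != NULL);

    if ((len >= 1 && buf[0] == '1') ||
        (len >= 2 && strncmp(buf, "on", 2) == 0))
        return 1;

    if ((len >= 1 && buf[0] == '0') ||
        (len >= 3 && strncmp(buf, "off", 3) == 0))
        return 0;

    return -1;
}
```

Because only prefixes are compared, an input such as "10" would be treated like "1". Whether stricter matching is preferable is a design choice the thread does not discuss.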

* Re: 2.6.14-rt13
From: Steven Rostedt @ 2005-11-24 15:21 UTC
To: Ingo Molnar
Cc: Fernando Lopez-Lezcano, Lee Revell, linux-kernel, Paul E. McKenney,
K.R. Foley, Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

On Thu, 2005-11-24 at 16:07 +0100, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > OK, I used this as an exercise to learn how kobject and sysfs work
> > (I've been putting this off for too long). So if this isn't exactly
> > proper, let me know :-)
> >
> > Ingo, This could be a temporary patch until we come up with a better
> > solution. This adds /sys/kernel/idle/idle_poll, which if idle=poll is
> > _not_ set, it still lets you switch the machine to idle=poll on the
> > fly, as well as turn it off. If you have idle=poll, this doesn't even
> > show up.
>
> ok, i've applied this one too. Could you also submit it upstream (and
> implement it for x86)? It makes sense to enable/disable the
> polling-based idle routine runtime.

OK, it'll have to wait till tomorrow. As you probably know, it is
Thanksgiving here in the US. And my wife would kill me if I work today
;-)

-- Steve
* [RFC][PATCH] Runtime switching to idle_poll (was: Re: 2.6.14-rt13) 2005-11-24 15:07 ` 2.6.14-rt13 Ingo Molnar 2005-11-24 15:21 ` 2.6.14-rt13 Steven Rostedt @ 2005-11-25 20:56 ` Steven Rostedt 2005-11-26 13:05 ` Ingo Molnar 1 sibling, 1 reply; 65+ messages in thread From: Steven Rostedt @ 2005-11-25 20:56 UTC (permalink / raw) To: Ingo Molnar Cc: acpi-devel, len.brown, Andrew Morton, Fernando Lopez-Lezcano, Lee Revell, linux-kernel, Paul E. McKenney, K.R. Foley, Thomas Gleixner, pluto, john cooper, Benedikt Spranger, Daniel Walker, Tom Rini, George Anzinger On Thu, 2005-11-24 at 16:07 +0100, Ingo Molnar wrote: > * Steven Rostedt <rostedt@goodmis.org> wrote: > > Ingo, This could be a temporary patch until we come up with a better > > solution. This adds /sys/kernel/idle/idle_poll, which if idle=poll is > > _not_ set, it still lets you switch the machine to idle=poll on the > > fly, as well as turn it off. If you have idle=poll, this doesn't even > > show up. > > ok, i've applied this one too. Could you also submit it upstream (and > implement it for x86)? It makes sense to enable/disable the > polling-based idle routine runtime. As a request from Ingo, I fixed up this patch a little to allow both x86_64 and i386 to switch to and from idle_poll at runtime. I noticed that the APCI driver in drivers/acpi/processor_idle.c may cause some race condition with this patch so I added some protection there. Basically, if the acpi code changes pm_idle, then you can't change to idle_poll, and vice-versa. What this patch does is creates an entry into /sys/kernel/idle/idle_poll. It will show whether or not the idle_poll is being used as a runtime idle routine. It is also used to set the runtime idle. with: # echo 1 > /sys/kernel/idle/idle_poll or # echo on > /sys/kernel/idle/idle_poll The system will switch to the idle_poll idle routine. with: # echo 0 > /sys/kernel/idle/idle_poll or # echo off > /sys/kernel/idle/idle_poll The system will switch out of idle poll. 
Note that if the command line states "idle=poll" then this will not be implemented. This is still a work-in-progress. Since I only own a x86_64 and i386 that is all I ported the code for and tested. Looking for who else exports pm_idle I see that the following archs may also need to be updated: arm, arm26, i64, sparc. I also have not yet protected the pm_idle in arch/i386/kernel/apm.c I figure that I should get some comments before I spend any more time on this. Thanks, -- Steve Index: linux-2.6.15-rc2-git5/arch/i386/kernel/Makefile =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/i386/kernel/Makefile 2005-10-27 20:02:08.000000000 -0400 +++ linux-2.6.15-rc2-git5/arch/i386/kernel/Makefile 2005-11-25 11:56:25.000000000 -0500 @@ -34,6 +34,7 @@ obj-$(CONFIG_HPET_TIMER) += time_hpet.o obj-$(CONFIG_EFI) += efi.o efi_stub.o obj-$(CONFIG_EARLY_PRINTK) += early_printk.o +obj-$(CONFIG_SYSFS) += switch2poll.o EXTRA_AFLAGS := -traditional Index: linux-2.6.15-rc2-git5/arch/i386/kernel/process.c =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/i386/kernel/process.c 2005-11-25 10:58:53.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/i386/kernel/process.c 2005-11-25 12:18:12.000000000 -0500 @@ -39,6 +39,7 @@ #include <linux/ptrace.h> #include <linux/random.h> #include <linux/kprobes.h> +#include <linux/spinlock.h> #include <asm/uaccess.h> #include <asm/pgtable.h> @@ -64,6 +65,12 @@ unsigned long boot_option_idle_override = 0; EXPORT_SYMBOL(boot_option_idle_override); +spinlock_t pm_idle_switch_lock = SPIN_LOCK_UNLOCKED; +EXPORT_SYMBOL(pm_idle_switch_lock); + +int pm_idle_locked = 0; +EXPORT_SYMBOL(pm_idle_locked); + /* * Return saved PC of a blocked thread. */ @@ -126,7 +133,7 @@ * to poll the ->work.need_resched flag instead of waiting for the * cross-CPU IPI to arrive. Use this option with caution. 
  */
-static void poll_idle (void)
+void poll_idle (void)
 {
 	local_irq_enable();

Index: linux-2.6.15-rc2-git5/arch/i386/kernel/switch2poll.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc2-git5/arch/i386/kernel/switch2poll.c 2005-11-25 11:55:19.000000000 -0500
@@ -0,0 +1,5 @@
+/*
+ * Same type of hack used for early_printk. This keeps the code
+ * in one place.
+ */
+#include "../../x86_64/kernel/switch2poll.c"
Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/Makefile
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/x86_64/kernel/Makefile 2005-11-22 12:13:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/Makefile 2005-11-25 11:56:40.000000000 -0500
@@ -30,6 +30,7 @@
 obj-$(CONFIG_DUMMY_IOMMU) += pci-nommu.o pci-dma.o
 obj-$(CONFIG_KPROBES) += kprobes.o
 obj-$(CONFIG_X86_PM_TIMER) += pmtimer.o
+obj-$(CONFIG_SYSFS) += switch2poll.o

 obj-$(CONFIG_MODULES) += module.o

Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/process.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/x86_64/kernel/process.c 2005-11-25 10:58:53.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/process.c 2005-11-25 12:17:53.000000000 -0500
@@ -36,6 +36,7 @@
 #include <linux/utsname.h>
 #include <linux/random.h>
 #include <linux/kprobes.h>
+#include <linux/spinlock.h>

 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -60,6 +61,12 @@
 unsigned long boot_option_idle_override = 0;
 EXPORT_SYMBOL(boot_option_idle_override);

+spinlock_t pm_idle_switch_lock = SPIN_LOCK_UNLOCKED;
+EXPORT_SYMBOL(pm_idle_switch_lock);
+
+int pm_idle_locked = 0;
+EXPORT_SYMBOL(pm_idle_locked);
+
 /*
  * Powermanagement idle function, if any..
  */
@@ -110,7 +117,7 @@
  * to poll the ->need_resched flag instead of waiting for the
  * cross-CPU IPI to arrive. Use this option with caution.
  */
-static void poll_idle (void)
+void poll_idle (void)
 {
 	local_irq_enable();

Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/switch2poll.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/switch2poll.c 2005-11-25 12:23:22.000000000 -0500
@@ -0,0 +1,112 @@
+#include <linux/module.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/spinlock.h>
+#include <linux/pm.h>
+
+extern void poll_idle (void);
+
+#define KERNEL_ATTR_RW(_name) \
+static struct subsys_attribute _name##_attr = \
+	__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static struct idlep_kobject
+{
+	struct kobject kobj;
+	int is_poll;
+	void (*idle)(void);
+} idle_kobj;
+
+static ssize_t idle_poll_show(struct subsystem *subsys, char *page)
+{
+	return sprintf(page, "%s\n", (idle_kobj.is_poll ? "on" : "off"));
+}
+
+static ssize_t idle_poll_store(struct subsystem *subsys,
+			       const char *buf, size_t len)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&pm_idle_switch_lock, flags);
+
+	/*
+	 * If power management is handling the idle function,
+	 * then leave it be.
+	 */
+	if (pm_idle_locked) {
+		len = -EBUSY;
+		goto out;
+	}
+
+	if (strncmp(buf,"1",1)==0 ||
+	    (len >=2 && strncmp(buf,"on",2)==0)) {
+		if (idle_kobj.is_poll != 1) {
+			idle_kobj.is_poll = 1;
+			boot_option_idle_override = 1;
+			idle_kobj.idle = pm_idle;
+			pm_idle = poll_idle;
+		}
+	} else if (strncmp(buf,"0",1)==0 ||
+		   (len >= 3 && strncmp(buf,"off",3)==0)) {
+		if (idle_kobj.is_poll != 0) {
+			boot_option_idle_override = 0;
+			idle_kobj.is_poll = 0;
+			pm_idle = idle_kobj.idle;
+		}
+	}
+
+out:
+	spin_unlock_irqrestore(&pm_idle_switch_lock, flags);
+
+	return len;
+}
+
+
+KERNEL_ATTR_RW(idle_poll);
+
+static struct attribute * idle_attrs[] = {
+	&idle_poll_attr.attr,
+	NULL
+};
+
+static struct attribute_group idle_attr_group = {
+	.attrs = idle_attrs,
+};
+
+static int __init idle_poll_set_init(void)
+{
+	int err;
+
+	/*
+	 * If the default is alread poll_idle then
+	 * don't even bother with this.
+	 */
+	if (pm_idle == poll_idle)
+		return 0;
+
+	memset(&idle_kobj, 0, sizeof(idle_kobj));
+
+	idle_kobj.is_poll = 0;
+	idle_kobj.idle = pm_idle;
+
+	err = kobject_set_name(&idle_kobj.kobj, "%s", "idle");
+	if (err)
+		goto out;
+
+	idle_kobj.kobj.parent = &kernel_subsys.kset.kobj;
+	err = kobject_register(&idle_kobj.kobj);
+	if (err)
+		goto out;
+
+	err = sysfs_create_group(&idle_kobj.kobj,
+				 &idle_attr_group);
+	if (err)
+		goto out;
+
+	return 0;
+out:
+	printk(KERN_INFO "Problem setting up sysfs idle_poll\n");
+	return 0;
+}
+
+late_initcall(idle_poll_set_init);
Index: linux-2.6.15-rc2-git5/drivers/acpi/processor_idle.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/drivers/acpi/processor_idle.c 2005-11-22 12:13:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/drivers/acpi/processor_idle.c 2005-11-25 13:15:59.000000000 -0500
@@ -38,6 +38,7 @@
 #include <linux/dmi.h>
 #include <linux/moduleparam.h>
 #include <linux/sched.h> /* need_resched() */
+#include <linux/spinlock.h>

 #include <asm/io.h>
 #include <asm/uaccess.h>
@@ -990,6 +991,7 @@
 	static int first_run = 0;
 	struct proc_dir_entry *entry = NULL;
 	unsigned int i;
+	unsigned long flags;

 	ACPI_FUNCTION_TRACE("acpi_processor_power_init");

@@ -1023,6 +1025,7 @@
 	 * Note that we use previously set idle handler will be used on
 	 * platforms that only support C1.
 	 */
+	spin_lock_irqsave(&pm_idle_switch_lock, flags);
 	if ((pr->flags.power) && (!boot_option_idle_override)) {
 		printk(KERN_INFO PREFIX "CPU%d (power states:", pr->id);
 		for (i = 1; i <= pr->power.count; i++)
@@ -1034,8 +1037,13 @@
 		if (pr->id == 0) {
 			pm_idle_save = pm_idle;
 			pm_idle = acpi_processor_idle;
+			/*
+			 * Don't allow switching of the pm_idle to poll.
+			 */
+			pm_idle_locked = 1;
 		}
 	}
+	spin_unlock_irqrestore(&pm_idle_switch_lock, flags);

 	/* 'power' [R] */
 	entry = create_proc_entry(ACPI_PROCESSOR_FILE_POWER,
@@ -1078,5 +1086,7 @@
 		cpu_idle_wait();
 	}

+	pm_idle_locked = 0;
+
 	return_VALUE(0);
 }
Index: linux-2.6.15-rc2-git5/include/linux/pm.h
===================================================================
--- linux-2.6.15-rc2-git5.orig/include/linux/pm.h 2005-11-25 12:05:33.000000000 -0500
+++ linux-2.6.15-rc2-git5/include/linux/pm.h 2005-11-25 12:17:17.000000000 -0500
@@ -25,6 +25,7 @@

 #include <linux/config.h>
 #include <linux/list.h>
+#include <linux/spinlock.h>
 #include <asm/atomic.h>

 /*
@@ -102,6 +103,8 @@
  */
 extern void (*pm_idle)(void);
 extern void (*pm_power_off)(void);
+extern spinlock_t pm_idle_switch_lock;
+extern int pm_idle_locked;

 typedef int __bitwise suspend_state_t;

^ permalink raw reply [flat|nested] 65+ messages in thread
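For readers following the thread, the input matching done by
idle_poll_store() in the patch above can be tried out in userspace. This
is only a sketch and not part of the patch: parse_idle_poll() and the
idle_poll_req enum are hypothetical names, and the explicit len >= 1
checks are an addition (the patch calls strncmp(buf, "1", 1) without
them).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type; the patch itself just updates idle_kobj.is_poll. */
enum idle_poll_req {
	IDLE_POLL_IGNORED = -1,	/* input matched neither form */
	IDLE_POLL_OFF = 0,
	IDLE_POLL_ON = 1
};

/*
 * Classify a write to the idle_poll file using the same prefix
 * comparisons as the patch: "1"/"on" switch polling on, "0"/"off"
 * switch it off. Only the leading characters are compared.
 */
static enum idle_poll_req parse_idle_poll(const char *buf, size_t len)
{
	assert(buf != NULL);

	if ((len >= 1 && strncmp(buf, "1", 1) == 0) ||
	    (len >= 2 && strncmp(buf, "on", 2) == 0))
		return IDLE_POLL_ON;

	if ((len >= 1 && strncmp(buf, "0", 1) == 0) ||
	    (len >= 3 && strncmp(buf, "off", 3) == 0))
		return IDLE_POLL_OFF;

	return IDLE_POLL_IGNORED;
}
```

Because only a prefix is compared, the trailing newline that echo
appends is harmless, but longer strings such as "online" are accepted as
well. In the patch as posted, input that matches neither form still
returns len, so such a write succeeds without changing anything.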
* Re: [RFC][PATCH] Runtime switching to idle_poll (was: Re: 2.6.14-rt13)
2005-11-25 20:56 ` [RFC][PATCH] Runtime switching to idle_poll (was: Re: 2.6.14-rt13) Steven Rostedt
@ 2005-11-26 13:05 ` Ingo Molnar
2005-11-29 2:48 ` [RFC][PATCH] Runtime switching of the idle function [take 2] Steven Rostedt
0 siblings, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2005-11-26 13:05 UTC (permalink / raw)
To: Steven Rostedt
Cc: acpi-devel, len.brown, Andrew Morton, Fernando Lopez-Lezcano,
Lee Revell, linux-kernel, Paul E. McKenney, K.R. Foley,
Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

* Steven Rostedt <rostedt@goodmis.org> wrote:

> As a request from Ingo, I fixed up this patch a little to allow both
> x86_64 and i386 to switch to and from idle_poll at runtime. I noticed
> that the ACPI driver in drivers/acpi/processor_idle.c may cause some
> race condition with this patch so I added some protection there.
> Basically, if the acpi code changes pm_idle, then you can't change to
> idle_poll, and vice-versa.
>
> What this patch does is creates an entry into
> /sys/kernel/idle/idle_poll. It will show whether or not the idle_poll
> is being used as a runtime idle routine. It is also used to set the
> runtime idle.
>
> with:
>
> # echo 1 > /sys/kernel/idle/idle_poll
> or
> # echo on > /sys/kernel/idle/idle_poll

find some minor cleanups below.

a more general question is, shouldn't the configuration method rather be
something like:

echo idle > /sys/kernel/idle

and there could also be a /sys/kernel/idle_methods which would enumerate
all the strings that are possible? This way we'd not hardcode
'idle-poll' in any way.
Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 arch/i386/kernel/process.c   |    6 +++---
 arch/x86_64/kernel/process.c |    6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -65,11 +65,11 @@ static int hlt_counter;
 unsigned long boot_option_idle_override = 0;
 EXPORT_SYMBOL(boot_option_idle_override);

-spinlock_t pm_idle_switch_lock = SPIN_LOCK_UNLOCKED;
-EXPORT_SYMBOL(pm_idle_switch_lock);
+DEFINE_SPINLOCK(pm_idle_switch_lock);
+EXPORT_SYMBOL_GPL(pm_idle_switch_lock);

 int pm_idle_locked = 0;
-EXPORT_SYMBOL(pm_idle_locked);
+EXPORT_SYMBOL_GPL(pm_idle_locked);

 /*
  * Return saved PC of a blocked thread.
Index: linux/arch/x86_64/kernel/process.c
===================================================================
--- linux.orig/arch/x86_64/kernel/process.c
+++ linux/arch/x86_64/kernel/process.c
@@ -61,11 +61,11 @@ static atomic_t hlt_counter = ATOMIC_INI
 unsigned long boot_option_idle_override = 0;
 EXPORT_SYMBOL(boot_option_idle_override);

-spinlock_t pm_idle_switch_lock = SPIN_LOCK_UNLOCKED;
-EXPORT_SYMBOL(pm_idle_switch_lock);
+DEFINE_SPINLOCK(pm_idle_switch_lock);
+EXPORT_SYMBOL_GPL(pm_idle_switch_lock);

 int pm_idle_locked = 0;
-EXPORT_SYMBOL(pm_idle_locked);
+EXPORT_SYMBOL_GPL(pm_idle_locked);

 /*
  * Powermanagement idle function, if any..

^ permalink raw reply [flat|nested] 65+ messages in thread
* [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-26 13:05 ` Ingo Molnar
@ 2005-11-29 2:48 ` Steven Rostedt
2005-11-29 3:02 ` Andrew Morton
2005-11-29 13:08 ` Pavel Machek
0 siblings, 2 replies; 65+ messages in thread
From: Steven Rostedt @ 2005-11-29 2:48 UTC (permalink / raw)
To: Ingo Molnar
Cc: acpi-devel, len.brown, Andrew Morton, Fernando Lopez-Lezcano,
Lee Revell, linux-kernel, Paul E. McKenney, K.R. Foley,
Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

Here's an update on the switching of the idle function. As Ingo has
suggested, I removed this from being specific to the poll_idle function.

Description:

This patch creates a directory in /sys/kernel called idle. This
directory contains two files: idle_ctrl and idle_methods. Reading
idle_ctrl will show the function that is currently being used for idle,
and idle_methods shows the available methods that the user can write
into idle_ctrl to change which function to use for idle.

If the freeze attribute is set for an idle function (defined in the
idle_info struct explained below), then the user cannot add or remove
that function. This is used by the acpi since I wasn't sure how it
would handle having that function added or removed dynamically.
Functions that are frozen are shown in the idle_methods (and idle_ctrl
when used) with an asterisk (*) in front of the name.

I moved the code from arch/x86_64 to outside the arch directories into
kernel. The file is called idle.c. This implements functions to
register idle and unregister idle. It also has the functions to set
which idle to use. This file also creates the entries into the sysfs
directory.

Currently this is only compiled for i386, x86_64, and ia64. Since I
only have i386 and x86_64, I was only able to test the changes in those
two archs. I modified ia64, but haven't even tried to compile it.
If someone with that arch would like to do me the favor, please do ;-)

I've created an idle_info structure that is used to register the idle
functions. This is now how acpi adds its functions.

struct idle_info {
	struct list_head list;	/* used to link in with all other registered */
	const char *name;	/* name to be used to add as well as to show */
	idlefunc_t func;	/* the function to be called for idle */
	int freeze;		/* set to disallow the user from adding or removing it */
	int inuse;		/* set when being used as the idle function */
};

This is a much more robust way of handling changes of the idle function
and can easily be adapted to other archs that would like to also
implement dynamic changes of the idle function. This would be nice to
add to sparc (hint hint).

Here's the patch:

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-2.6.15-rc2-git5/arch/i386/kernel/process.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/i386/kernel/process.c 2005-11-28 19:59:34.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/i386/kernel/process.c 2005-11-28 20:30:51.000000000 -0500
@@ -39,6 +39,7 @@
 #include <linux/ptrace.h>
 #include <linux/random.h>
 #include <linux/kprobes.h>
+#include <linux/idle.h>

 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -72,11 +73,6 @@
 	return ((unsigned long *)tsk->thread.esp)[3];
 }

-/*
- * Powermanagement idle function, if any..
- */
-void (*pm_idle)(void);
-EXPORT_SYMBOL(pm_idle);
 static DEFINE_PER_CPU(unsigned int, cpu_idle_state);

 void disable_hlt(void)
@@ -185,7 +181,7 @@
 				__get_cpu_var(cpu_idle_state) = 0;

 			rmb();
-			idle = pm_idle;
+			idle = idle_func;

 			if (!idle)
 				idle = default_idle;
@@ -230,6 +226,8 @@
 }
 EXPORT_SYMBOL_GPL(cpu_idle_wait);

+static struct idle_info idle_mwait;
+
 /*
  * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
  * which can obviate IPI to trigger checking of need_resched.
@@ -258,25 +256,62 @@
 		 * Skip, if setup has overridden idle.
* One CPU supports mwait => All CPUs supports mwait */ - if (!pm_idle) { + memset(&idle_mwait, 0, sizeof(idle_mwait)); + idle_mwait.name = "mwait"; + idle_mwait.func = mwait_idle; + register_idle(&idle_mwait); + + if (!idle_func) { printk("using mwait in idle threads.\n"); - pm_idle = mwait_idle; + set_idle("mwait"); } } } +static struct idle_info idle_default; +static struct idle_info idle_poll; + +static int __init add_idle(void) +{ + static int set; + + if (set) + return 0; + set = 1; + + memset(&idle_poll, 0, sizeof(idle_poll)); + idle_poll.name = "poll"; + idle_poll.func = poll_idle; + register_idle(&idle_poll); + + /* + * Allow the user to switch out of poll_idle even + * if it was a boot option. + */ + memset(&idle_default, 0, sizeof(idle_default)); + idle_default.name = "default"; + idle_default.func = default_idle; + register_idle(&idle_default); + + return 0; +} + +arch_initcall(add_idle); + static int __init idle_setup (char *str) { + add_idle(); if (!strncmp(str, "poll", 4)) { printk("using polling idle threads.\n"); - pm_idle = poll_idle; + set_idle("poll"); + #ifdef CONFIG_X86_SMP if (smp_num_siblings > 1) printk("WARNING: polling idle and HT enabled, performance may degrade.\n"); #endif } else if (!strncmp(str, "halt", 4)) { printk("using halt in idle threads.\n"); - pm_idle = default_idle; + set_idle("default"); } boot_option_idle_override = 1; Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/process.c =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/x86_64/kernel/process.c 2005-11-28 19:59:34.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/process.c 2005-11-28 20:30:21.000000000 -0500 @@ -36,6 +36,8 @@ #include <linux/utsname.h> #include <linux/random.h> #include <linux/kprobes.h> +#include <linux/spinlock.h> +#include <linux/idle.h> #include <asm/uaccess.h> #include <asm/pgtable.h> @@ -60,10 +62,6 @@ unsigned long boot_option_idle_override = 0; 
EXPORT_SYMBOL(boot_option_idle_override); -/* - * Powermanagement idle function, if any.. - */ -void (*pm_idle)(void); static DEFINE_PER_CPU(unsigned int, cpu_idle_state); void disable_hlt(void) @@ -195,7 +193,7 @@ __get_cpu_var(cpu_idle_state) = 0; rmb(); - idle = pm_idle; + idle = idle_func; if (!idle) idle = default_idle; if (cpu_is_offline(smp_processor_id())) @@ -209,6 +207,8 @@ } } +struct idle_info idle_mwait; + /* * This uses new MONITOR/MWAIT instructions on P4 processors with PNI, * which can obviate IPI to trigger checking of need_resched. @@ -233,25 +233,61 @@ { static int printed; if (cpu_has(c, X86_FEATURE_MWAIT)) { + memset(&idle_mwait, 0, sizeof(idle_mwait)); + idle_mwait.name = "mwait"; + idle_mwait.func = mwait_idle; + register_idle(&idle_mwait); + /* * Skip, if setup has overridden idle. * One CPU supports mwait => All CPUs supports mwait */ - if (!pm_idle) { + if (!idle_func) { if (!printed) { printk("using mwait in idle threads.\n"); printed = 1; } - pm_idle = mwait_idle; + set_idle("mwait"); } } } +static struct idle_info idle_default; +static struct idle_info idle_poll; + +static int __init add_idle(void) +{ + static int set; + + if (set) + return 0; + set = 1; + + memset(&idle_poll, 0, sizeof(idle_poll)); + idle_poll.name = "poll"; + idle_poll.func = poll_idle; + register_idle(&idle_poll); + + /* + * Allow the user to switch out of poll_idle even + * if it was a boot option. 
+ */ + memset(&idle_default, 0, sizeof(idle_default)); + idle_default.name = "default"; + idle_default.func = default_idle; + register_idle(&idle_default); + + return 0; +} +arch_initcall(add_idle); + static int __init idle_setup (char *str) { + add_idle(); + if (!strncmp(str, "poll", 4)) { printk("using polling idle threads.\n"); - pm_idle = poll_idle; + set_idle("poll"); } boot_option_idle_override = 1; Index: linux-2.6.15-rc2-git5/drivers/acpi/processor_idle.c =================================================================== --- linux-2.6.15-rc2-git5.orig/drivers/acpi/processor_idle.c 2005-11-28 19:59:34.000000000 -0500 +++ linux-2.6.15-rc2-git5/drivers/acpi/processor_idle.c 2005-11-28 19:59:42.000000000 -0500 @@ -38,6 +38,8 @@ #include <linux/dmi.h> #include <linux/moduleparam.h> #include <linux/sched.h> /* need_resched() */ +#include <linux/spinlock.h> +#include <linux/idle.h> #include <asm/io.h> #include <asm/uaccess.h> @@ -56,6 +58,7 @@ #define C3_OVERHEAD 4 /* 1us (3.579 ticks per us) */ static void (*pm_idle_save) (void); module_param(max_cstate, uint, 0644); +#define PM_IDLE_NAME "pm_idle" static unsigned int nocst = 0; module_param(nocst, uint, 0000); @@ -891,13 +894,13 @@ return_VALUE(-ENODEV); /* Fall back to the default idle loop */ - pm_idle = pm_idle_save; + set_idle(NULL); synchronize_sched(); /* Relies on interrupts forcing exit from idle. 
*/ pr->flags.power = 0; result = acpi_processor_get_power_info(pr); if ((pr->flags.power == 1) && (pr->flags.power_setup_done)) - pm_idle = acpi_processor_idle; + set_idle(PM_IDLE_NAME); return_VALUE(result); } @@ -983,6 +986,8 @@ .release = single_release, }; +static struct idle_info pm_idle_info; + int acpi_processor_power_init(struct acpi_processor *pr, struct acpi_device *device) { @@ -1032,8 +1037,17 @@ printk(")\n"); if (pr->id == 0) { - pm_idle_save = pm_idle; - pm_idle = acpi_processor_idle; + memset(&pm_idle_info, 0, sizeof(pm_idle_info)); + pm_idle_info.name = PM_IDLE_NAME; + pm_idle_info.func = acpi_processor_idle; + pm_idle_info.freeze = 1; + + register_idle(&pm_idle_info); + /* + * Just use the default idle + */ + pm_idle_save = get_idle(NULL); + set_idle(PM_IDLE_NAME); } } @@ -1068,7 +1082,29 @@ /* Unregister the idle handler when processor #0 is removed. */ if (pr->id == 0) { - pm_idle = pm_idle_save; + int tries = 0; + int ret; + set_idle(NULL); + do { + if ((ret = unregister_idle(PM_IDLE_NAME)) == 0) + break; + /* + * for some reason the idle function is being used. + * Wait a little and then try again. 
+ */ + if (ret == -EINVAL) { + printk(KERN_WARNING + "ACPI idle function never registered?\n"); + break; + } + yield(); + } while (tries++ < 10); + if (tries > 10) { + printk(KERN_WARNING + "Unable to unresgister ACPI idle function\n"); + /* don't unregister */ + return_VALUE(ret); + } /* * We are about to unload the current idle thread pm callback Index: linux-2.6.15-rc2-git5/include/linux/pm.h =================================================================== --- linux-2.6.15-rc2-git5.orig/include/linux/pm.h 2005-11-28 19:59:34.000000000 -0500 +++ linux-2.6.15-rc2-git5/include/linux/pm.h 2005-11-28 19:59:42.000000000 -0500 @@ -25,6 +25,7 @@ #include <linux/config.h> #include <linux/list.h> +#include <linux/spinlock.h> #include <asm/atomic.h> /* @@ -102,6 +103,8 @@ */ extern void (*pm_idle)(void); extern void (*pm_power_off)(void); +extern spinlock_t pm_idle_switch_lock; +extern int pm_idle_locked; typedef int __bitwise suspend_state_t; Index: linux-2.6.15-rc2-git5/arch/x86_64/Kconfig =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/x86_64/Kconfig 2005-11-28 19:59:34.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/x86_64/Kconfig 2005-11-28 19:59:42.000000000 -0500 @@ -69,6 +69,10 @@ bool default y +config DYNAMIC_IDLE + bool + default y + source "init/Kconfig" Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/x8664_ksyms.c =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/x86_64/kernel/x8664_ksyms.c 2005-11-28 19:59:34.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/x8664_ksyms.c 2005-11-28 19:59:42.000000000 -0500 @@ -58,7 +58,6 @@ EXPORT_SYMBOL(disable_irq_nosync); EXPORT_SYMBOL(probe_irq_mask); EXPORT_SYMBOL(kernel_thread); -EXPORT_SYMBOL(pm_idle); EXPORT_SYMBOL(pm_power_off); EXPORT_SYMBOL(get_cmos_time); Index: linux-2.6.15-rc2-git5/include/linux/idle.h =================================================================== --- /dev/null 
1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.15-rc2-git5/include/linux/idle.h 2005-11-28 21:36:00.000000000 -0500 @@ -0,0 +1,67 @@ +/* + * idle.h - Registering of the idle function (for supported archs) + * + * Copyright (C) 2005 Steven Rostedt <rostedt@goodmis.org> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _LINUX_IDLE_H +#define _LINUX_IDLE_H + +#include <linux/config.h> +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/list.h> +#include <asm/atomic.h> + +typedef void (*idlefunc_t)(void); + +struct idle_info { + struct list_head list; + const char *name; /* Name visible to users */ + idlefunc_t func; /* idle function to run */ + int freeze; /* Only allow kernel to add or remove */ + int inuse; /* set when being used */ +}; + +/* + * Registering and unregistering functions that may be used + * instead of the default idle function. This only adds + * them to the list of functions to be used, it does not + * set the + */ +extern int register_idle(struct idle_info *info); +extern int unregister_idle(const char *name); + +/* + * This sets the idle function to the registered function + * by name. Use NULL to set the idle function back to + * the default. + */ +extern int set_idle(const char *name); + +/* + * Return the function that is registered by name. 
+ * Use NULL to get the default function. + * NULL may be returned (as that may be what the current + * idle function is set to, to use a default). NULL will + * also be returned if name is not registered. + */ +extern idlefunc_t get_idle(const char *name); + +extern idlefunc_t idle_func; + +#endif /* _LINUX_IDLE_H */ Index: linux-2.6.15-rc2-git5/kernel/Makefile =================================================================== --- linux-2.6.15-rc2-git5.orig/kernel/Makefile 2005-11-28 19:59:34.000000000 -0500 +++ linux-2.6.15-rc2-git5/kernel/Makefile 2005-11-28 19:59:42.000000000 -0500 @@ -32,6 +32,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_DYNAMIC_IDLE) += idle.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is Index: linux-2.6.15-rc2-git5/kernel/idle.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.15-rc2-git5/kernel/idle.c 2005-11-28 20:29:57.000000000 -0500 @@ -0,0 +1,308 @@ +/* + * kernel/idle.c + * + * Setting up of the idle function to be dynamic. + * + * Copyright (C) 2005 Steven Rostedt + */ +#include <linux/module.h> +#include <linux/kobject.h> +#include <linux/sysfs.h> +#include <linux/spinlock.h> +#include <linux/idle.h> + +idlefunc_t idle_func; + +static void (*idle_default)(void); +static LIST_HEAD(idle_elements); +static DECLARE_MUTEX(idle_sem); +static struct idle_info *curr_idle; + +#ifdef CONFIG_SYSFS +int idle_sysfs_init; +#endif + +extern void poll_idle (void); + +static struct idle_info *__find_idle_info(const char *name) +{ + struct list_head *curr; + struct idle_info *p; + /* + * A little inefficient, but this isn't called often. 
+ */ + list_for_each(curr, &idle_elements) { + p = list_entry(curr, struct idle_info, list); + if (!strcmp(name, p->name)) + break; + } + if (curr == &idle_elements) + p = NULL; + + return p; +} + +int register_idle(struct idle_info *info) +{ + struct idle_info *p; + int ret = -EEXIST; + + BUG_ON(!info->name); + + down(&idle_sem); + + p = __find_idle_info(info->name); + if (p) + goto out; + ret = 0; + + list_add(&info->list, &idle_elements); + +out: + up(&idle_sem); + return ret; +} +EXPORT_SYMBOL_GPL(register_idle); + +int unregister_idle(const char *name) +{ + struct idle_info *p; + int ret = -EINVAL; + + BUG_ON(!name); + + down(&idle_sem); + + p = __find_idle_info(name); + if (!p) + goto out; + if (p->inuse) { + ret = -EBUSY; + goto out; + } + + ret = 0; + + list_del_init(&p->list); + +out: + up(&idle_sem); + return ret; +} +EXPORT_SYMBOL_GPL(unregister_idle); + +static int __set_idle(struct idle_info *info) +{ + if (curr_idle) + curr_idle->inuse--; + info->inuse++; + curr_idle = info; + return 0; +} + +int set_idle(const char *name) +{ + struct idle_info *p; + int ret = 0; + + down(&idle_sem); + + if (!name) { + /* Set to the default function */ + if (curr_idle) { + curr_idle->inuse--; + curr_idle = NULL; + } + idle_func = idle_default; + goto out; + } + + ret = -EINVAL; + p = __find_idle_info(name); + if (!p) + goto out; + + __set_idle(p); +out: + up(&idle_sem); + return ret; +} +EXPORT_SYMBOL_GPL(set_idle); + +idlefunc_t get_idle(const char *name) +{ + struct idle_info *p; + idlefunc_t ret = idle_default; + + down(&idle_sem); + + if (!name) + goto out; + + p = __find_idle_info(name); + if (!p) + goto out; + + ret = p->func; +out: + up(&idle_sem); + return ret; +} +EXPORT_SYMBOL_GPL(get_idle); + +#ifdef CONFIG_SYSFS +#define KERNEL_ATTR_RW(_name) \ +static struct subsys_attribute _name##_attr = \ + __ATTR(_name, 0644, _name##_show, _name##_store) + +static struct idlep_kobject +{ + struct kobject kobj; +} idle_kobj; + +static ssize_t idle_ctrl_show(struct 
subsystem *subsys, char *page) +{ + ssize_t ret; + char *star = ""; + const char *name = "default"; + + down(&idle_sem); + if (curr_idle) { + name = curr_idle->name; + if (curr_idle->freeze) + star = "*"; + } + ret = sprintf(page, "%s%s\n", star, name); + up(&idle_sem); + + return ret; +} + +static ssize_t idle_ctrl_store(struct subsystem *subsys, + const char *buf, size_t len) +{ + struct list_head *curr; + struct idle_info *p; + ssize_t ret = -EBUSY; + + down(&idle_sem); + + if (curr_idle && curr_idle->freeze) + goto out; + + list_for_each(curr, &idle_elements) { + int size; + p = list_entry(curr, struct idle_info, list); + + size = strlen(p->name); + if (len <= size) + continue; + if (!strncmp(p->name, buf, size)) + break; + } + if (curr == &idle_elements) { + ret = -EINVAL; + goto out; + } + + /* + * This idle routine may have been registered to + * not allow users to add or remove this. + */ + if (p->freeze) + goto out; + + __set_idle(p); + + ret = len; +out: + up(&idle_sem); + + return ret; +} + +KERNEL_ATTR_RW(idle_ctrl); + +static ssize_t idle_methods_show(struct subsystem *subsys, char *page) +{ + struct list_head *curr; + struct idle_info *p; + ssize_t len = 0; + + down(&idle_sem); + list_for_each(curr, &idle_elements) { + p = list_entry(curr, struct idle_info, list); + if (len + 3 + strlen(p->name) >= PAGE_SIZE) { + printk("idle functions overflowed sysfs??\n"); + break; + } + len += sprintf(page+len, "%s%s%s", + len ? " " : "", + p->freeze ? 
"*" : "", + p->name); + } + if (len + 2 < PAGE_SIZE) + len += sprintf(page+len, "\n"); + + up(&idle_sem); + return len; +} + +static ssize_t idle_methods_store(struct subsystem *subsys, + const char *buf, size_t len) +{ + /* do nothing */ + return len; +} + +KERNEL_ATTR_RW(idle_methods); + +static struct attribute * idle_attrs[] = { + &idle_ctrl_attr.attr, + &idle_methods_attr.attr, + NULL +}; + +static struct attribute_group idle_attr_group = { + .attrs = idle_attrs, +}; + +static int __init idle_setup_sysfs(void) +{ + int err; + + memset(&idle_kobj, 0, sizeof(idle_kobj)); + err = kobject_set_name(&idle_kobj.kobj, "%s", "idle"); + if (err) + goto out; + + kobj_set_kset_s(&idle_kobj, kernel_subsys); + + idle_kobj.kobj.parent = &kernel_subsys.kset.kobj; + err = kobject_register(&idle_kobj.kobj); + if (err) + goto out; + + err = sysfs_create_group(&idle_kobj.kobj, + &idle_attr_group); + if (err) + goto out; + + return 0; +out: + printk(KERN_INFO "Problem setting up sysfs idle_ctrl\n"); + return 0; +} +#endif /* CONFIG_SYSFS */ + +static int __init idle_setup(void) +{ + idle_default = idle_func; + +#ifdef CONFIG_SYSFS + idle_setup_sysfs(); +#endif + return 0; +} + +late_initcall(idle_setup); Index: linux-2.6.15-rc2-git5/arch/i386/Kconfig =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/i386/Kconfig 2005-11-28 19:59:34.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/i386/Kconfig 2005-11-28 19:59:42.000000000 -0500 @@ -45,6 +45,10 @@ bool default y +config DYNAMIC_IDLE + bool + default y + source "init/Kconfig" menu "Processor type and features" Index: linux-2.6.15-rc2-git5/arch/i386/kernel/apm.c =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/i386/kernel/apm.c 2005-11-28 19:59:34.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/i386/kernel/apm.c 2005-11-28 19:59:42.000000000 -0500 @@ -225,6 +225,7 @@ #include <linux/smp_lock.h> #include <linux/dmi.h> #include 
<linux/suspend.h> +#include <linux/idle.h> #include <asm/system.h> #include <asm/uaccess.h> @@ -2220,6 +2221,9 @@ { } }; +static struct idle_info apm_idle; +#define APM_IDLE_NAME "apm" + /* * Just start the APM thread. We do NOT want to do APM BIOS * calls from anything but the APM thread, if for no other reason @@ -2373,8 +2377,14 @@ if (HZ != 100) idle_period = (idle_period * HZ) / 100; if (idle_threshold < 100) { - original_pm_idle = pm_idle; - pm_idle = apm_cpu_idle; + memset(&apm_idle, 0, sizeof(apm_idle)); + apm_idle.name = APM_IDLE_NAME; + apm_idle.func = apm_cpu_idle; + apm_idle.freeze = 1; + register_idle(&apm_idle); + + original_pm_idle = get_idle(NULL); + set_idle(APM_IDLE_NAME); set_pm_idle = 1; } @@ -2386,7 +2396,26 @@ int error; if (set_pm_idle) { - pm_idle = original_pm_idle; + int tries = 0; + int ret; + set_idle(NULL); + do { + if ((ret = unregister_idle(APM_IDLE_NAME)) == 0) + break; + /* + * for some reason the idle function is being used. + * Wait a little and then try again. 
+ */ + if (ret == -EINVAL) { + printk(KERN_WARNING + "APM idle function never registered?\n"); + break; + } + yield(); + } while (tries++ < 10); + if (tries > 10) + printk(KERN_WARNING + "Unable to unresgister APM idle function\n"); /* * We are about to unload the current idle thread pm callback * (pm_idle), Wait for all processors to update cached/local Index: linux-2.6.15-rc2-git5/arch/ia64/Kconfig =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/ia64/Kconfig 2005-11-22 12:13:22.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/ia64/Kconfig 2005-11-28 20:17:30.000000000 -0500 @@ -62,6 +62,10 @@ bool default y +config DYNAMIC_IDLE + bool + default y + choice prompt "System type" default IA64_GENERIC Index: linux-2.6.15-rc2-git5/arch/ia64/kernel/acpi.c =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/ia64/kernel/acpi.c 2005-11-22 12:13:22.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/ia64/kernel/acpi.c 2005-11-28 20:23:41.000000000 -0500 @@ -60,8 +60,6 @@ #define PREFIX "ACPI: " -void (*pm_idle) (void); -EXPORT_SYMBOL(pm_idle); void (*pm_power_off) (void); EXPORT_SYMBOL(pm_power_off); Index: linux-2.6.15-rc2-git5/arch/ia64/kernel/process.c =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/ia64/kernel/process.c 2005-11-25 10:58:53.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/ia64/kernel/process.c 2005-11-28 20:29:33.000000000 -0500 @@ -31,6 +31,7 @@ #include <linux/interrupt.h> #include <linux/delay.h> #include <linux/kprobes.h> +#include <linux/idle.h> #include <asm/cpu.h> #include <asm/delay.h> @@ -289,7 +290,7 @@ if (mark_idle) (*mark_idle)(1); - idle = pm_idle; + idle = idle_func; if (!idle) idle = default_idle; (*idle)(); Index: linux-2.6.15-rc2-git5/arch/ia64/kernel/setup.c =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/ia64/kernel/setup.c 
2005-11-22 12:13:22.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/ia64/kernel/setup.c 2005-11-28 20:23:09.000000000 -0500 @@ -43,6 +43,7 @@ #include <linux/initrd.h> #include <linux/platform.h> #include <linux/pm.h> +#include <linux/idle.h> #include <asm/ia32.h> #include <asm/machvec.h> @@ -738,6 +739,8 @@ ia64_max_cacheline_size = max; } +struct idle_info idle_default; + /* * cpu_init() initializes state that is per-CPU. This function acts * as a 'CPU state barrier', nothing should get across. @@ -861,7 +864,13 @@ /* size of physical stacked register partition plus 8 bytes: */ __get_cpu_var(ia64_phys_stacked_size_p8) = num_phys_stacked*8 + 8; platform_cpu_init(); - pm_idle = default_idle; + + memset(&idle_default, 0, sizeof(idle_default)); + idle_default.name = "default"; + idle_default.func = default_idle; + register_idle(&idle_default); + + set_idle("default"); } void ^ permalink raw reply [flat|nested] 65+ messages in thread
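[Editor's sketch, not part of the original posting.] The callers in this patch (apm.c, the ia64 setup code) use register_idle(), set_idle() and unregister_idle() in a fixed order: register a named handler, switch to it by name, switch back with set_idle(NULL), then unregister. A small userspace model of that life cycle may help when reading the hunks. It is only a sketch under stated assumptions: the model_ names and demo_poll_idle() are hypothetical, a fixed-size table stands in for the kernel list, there is no locking, and the model updates the active function pointer inside the switch helper itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

typedef void (*idlefunc_t)(void);

struct idle_entry {
	const char *name;	/* name used to select the handler */
	idlefunc_t func;	/* handler to run when selected */
	int inuse;		/* non-zero while this handler is current */
};

#define MAX_IDLE 8

static struct idle_entry *idle_table[MAX_IDLE];	/* registered handlers */
static struct idle_entry *curr_idle;		/* NULL means "default" */
static idlefunc_t idle_func;	/* NULL lets the caller fall back to its default */

/* Hypothetical stand-in for a real idle routine such as poll_idle(). */
void demo_poll_idle(void) { }

static struct idle_entry **find_slot(const char *name)
{
	for (int i = 0; i < MAX_IDLE; i++)
		if (idle_table[i] && strcmp(idle_table[i]->name, name) == 0)
			return &idle_table[i];
	return NULL;
}

/* Add a handler; a second registration under the same name is refused. */
int model_register_idle(struct idle_entry *e)
{
	assert(e && e->name);
	if (find_slot(e->name))
		return -EEXIST;
	for (int i = 0; i < MAX_IDLE; i++) {
		if (!idle_table[i]) {
			idle_table[i] = e;
			return 0;
		}
	}
	return -ENOMEM;		/* table full (limit of this model only) */
}

/* Remove a handler; refused with -EBUSY while it is the current one. */
int model_unregister_idle(const char *name)
{
	struct idle_entry **slot = find_slot(name);

	if (!slot)
		return -EINVAL;
	if ((*slot)->inuse)
		return -EBUSY;
	*slot = NULL;
	return 0;
}

/* Switch by name; a NULL name switches back to the default. */
int model_set_idle(const char *name)
{
	struct idle_entry *e = NULL;

	if (name) {
		struct idle_entry **slot = find_slot(name);

		if (!slot)
			return -EINVAL;
		e = *slot;
	}
	if (curr_idle)
		curr_idle->inuse--;
	if (e)
		e->inuse++;
	curr_idle = e;
	idle_func = e ? e->func : NULL;
	return 0;
}

/* What an idle loop in this model would call next. */
idlefunc_t model_current_idle(void)
{
	return idle_func;
}
```

In this model a handler cannot be unregistered while it is selected, which is why the callers above switch back with set_idle(NULL) before calling unregister_idle().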
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 2:48 ` [RFC][PATCH] Runtime switching of the idle function [take 2] Steven Rostedt
@ 2005-11-29 3:02 ` Andrew Morton
2005-11-29 3:42 ` Steven Rostedt
2005-11-29 13:08 ` Pavel Machek
1 sibling, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2005-11-29 3:02 UTC (permalink / raw)
To: Steven Rostedt
Cc: mingo, acpi-devel, len.brown, nando, rlrevell, linux-kernel,
paulmck, kr, tglx, pluto, john.cooper, bene, dwalker, trini, george

Steven Rostedt <rostedt@goodmis.org> wrote:
>
> This patch creates a directory in /sys/kernel called idle.
>

At no point do you appear to explain _why_ the kernel needs this feature?

> ...
> -		pm_idle = pm_idle_save;
> +		int tries = 0;
> +		int ret;
> +		set_idle(NULL);
> +		do {
> +			if ((ret = unregister_idle(PM_IDLE_NAME)) == 0)
> +				break;
> +			/*
> +			 * for some reason the idle function is being used.
> +			 * Wait a little and then try again.
> +			 */
> +			if (ret == -EINVAL) {
> +				printk(KERN_WARNING
> +				       "ACPI idle function never registered?\n");
> +				break;
> +			}
> +			yield();
> +		} while (tries++ < 10);

The use of yield() could be problematic - its semantics are rather
ill-defined. Maybe msleep(1) or something?

What's this loop here for anyway? Looks kludgy.

> +		if (tries > 10) {
> +			printk(KERN_WARNING
> +			       "Unable to unresgister ACPI idle function\n");

tpyo

> +	memset(&idle_kobj, 0, sizeof(idle_kobj));

There are several memsets of statically allocated structures which are
already all-zero.

^ permalink raw reply [flat|nested] 65+ messages in thread
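[Editor's sketch, not part of the original message.] The review comment above suggests sleeping for a fixed short interval between attempts rather than calling yield(). The shape of such a bounded retry can be shown in userspace C, assuming nanosleep() as a stand-in for the kernel's msleep(1). The names retry_while_busy() and demo_busy_then_ok() are hypothetical and exist only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns 0 on success or a negative errno. */
typedef int (*try_fn)(void *ctx);

/*
 * Call op() up to max_tries times, sleeping about 1 ms between attempts.
 * Only -EBUSY is retried; any other result is returned immediately.
 */
int retry_while_busy(try_fn op, void *ctx, int max_tries)
{
	const struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 };
	int ret = -EBUSY;

	assert(op && max_tries > 0);
	for (int i = 0; i < max_tries; i++) {
		ret = op(ctx);
		if (ret != -EBUSY)
			return ret;	/* success, or an error retrying cannot fix */
		if (i + 1 < max_tries)
			nanosleep(&pause, NULL);	/* defined delay, unlike yield() */
	}
	return ret;	/* still -EBUSY after max_tries attempts */
}

/* Demo operation: reports -EBUSY until the counter in ctx reaches zero. */
int demo_busy_then_ok(void *ctx)
{
	int *left = ctx;

	if (*left > 0) {
		(*left)--;
		return -EBUSY;
	}
	return 0;
}
```

Returning the last error code lets the caller tell "still busy" from "never registered" without keeping a separate try counter.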
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 3:02 ` Andrew Morton
@ 2005-11-29 3:42 ` Steven Rostedt
2005-11-29 4:01 ` Andrew Morton
2005-11-29 4:22 ` john stultz
0 siblings, 2 replies; 65+ messages in thread
From: Steven Rostedt @ 2005-11-29 3:42 UTC (permalink / raw)
To: Andrew Morton
Cc: mingo, acpi-devel, len.brown, nando, rlrevell, linux-kernel,
paulmck, kr, tglx, pluto, john.cooper, bene, dwalker, trini, george

On Mon, 2005-11-28 at 19:02 -0800, Andrew Morton wrote:
> Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > This patch creates a directory in /sys/kernel called idle.
> >
>
> At no point do you appear to explain _why_ the kernel needs this feature?

Sorry about that. This originally came up when we had problems with the
AMD64 x2 in the -rt patch. It was noted that the TSCs would get very
far out of sync and cause problems. The way to solve this was to set
idle=poll. The original patch I sent was to allow the user to change to
idle=poll dynamically. This way they could switch to the poll_idle and
run there tests (requiring tsc not to drift) and then switch back to the
default idle to save on electricity.

Note: It's been stated that the tsc drift can cause problems with the
vanilla kernel too.

Ingo asked if I could make this more robust and not dependent on
idle_poll. Maybe Ingo can give a better explanation?

>
> > ...
> > -		pm_idle = pm_idle_save;
> > +		int tries = 0;
> > +		int ret;
> > +		set_idle(NULL);
> > +		do {
> > +			if ((ret = unregister_idle(PM_IDLE_NAME)) == 0)
> > +				break;
> > +			/*
> > +			 * for some reason the idle function is being used.
> > +			 * Wait a little and then try again.
> > +			 */
> > +			if (ret == -EINVAL) {
> > +				printk(KERN_WARNING
> > +				       "ACPI idle function never registered?\n");
> > +				break;
> > +			}
> > +			yield();
> > +		} while (tries++ < 10);
>
> The use of yield() could be problematic - its semantics are rather
> ill-defined. Maybe msleep(1) or something?
>
> What's this loop here for anyway? Looks kludgy.

Oops! That was required by some other garbage that I had earlier. I
cleaned up the patch some more, and this is no longer required. (will
remove).

>
> > +		if (tries > 10) {
> > +			printk(KERN_WARNING
> > +			       "Unable to unresgister ACPI idle function\n");
>
> tpyo

Will fix.

>
> > +	memset(&idle_kobj, 0, sizeof(idle_kobj));
>
> There are several memsets of statically allocated structures which are
> already all-zero.
>

:) I'm really paranoid! OK, I always like to do a memset even when it's
not needed. I'll purge them too.

Thanks for having a look.

-- Steve

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 3:42 ` Steven Rostedt
@ 2005-11-29 4:01 ` Andrew Morton
2005-11-29 6:44 ` Ingo Molnar
2005-11-29 4:22 ` john stultz
1 sibling, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2005-11-29 4:01 UTC (permalink / raw)
To: Steven Rostedt
Cc: mingo, acpi-devel, len.brown, nando, rlrevell, linux-kernel,
paulmck, kr, tglx, pluto, john.cooper, bene, dwalker, trini, george

Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 2005-11-28 at 19:02 -0800, Andrew Morton wrote:
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > This patch creates a directory in /sys/kernel called idle.
> > >
> >
> > At no point do you appear to explain _why_ the kernel needs this feature?
>
> Sorry about that. This originally came up when we had problems with the
> AMD64 x2 in the -rt patch. It was noted that the TSCs would get very
> far out of sync and cause problems.

Unsynced TSCs are rare, but they happen. I guess even if we were to
resync them, these measurements would screw up.

> The way to solve this was to set
> idle=poll. The original patch I sent was to allow the user to change to
> idle=poll dynamically. This way they could switch to the poll_idle and
> run there tests (requiring tsc not to drift) and then switch back to the
> default idle to save on electricity.

Use gettimeofday()?

If it's just for some sort of instrumentation, run NR_CPUS instances
of a niced-down busyloop, pin each one to a different CPU? That way
the idle function doesn't get called at all..

^ permalink raw reply [flat|nested] 65+ messages in thread
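[Editor's sketch, not part of the original message.] The workaround suggested above, one niced-down busy loop pinned to each CPU so the idle function is never entered, could look roughly like this from userspace on Linux. It is a minimal sketch under stated assumptions: it relies on the glibc sched_setaffinity() wrapper (hence _GNU_SOURCE), the function names are hypothetical, and the caller supplies the CPU count rather than the kernel's NR_CPUS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child body: pin to one CPU, drop to the lowest priority, spin forever. */
static void busyloop_on_cpu(int cpu)
{
	volatile unsigned long spins = 0;
	cpu_set_t set;
	int prio;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
		_exit(1);	/* this CPU is not available to the process */
	prio = nice(19);	/* only run when nothing else wants the CPU */
	(void)prio;
	for (;;)
		spins++;	/* keeps the CPU out of its idle routine */
}

/*
 * Fork one spinner per CPU in [0, ncpus). Returns how many children were
 * started; their pids are stored in pids[0..return-1].
 */
int spawn_busyloops(int ncpus, pid_t *pids)
{
	int started = 0;

	assert(ncpus >= 0 && pids);
	for (int cpu = 0; cpu < ncpus; cpu++) {
		pid_t pid = fork();

		if (pid == 0)
			busyloop_on_cpu(cpu);	/* never returns */
		if (pid > 0)
			pids[started++] = pid;
	}
	return started;
}

/* Kill and reap the spinners once the measurement is finished. */
void stop_busyloops(int count, const pid_t *pids)
{
	for (int i = 0; i < count; i++) {
		kill(pids[i], SIGKILL);
		waitpid(pids[i], NULL, 0);
	}
}
```

A caller would typically pass sysconf(_SC_NPROCESSORS_ONLN) as ncpus, run the measurement, and then call stop_busyloops(). Because the spinners run at nice 19, they should give way to any real work.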
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 4:01 ` Andrew Morton
@ 2005-11-29 6:44 ` Ingo Molnar
2005-11-29 6:55 ` Nick Piggin
2005-11-29 18:05 ` Andi Kleen
0 siblings, 2 replies; 65+ messages in thread
From: Ingo Molnar @ 2005-11-29 6:44 UTC (permalink / raw)
To: Andrew Morton
Cc: Steven Rostedt, acpi-devel, len.brown, nando, rlrevell,
linux-kernel, paulmck, kr, tglx, pluto, john.cooper, bene, dwalker,
trini, george

* Andrew Morton <akpm@osdl.org> wrote:

> > The way to solve this was to set
> > idle=poll. The original patch I sent was to allow the user to change to
> > idle=poll dynamically. This way they could switch to the poll_idle and
> > run there tests (requiring tsc not to drift) and then switch back to the
> > default idle to save on electricity.
>
> Use gettimeofday()?
>
> If it's just for some sort of instrumentation, run NR_CPUS instances
> of a niced-down busyloop, pin each one to a different CPU? That way
> the idle function doesn't get called at all..

idle=poll is also frequently done for performance reasons [it reduces
idle wakeup latency by 10 usecs] - while it could be turned off if the
system has been idle for some time. E.g. cpufreqd could sample idle time
and turn on/off idle=poll. High-performance setups could enable it all
the time.

as long as it can be done with zero-cost, i dont see why Steven's patch
wouldnt be a plus for us. It's a performance thing, and having runtime
switches for seemless performance features cannot be bad.

Ingo

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 6:44 ` Ingo Molnar
@ 2005-11-29 6:55 ` Nick Piggin
2005-11-29 18:05 ` Andi Kleen
1 sibling, 0 replies; 65+ messages in thread
From: Nick Piggin @ 2005-11-29 6:55 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Steven Rostedt, acpi-devel, len.brown, nando,
rlrevell, linux-kernel, paulmck, kr, tglx, pluto, john.cooper, bene,
dwalker, trini, george

Ingo Molnar wrote:
> * Andrew Morton <akpm@osdl.org> wrote:
>
>
>>>The way to solve this was to set
>>> idle=poll. The original patch I sent was to allow the user to change to
>>> idle=poll dynamically. This way they could switch to the poll_idle and
>>> run there tests (requiring tsc not to drift) and then switch back to the
>>> default idle to save on electricity.
>>
>>Use gettimeofday()?
>>
>>If it's just for some sort of instrumentation, run NR_CPUS instances
>>of a niced-down busyloop, pin each one to a different CPU? That way
>>the idle function doesn't get called at all..
>
>
> idle=poll is also frequently done for performance reasons [it reduces
> idle wakeup latency by 10 usecs] - while it could be turned off if the
> system has been idle for some time. E.g. cpufreqd could sample idle time
> and turn on/off idle=poll. High-performance setups could enable it all
> the time.
>
> as long as it can be done with zero-cost, i dont see why Steven's patch
> wouldnt be a plus for us. It's a performance thing, and having runtime
> switches for seemless performance features cannot be bad.
>

Why not just slightly cleanup and extend (eg. to ACPI) the
hlt_counter thingy that many architectures already have?

Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 6:44 ` Ingo Molnar
2005-11-29 6:55 ` Nick Piggin
@ 2005-11-29 18:05 ` Andi Kleen
2005-11-29 14:19 ` Steven Rostedt
2005-12-02 1:27 ` Max Krasnyansky
1 sibling, 2 replies; 65+ messages in thread
From: Andi Kleen @ 2005-11-29 18:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, acpi-devel, len.brown, nando, rlrevell,
linux-kernel, paulmck, kr, tglx, pluto, john.cooper, bene, dwalker,
trini, george, akpm

Ingo Molnar <mingo@elte.hu> writes:

> * Andrew Morton <akpm@osdl.org> wrote:
>
> > > The way to solve this was to set
> > > idle=poll. The original patch I sent was to allow the user to change to
> > > idle=poll dynamically. This way they could switch to the poll_idle and
> > > run there tests (requiring tsc not to drift) and then switch back to the
> > > default idle to save on electricity.
> >
> > Use gettimeofday()?
> >
> > If it's just for some sort of instrumentation, run NR_CPUS instances
> > of a niced-down busyloop, pin each one to a different CPU? That way
> > the idle function doesn't get called at all..
>
> idle=poll is also frequently done for performance reasons [it reduces
> idle wakeup latency by 10 usecs]

And it's obsolete on CPUs with monitor/mwait.

And in practice the CPU will run so hot that only benchmarkers like it.

I think switching idle is the wrong way to do. We should rather
fix the various problems.

For fixing the TSC issue it is 100% the wrong approach Imho.

Basically software has to live with TSCs being unsynchronized
and gettimeofday should do the right thing (and if not it should be fixed)

- while it could be turned off if the
> system has been idle for some time. E.g. cpufreqd could sample idle time
> and turn on/off idle=poll. High-performance setups could enable it all
> the time.

And upgrade their server air condition or issue additional ear protection
to the desktop user? Most likely you will just drive the CPUs into
thermal throttle at some point with that, not get more performance anyways.

> as long as it can be done with zero-cost, i dont see why Steven's patch
> wouldnt be a plus for us. It's a performance thing, and having runtime
> switches for seemless performance features cannot be bad.

The interface is ugly and I suspect fixing the various obscure race this
obscure feature would undoubtedly add will be a long term maintenance
issue. And it's the wrong thing to do anyways because it just papers
over other problems that should be fixed in the right way.

-Andi

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 18:05 ` Andi Kleen
@ 2005-11-29 14:19 ` Steven Rostedt
2005-11-29 14:50 ` Andi Kleen
2005-12-02 1:27 ` Max Krasnyansky
1 sibling, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2005-11-29 14:19 UTC (permalink / raw)
To: Andi Kleen
Cc: Ingo Molnar, acpi-devel, len.brown, nando, rlrevell, linux-kernel,
paulmck, kr, tglx, pluto, john.cooper, bene, dwalker, trini, george,
akpm

On Tue, 2005-11-29 at 11:05 -0700, Andi Kleen so nicely wrote:

> > idle=poll is also frequently done for performance reasons [it reduces
> > idle wakeup latency by 10 usecs]
>
> And it's obsolete on CPUs with monitor/mwait.

And I wish my system supported it.

> And in practice the CPU will run so hot that only benchmarkers like it.

Why would it run hot? What's the difference between polling and doing
other things. How many transistors does it take to poll?

>
> I think switching idle is the wrong way to do. We should rather
> fix the various problems.
>
> For fixing the TSC issue it is 100% the wrong approach Imho.

I would only say 80% the wrong approach, but that's me ;-)

> Basically software has to live with TSCs being unsynchronized
> and gettimeofday should do the right thing (and if not it should be fixed)

I guess the biggest complaint most have is that the rdtsc _is_ the
fastest way to read a clock. If it isn't reliable, then what good is
it? It's unfortunate that Intel didn't solidify the clock usage. Yes,
use HPET, or something else, but those are slower, and may not be on all
systems. Every system that I owned had a tsc but for critical systems
it isn't up to par (what a shame).

> > - while it could be turned off if the
> > system has been idle for some time. E.g. cpufreqd could sample idle time
> > and turn on/off idle=poll. High-performance setups could enable it all
> > the time.
>
> And upgrade their server air condition or issue additional ear protection
> to the desktop user? Most likely you will just drive the CPUs into
> thermal throttle at some point with that, not get more performance anyways.

Again, what would make it so hot? It is a waste of CPU cycles, and does
waste energy that way, but does it really heat up the CPU that much?
It's just a loop. I've run much more complex algorithms for days
without any problems. I only once over heated a CPU and that was doing
some brute force calculations of prime numbers.

>
> > as long as it can be done with zero-cost, i dont see why Steven's patch
> > wouldnt be a plus for us. It's a performance thing, and having runtime
> > switches for seemless performance features cannot be bad.
>
> The interface is ugly and I suspect fixing the various obscure race this
> obscure feature would undoubtedly add will be a long term maintenance
> issue. And it's the wrong thing to do anyways because it just papers
> over other problems that should be fixed in the right way.

Oh come now, it's not that ugly. And it would not produce any more
obscure race conditions than the current method of changing idle with
the acpi processor_idle module has.

But I'll agree that this is more of a paper over than a solution. Too
bad I wasted a day writing and testing it (mostly just to learn about
kobjects and sysfs which I still feel is very clumsy).

But since I did clean up the patch, and it is still useful for those
debugging problems with timers. I'm supplying this cleaned up version
(Thank you Andrew for the comments).

-- Steve

Ingo, would you like this for -rt? Even if it will never be accepted
into mainline.
[take 3]: Index: linux-2.6.15-rc2-git5/arch/i386/kernel/process.c =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/i386/kernel/process.c 2005-11-28 20:31:24.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/i386/kernel/process.c 2005-11-29 07:43:52.000000000 -0500 @@ -39,6 +39,7 @@ #include <linux/ptrace.h> #include <linux/random.h> #include <linux/kprobes.h> +#include <linux/idle.h> #include <asm/uaccess.h> #include <asm/pgtable.h> @@ -72,11 +73,6 @@ return ((unsigned long *)tsk->thread.esp)[3]; } -/* - * Powermanagement idle function, if any.. - */ -void (*pm_idle)(void); -EXPORT_SYMBOL(pm_idle); static DEFINE_PER_CPU(unsigned int, cpu_idle_state); void disable_hlt(void) @@ -185,7 +181,7 @@ __get_cpu_var(cpu_idle_state) = 0; rmb(); - idle = pm_idle; + idle = idle_func; if (!idle) idle = default_idle; @@ -250,6 +246,11 @@ } } +static struct idle_info idle_mwait = { + .name = "mwait", + .func = mwait_idle +}; + void __devinit select_idle_routine(const struct cpuinfo_x86 *c) { if (cpu_has(c, X86_FEATURE_MWAIT)) { @@ -258,25 +259,60 @@ * Skip, if setup has overridden idle. * One CPU supports mwait => All CPUs supports mwait */ - if (!pm_idle) { + register_idle(&idle_mwait); + + if (!idle_func) { printk("using mwait in idle threads.\n"); - pm_idle = mwait_idle; + set_idle("mwait"); } } } +static struct idle_info idle_default = { + .name = "default", + .func = default_idle +}; + +static struct idle_info idle_poll = { + .name = "poll", + .func = poll_idle +}; + +static int __init add_idle(void) +{ + static int set; + + if (set) + return 0; + set = 1; + + register_idle(&idle_poll); + + /* + * Allow the user to switch out of poll_idle even + * if it was a boot option. 
+ */ + register_idle(&idle_default); + + return 0; +} + +arch_initcall(add_idle); + static int __init idle_setup (char *str) { + add_idle(); if (!strncmp(str, "poll", 4)) { printk("using polling idle threads.\n"); - pm_idle = poll_idle; + set_idle("poll"); + #ifdef CONFIG_X86_SMP if (smp_num_siblings > 1) printk("WARNING: polling idle and HT enabled, performance may degrade.\n"); #endif } else if (!strncmp(str, "halt", 4)) { printk("using halt in idle threads.\n"); - pm_idle = default_idle; + set_idle("default"); } boot_option_idle_override = 1; Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/process.c =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/x86_64/kernel/process.c 2005-11-28 20:31:24.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/process.c 2005-11-29 07:45:44.000000000 -0500 @@ -36,6 +36,8 @@ #include <linux/utsname.h> #include <linux/random.h> #include <linux/kprobes.h> +#include <linux/spinlock.h> +#include <linux/idle.h> #include <asm/uaccess.h> #include <asm/pgtable.h> @@ -60,10 +62,6 @@ unsigned long boot_option_idle_override = 0; EXPORT_SYMBOL(boot_option_idle_override); -/* - * Powermanagement idle function, if any.. - */ -void (*pm_idle)(void); static DEFINE_PER_CPU(unsigned int, cpu_idle_state); void disable_hlt(void) @@ -195,7 +193,7 @@ __get_cpu_var(cpu_idle_state) = 0; rmb(); - idle = pm_idle; + idle = idle_func; if (!idle) idle = default_idle; if (cpu_is_offline(smp_processor_id())) @@ -229,29 +227,68 @@ } } +static struct idle_info idle_mwait = { + .name = "mwait", + .func = mwait_idle +}; + void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c) { static int printed; if (cpu_has(c, X86_FEATURE_MWAIT)) { + register_idle(&idle_mwait); + /* * Skip, if setup has overridden idle. 
* One CPU supports mwait => All CPUs supports mwait */ - if (!pm_idle) { + if (!idle_func) { if (!printed) { printk("using mwait in idle threads.\n"); printed = 1; } - pm_idle = mwait_idle; + set_idle("mwait"); } } } +static struct idle_info idle_default = { + .name = "default", + .func = default_idle +}; + +static struct idle_info idle_poll = { + .name = "poll", + .func = poll_idle +}; + +static int __init add_idle(void) +{ + static int set; + + if (set) + return 0; + set = 1; + + register_idle(&idle_poll); + + /* + * Allow the user to switch out of poll_idle even + * if it was a boot option. + */ + register_idle(&idle_default); + + return 0; +} +arch_initcall(add_idle); + static int __init idle_setup (char *str) { + add_idle(); + if (!strncmp(str, "poll", 4)) { printk("using polling idle threads.\n"); - pm_idle = poll_idle; + set_idle("poll"); } boot_option_idle_override = 1; Index: linux-2.6.15-rc2-git5/drivers/acpi/processor_idle.c =================================================================== --- linux-2.6.15-rc2-git5.orig/drivers/acpi/processor_idle.c 2005-11-28 20:31:24.000000000 -0500 +++ linux-2.6.15-rc2-git5/drivers/acpi/processor_idle.c 2005-11-29 07:47:52.000000000 -0500 @@ -38,6 +38,8 @@ #include <linux/dmi.h> #include <linux/moduleparam.h> #include <linux/sched.h> /* need_resched() */ +#include <linux/spinlock.h> +#include <linux/idle.h> #include <asm/io.h> #include <asm/uaccess.h> @@ -56,6 +58,7 @@ #define C3_OVERHEAD 4 /* 1us (3.579 ticks per us) */ static void (*pm_idle_save) (void); module_param(max_cstate, uint, 0644); +#define PM_IDLE_NAME "pm_idle" static unsigned int nocst = 0; module_param(nocst, uint, 0000); @@ -891,13 +894,13 @@ return_VALUE(-ENODEV); /* Fall back to the default idle loop */ - pm_idle = pm_idle_save; + set_idle(NULL); synchronize_sched(); /* Relies on interrupts forcing exit from idle. 
*/ pr->flags.power = 0; result = acpi_processor_get_power_info(pr); if ((pr->flags.power == 1) && (pr->flags.power_setup_done)) - pm_idle = acpi_processor_idle; + set_idle(PM_IDLE_NAME); return_VALUE(result); } @@ -983,6 +986,12 @@ .release = single_release, }; +static struct idle_info pm_idle_info = { + .name = PM_IDLE_NAME, + .func = acpi_processor_idle, + .freeze = 1 +}; + int acpi_processor_power_init(struct acpi_processor *pr, struct acpi_device *device) { @@ -1032,8 +1041,12 @@ printk(")\n"); if (pr->id == 0) { - pm_idle_save = pm_idle; - pm_idle = acpi_processor_idle; + register_idle(&pm_idle_info); + /* + * Just use the default idle + */ + pm_idle_save = get_idle(NULL); + set_idle(PM_IDLE_NAME); } } @@ -1068,8 +1081,8 @@ /* Unregister the idle handler when processor #0 is removed. */ if (pr->id == 0) { - pm_idle = pm_idle_save; - + set_idle(NULL); + unregister_idle(PM_IDLE_NAME); /* * We are about to unload the current idle thread pm callback * (pm_idle), Wait for all processors to update cached/local Index: linux-2.6.15-rc2-git5/include/linux/pm.h =================================================================== --- linux-2.6.15-rc2-git5.orig/include/linux/pm.h 2005-11-28 20:31:24.000000000 -0500 +++ linux-2.6.15-rc2-git5/include/linux/pm.h 2005-11-28 20:31:47.000000000 -0500 @@ -25,6 +25,7 @@ #include <linux/config.h> #include <linux/list.h> +#include <linux/spinlock.h> #include <asm/atomic.h> /* @@ -102,6 +103,8 @@ */ extern void (*pm_idle)(void); extern void (*pm_power_off)(void); +extern spinlock_t pm_idle_switch_lock; +extern int pm_idle_locked; typedef int __bitwise suspend_state_t; Index: linux-2.6.15-rc2-git5/arch/x86_64/Kconfig =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/x86_64/Kconfig 2005-11-28 20:31:24.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/x86_64/Kconfig 2005-11-28 20:31:47.000000000 -0500 @@ -69,6 +69,10 @@ bool default y +config DYNAMIC_IDLE + bool + default y + source 
"init/Kconfig" Index: linux-2.6.15-rc2-git5/arch/x86_64/kernel/x8664_ksyms.c =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/x86_64/kernel/x8664_ksyms.c 2005-11-28 20:31:24.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/x86_64/kernel/x8664_ksyms.c 2005-11-28 20:31:47.000000000 -0500 @@ -58,7 +58,6 @@ EXPORT_SYMBOL(disable_irq_nosync); EXPORT_SYMBOL(probe_irq_mask); EXPORT_SYMBOL(kernel_thread); -EXPORT_SYMBOL(pm_idle); EXPORT_SYMBOL(pm_power_off); EXPORT_SYMBOL(get_cmos_time); Index: linux-2.6.15-rc2-git5/include/linux/idle.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.15-rc2-git5/include/linux/idle.h 2005-11-28 20:31:47.000000000 -0500 @@ -0,0 +1,71 @@ +/* + * idle.h - Registering of the idle function (for supported archs) + * + * Copyright (C) 2005 Steven Rostedt <rostedt@goodmis.org> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _LINUX_IDLE_H +#define _LINUX_IDLE_H + +#include <linux/config.h> +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/list.h> +#include <linux/kobject.h> +#include <asm/atomic.h> + +typedef void (*idlefunc_t)(void); + +struct idle_info { + struct list_head list; + const char *name; /* Name visible to users */ + idlefunc_t func; /* idle function to run */ + int freeze; /* Only allow kernel to add or remove */ + int inuse; /* set when being used */ +#ifdef CONFIG_SYSFS + struct kobject kobj; +#endif +}; + +/* + * Registering and unregistering functions that may be used + * instead of the default idle function. This only adds + * them to the list of functions to be used, it does not + * set the + */ +extern int register_idle(struct idle_info *info); +extern int unregister_idle(const char *name); + +/* + * This sets the idle function to the registered function + * by name. Use NULL to set the idle function back to + * the default. + */ +extern int set_idle(const char *name); + +/* + * Return the function that is registered by name. + * Use NULL to get the default function. + * NULL may be returned (as that may be what the current + * idle function is set to, to use a default). NULL will + * also be returned if name is not registered. 
+ */ +extern idlefunc_t get_idle(const char *name); + +extern idlefunc_t idle_func; + +#endif /* _LINUX_IDLE_H */ Index: linux-2.6.15-rc2-git5/kernel/Makefile =================================================================== --- linux-2.6.15-rc2-git5.orig/kernel/Makefile 2005-11-28 20:31:24.000000000 -0500 +++ linux-2.6.15-rc2-git5/kernel/Makefile 2005-11-28 20:31:47.000000000 -0500 @@ -32,6 +32,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_DYNAMIC_IDLE) += idle.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is Index: linux-2.6.15-rc2-git5/kernel/idle.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.15-rc2-git5/kernel/idle.c 2005-11-28 20:31:47.000000000 -0500 @@ -0,0 +1,308 @@ +/* + * kernel/idle.c + * + * Setting up of the idle function to be dynamic. + * + * Copyright (C) 2005 Steven Rostedt + */ +#include <linux/module.h> +#include <linux/kobject.h> +#include <linux/sysfs.h> +#include <linux/spinlock.h> +#include <linux/idle.h> + +idlefunc_t idle_func; + +static void (*idle_default)(void); +static LIST_HEAD(idle_elements); +static DECLARE_MUTEX(idle_sem); +static struct idle_info *curr_idle; + +#ifdef CONFIG_SYSFS +int idle_sysfs_init; +#endif + +extern void poll_idle (void); + +static struct idle_info *__find_idle_info(const char *name) +{ + struct list_head *curr; + struct idle_info *p; + /* + * A little inefficient, but this isn't called often. 
+ */ + list_for_each(curr, &idle_elements) { + p = list_entry(curr, struct idle_info, list); + if (!strcmp(name, p->name)) + break; + } + if (curr == &idle_elements) + p = NULL; + + return p; +} + +int register_idle(struct idle_info *info) +{ + struct idle_info *p; + int ret = -EEXIST; + + BUG_ON(!info->name); + + down(&idle_sem); + + p = __find_idle_info(info->name); + if (p) + goto out; + ret = 0; + + list_add(&info->list, &idle_elements); + +out: + up(&idle_sem); + return ret; +} +EXPORT_SYMBOL_GPL(register_idle); + +int unregister_idle(const char *name) +{ + struct idle_info *p; + int ret = -EINVAL; + + BUG_ON(!name); + + down(&idle_sem); + + p = __find_idle_info(name); + if (!p) + goto out; + if (p->inuse) { + ret = -EBUSY; + goto out; + } + + ret = 0; + + list_del_init(&p->list); + +out: + up(&idle_sem); + return ret; +} +EXPORT_SYMBOL_GPL(unregister_idle); + +static int __set_idle(struct idle_info *info) +{ + if (curr_idle) + curr_idle->inuse--; + info->inuse++; + curr_idle = info; + return 0; +} + +int set_idle(const char *name) +{ + struct idle_info *p; + int ret = 0; + + down(&idle_sem); + + if (!name) { + /* Set to the default function */ + if (curr_idle) { + curr_idle->inuse--; + curr_idle = NULL; + } + idle_func = idle_default; + goto out; + } + + ret = -EINVAL; + p = __find_idle_info(name); + if (!p) + goto out; + + __set_idle(p); +out: + up(&idle_sem); + return ret; +} +EXPORT_SYMBOL_GPL(set_idle); + +idlefunc_t get_idle(const char *name) +{ + struct idle_info *p; + idlefunc_t ret = idle_default; + + down(&idle_sem); + + if (!name) + goto out; + + p = __find_idle_info(name); + if (!p) + goto out; + + ret = p->func; +out: + up(&idle_sem); + return ret; +} +EXPORT_SYMBOL_GPL(get_idle); + +#ifdef CONFIG_SYSFS +#define KERNEL_ATTR_RW(_name) \ +static struct subsys_attribute _name##_attr = \ + __ATTR(_name, 0644, _name##_show, _name##_store) + +static struct idlep_kobject +{ + struct kobject kobj; +} idle_kobj; + +static ssize_t idle_ctrl_show(struct 
subsystem *subsys, char *page) +{ + ssize_t ret; + char *star = ""; + const char *name = "default"; + + down(&idle_sem); + if (curr_idle) { + name = curr_idle->name; + if (curr_idle->freeze) + star = "*"; + } + ret = sprintf(page, "%s%s\n", star, name); + up(&idle_sem); + + return ret; +} + +static ssize_t idle_ctrl_store(struct subsystem *subsys, + const char *buf, size_t len) +{ + struct list_head *curr; + struct idle_info *p; + ssize_t ret = -EBUSY; + + down(&idle_sem); + + if (curr_idle && curr_idle->freeze) + goto out; + + list_for_each(curr, &idle_elements) { + int size; + p = list_entry(curr, struct idle_info, list); + + size = strlen(p->name); + if (len <= size) + continue; + if (!strncmp(p->name, buf, size)) + break; + } + if (curr == &idle_elements) { + ret = -EINVAL; + goto out; + } + + /* + * This idle routine may have been registered to + * not allow users to add or remove this. + */ + if (p->freeze) + goto out; + + __set_idle(p); + + ret = len; +out: + up(&idle_sem); + + return ret; +} + +KERNEL_ATTR_RW(idle_ctrl); + +static ssize_t idle_methods_show(struct subsystem *subsys, char *page) +{ + struct list_head *curr; + struct idle_info *p; + ssize_t len = 0; + + down(&idle_sem); + list_for_each(curr, &idle_elements) { + p = list_entry(curr, struct idle_info, list); + if (len + 3 + strlen(p->name) >= PAGE_SIZE) { + printk("idle functions overflowed sysfs??\n"); + break; + } + len += sprintf(page+len, "%s%s%s", + len ? " " : "", + p->freeze ? 
"*" : "", + p->name); + } + if (len + 2 < PAGE_SIZE) + len += sprintf(page+len, "\n"); + + up(&idle_sem); + return len; +} + +static ssize_t idle_methods_store(struct subsystem *subsys, + const char *buf, size_t len) +{ + /* do nothing */ + return len; +} + +KERNEL_ATTR_RW(idle_methods); + +static struct attribute * idle_attrs[] = { + &idle_ctrl_attr.attr, + &idle_methods_attr.attr, + NULL +}; + +static struct attribute_group idle_attr_group = { + .attrs = idle_attrs, +}; + +static int __init idle_setup_sysfs(void) +{ + int err; + + memset(&idle_kobj, 0, sizeof(idle_kobj)); + err = kobject_set_name(&idle_kobj.kobj, "%s", "idle"); + if (err) + goto out; + + kobj_set_kset_s(&idle_kobj, kernel_subsys); + + idle_kobj.kobj.parent = &kernel_subsys.kset.kobj; + err = kobject_register(&idle_kobj.kobj); + if (err) + goto out; + + err = sysfs_create_group(&idle_kobj.kobj, + &idle_attr_group); + if (err) + goto out; + + return 0; +out: + printk(KERN_INFO "Problem setting up sysfs idle_ctrl\n"); + return 0; +} +#endif /* CONFIG_SYSFS */ + +static int __init idle_setup(void) +{ + idle_default = idle_func; + +#ifdef CONFIG_SYSFS + idle_setup_sysfs(); +#endif + return 0; +} + +late_initcall(idle_setup); Index: linux-2.6.15-rc2-git5/arch/i386/Kconfig =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/i386/Kconfig 2005-11-28 20:31:24.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/i386/Kconfig 2005-11-28 20:31:47.000000000 -0500 @@ -45,6 +45,10 @@ bool default y +config DYNAMIC_IDLE + bool + default y + source "init/Kconfig" menu "Processor type and features" Index: linux-2.6.15-rc2-git5/arch/i386/kernel/apm.c =================================================================== --- linux-2.6.15-rc2-git5.orig/arch/i386/kernel/apm.c 2005-11-28 20:31:24.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/i386/kernel/apm.c 2005-11-28 20:31:47.000000000 -0500 @@ -225,6 +225,7 @@ #include <linux/smp_lock.h> #include <linux/dmi.h> #include 
<linux/suspend.h> +#include <linux/idle.h> #include <asm/system.h> #include <asm/uaccess.h> @@ -2220,6 +2221,9 @@ { } }; +static struct idle_info apm_idle; +#define APM_IDLE_NAME "apm" + /* * Just start the APM thread. We do NOT want to do APM BIOS * calls from anything but the APM thread, if for no other reason @@ -2373,8 +2377,14 @@ if (HZ != 100) idle_period = (idle_period * HZ) / 100; if (idle_threshold < 100) { - original_pm_idle = pm_idle; - pm_idle = apm_cpu_idle; + memset(&apm_idle, 0, sizeof(apm_idle)); + apm_idle.name = APM_IDLE_NAME; + apm_idle.func = apm_cpu_idle; + apm_idle.freeze = 1; + register_idle(&apm_idle); + + original_pm_idle = get_idle(NULL); + set_idle(APM_IDLE_NAME); set_pm_idle = 1; } @@ -2386,7 +2396,26 @@ int error; if (set_pm_idle) { - pm_idle = original_pm_idle; + int tries = 0; + int ret; + set_idle(NULL); + do { + if ((ret = unregister_idle(APM_IDLE_NAME)) == 0) + break; + /* + * for some reason the idle function is being used. + * Wait a little and then try again. 
+			 */
+			if (ret == -EINVAL) {
+				printk(KERN_WARNING
+				       "APM idle function never registered?\n");
+				break;
+			}
+			yield();
+		} while (tries++ < 10);
+		if (tries > 10)
+			printk(KERN_WARNING
+			       "Unable to unregister APM idle function\n");
 		/*
 		 * We are about to unload the current idle thread pm callback
 		 * (pm_idle), Wait for all processors to update cached/local
Index: linux-2.6.15-rc2-git5/arch/ia64/Kconfig
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/ia64/Kconfig	2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/ia64/Kconfig	2005-11-28 20:31:47.000000000 -0500
@@ -62,6 +62,10 @@
 	bool
 	default y
 
+config DYNAMIC_IDLE
+	bool
+	default y
+
 choice
 	prompt "System type"
 	default IA64_GENERIC
Index: linux-2.6.15-rc2-git5/arch/ia64/kernel/acpi.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/ia64/kernel/acpi.c	2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/ia64/kernel/acpi.c	2005-11-28 20:31:47.000000000 -0500
@@ -60,8 +60,6 @@
 
 #define PREFIX			"ACPI: "
 
-void (*pm_idle) (void);
-EXPORT_SYMBOL(pm_idle);
 void (*pm_power_off) (void);
 EXPORT_SYMBOL(pm_power_off);
 
Index: linux-2.6.15-rc2-git5/arch/ia64/kernel/process.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/ia64/kernel/process.c	2005-11-28 20:31:24.000000000 -0500
+++ linux-2.6.15-rc2-git5/arch/ia64/kernel/process.c	2005-11-28 20:31:47.000000000 -0500
@@ -31,6 +31,7 @@
 #include <linux/interrupt.h>
 #include <linux/delay.h>
 #include <linux/kprobes.h>
+#include <linux/idle.h>
 
 #include <asm/cpu.h>
 #include <asm/delay.h>
@@ -289,7 +290,7 @@
 			if (mark_idle)
 				(*mark_idle)(1);
 
-			idle = pm_idle;
+			idle = idle_func;
 			if (!idle)
 				idle = default_idle;
 			(*idle)();
Index: linux-2.6.15-rc2-git5/arch/ia64/kernel/setup.c
===================================================================
--- linux-2.6.15-rc2-git5.orig/arch/ia64/kernel/setup.c
2005-11-28 20:31:24.000000000 -0500 +++ linux-2.6.15-rc2-git5/arch/ia64/kernel/setup.c 2005-11-29 07:46:59.000000000 -0500 @@ -43,6 +43,7 @@ #include <linux/initrd.h> #include <linux/platform.h> #include <linux/pm.h> +#include <linux/idle.h> #include <asm/ia32.h> #include <asm/machvec.h> @@ -738,6 +739,11 @@ ia64_max_cacheline_size = max; } +struct idle_info idle_default = { + .name = "default", + .func = default_idle +}; + /* * cpu_init() initializes state that is per-CPU. This function acts * as a 'CPU state barrier', nothing should get across. @@ -861,7 +867,10 @@ /* size of physical stacked register partition plus 8 bytes: */ __get_cpu_var(ia64_phys_stacked_size_p8) = num_phys_stacked*8 + 8; platform_cpu_init(); - pm_idle = default_idle; + + register_idle(&idle_default); + + set_idle("default"); } void ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 14:19 ` Steven Rostedt
@ 2005-11-29 14:50 ` Andi Kleen
2005-11-29 15:42 ` Steven Rostedt
0 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2005-11-29 14:50 UTC (permalink / raw)
To: Steven Rostedt
Cc: Andi Kleen, Ingo Molnar, acpi-devel, len.brown, nando, rlrevell,
linux-kernel, paulmck, kr, tglx, pluto, john.cooper, bene, dwalker,
trini, george, akpm

On Tue, Nov 29, 2005 at 09:19:31AM -0500, Steven Rostedt wrote:
> > And in practice the CPU will run so hot that only benchmarkers like it.
>
> Why would it run hot? What's the difference between polling and doing
> other things. How many transistors does it take to poll?

It will prevent the CPU from going into sleep states and essentially
keep most of it enabled.

> >
> > I think switching idle is the wrong way to do. We should rather
> > fix the various problems.
> >
> > For fixing the TSC issue it is 100% the wrong approach Imho.
>
> I would only say 80% the wrong approach, but that's me ;-)
>
> > Basically software has to live with TSCs being unsynchronized
> > and gettimeofday should do the right thing (and if not it should be fixed)
>
> I guess the biggest complaint most have is that the rdtsc _is_ the
> fastest way to read a clock. If it isn't reliable, then what good is

It's the fastest way to read something which needs quite complex
knowledge to turn into a reliable clock value. In general only
the kernel has this knowledge. And gettimeofday is optimized
to give you the fastest reliable clock.

> it? It's unfortunate that Intel didn't solidify the clock usage. Yes,
> use HPET, or something else, but those are slower, and may not be on all
> systems. Every system that I owned had a tsc but for critical systems
> it isn't up to par (what a shame).

Just use gettimeofday. It shields you from all that and when
the hardware supports it is quite fast too.

> > > system has been idle for some time. E.g. cpufreqd could sample idle time
> > > and turn on/off idle=poll. High-performance setups could enable it all
> > > the time.
> >
> > And upgrade their server air condition or issue additional ear protection
> > to the desktop user? Most likely you will just drive the CPUs into
> > thermal throttle at some point with that, not get more performance anyways.
>
> Again, what would make it so hot? It is a waste of CPU cycles, and does
> waste energy that way, but does it really heat up the CPU that much?

Yes it does.

-Andi

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2] 2005-11-29 14:50 ` Andi Kleen @ 2005-11-29 15:42 ` Steven Rostedt 0 siblings, 0 replies; 65+ messages in thread From: Steven Rostedt @ 2005-11-29 15:42 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, acpi-devel, len.brown, nando, rlrevell, linux-kernel, paulmck, kr, tglx, pluto, john.cooper, bene, dwalker, trini, george, akpm On Tue, 2005-11-29 at 15:50 +0100, Andi Kleen wrote: > On Tue, Nov 29, 2005 at 09:19:31AM -0500, Steven Rostedt wrote: > > > And in practice the CPU will run so hot that only benchmarkers like it. > > > > Why would it run hot? What's the difference between polling and doing > > other things. How many transistors does it take to poll? > > It will prevent the CPU from going into sleep states and essentially > keep most of it enabled. Well, there's one thing that my patch _does_ help with. (And it has just helped me now). If you boot up with idle=poll and forget about it, you can check what idle routine is being used and switch out of poll without rebooting. (like I'm doing right now :-) -- Steve ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 18:05 ` Andi Kleen
2005-11-29 14:19 ` Steven Rostedt
@ 2005-12-02 1:27 ` Max Krasnyansky
2005-12-02 1:45 ` Andi Kleen
1 sibling, 1 reply; 65+ messages in thread
From: Max Krasnyansky @ 2005-12-02 1:27 UTC (permalink / raw)
To: Andi Kleen
Cc: Ingo Molnar, Steven Rostedt, acpi-devel, len.brown, nando, rlrevell,
linux-kernel, paulmck, kr, tglx, pluto, john.cooper, bene, dwalker,
trini, george, akpm

Andi Kleen wrote:
> Ingo Molnar <mingo@elte.hu> writes:
>>> If it's just for some sort of instrumentation, run NR_CPUS instances
>>> of a niced-down busyloop, pin each one to a different CPU? That way
>>> the idle function doesn't get called at all..
>> idle=poll is also frequently done for performance reasons [it reduces
>> idle wakeup latency by 10 usecs]
>
> And it's obsolete on CPUs with monitor/mwait.

There are some platforms, for example IBM ZPro Xeon based machines, where
monitor/mwait seems to trigger some kind of SMM and introduce horrible
latencies. With idle=poll ZPros show pretty good worst case latencies, in the
order of 10usec (tested with RTAI/Fusion). With default idle (ie mwait) even
average latency is in hundreds of milliseconds.
You might argue that it's a bug in their HW design or something but as it
stands today I wouldn't say that monitor/mwait obsoletes idle=poll.

Also IMO saying that CPU will run too hot with idle=poll is basically saying
that those CPUs cannot be used for simulations and stuff which run flat out
for days (months actually).
Which is obviously not true (again speaking from experience :)).

Max

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-12-02 1:27 ` Max Krasnyansky
@ 2005-12-02 1:45 ` Andi Kleen
2005-12-03 2:17 ` Max Krasnyansky
0 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2005-12-02 1:45 UTC (permalink / raw)
To: Max Krasnyansky
Cc: Andi Kleen, Ingo Molnar, Steven Rostedt, acpi-devel, len.brown,
nando, rlrevell, linux-kernel, paulmck, kr, tglx, pluto, john.cooper,
bene, dwalker, trini, george, akpm

> Also IMO saying that CPU will run too hot with idle=poll is basically
> saying that those
> CPUs cannot be used for simulations and stuff which run flat out for days
> (months actually).
> Which is obviously not true (again speaking from experience :)).

The CPUs can be used, but in many cooling setups
(AirCon in complete data centers, cooling in blade racks, laptops)
the cooling is now often designed not to handle
the maximum thermal output of all systems in parallel, but instead
to throttle the systems when things get too hot. This usually
works because in most workloads systems are more often idle
than busy, so no throttling is needed.

On desktops it probably won't throttle, but just become noisy
when all the fans spin up.

All things you don't really want.

Super computing is different of course, but even there the maximum
capacity of the air conditioning often limits how many CPUs you can buy.
And you need all the help you can get.

That said you're right that there is still a small niche
where idle=poll makes sense, but it's definitely nothing
that should be encouraged to be used regularly like that
original patch would.

-Andi

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2] 2005-12-02 1:45 ` Andi Kleen @ 2005-12-03 2:17 ` Max Krasnyansky 0 siblings, 0 replies; 65+ messages in thread From: Max Krasnyansky @ 2005-12-03 2:17 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Steven Rostedt, acpi-devel, len.brown, nando, rlrevell, linux-kernel, paulmck, kr, tglx, pluto, john.cooper, bene, dwalker, trini, george, akpm Andi Kleen wrote: >> Also IMO saying that CPU will run too hot with idle=poll is basically >> saying that those >> CPUs cannot be used for simulations and stuff which run flat out for days >> (months actually). >> Which is obviously not true (again speaking from experience :)). > > The CPUs can be used, but many cooling setups > (both AirCon in complete data centers, cooling in Blade Racks, laptops) > the cooling is now often designed to not cool > the maximum thermal output of all systems in parallel, but instead > throttle the systems when things get too hot. This usually > works because in most workloads systems are more often idle > than busy, so no throttling is needed. > > On desktops it probably won't throttle, but just become noisy > when all the fans spin up. > > All things you don't really want. We do it (simulations that is) on normal 1U and desktop machines. No special cooling and stuff. And it does not cause any problems. Granted we don't use cheap/crappy machines but still it's unmodified off-the-shelf HW. btw That ZPro machine that I mentioned used to run with idle=poll for weeks and fans would never spin up unless you put real load on it. > Super computing is different of course, but even there maximum > capacity of the air condition often limits how many CPUs you can buy. > And you need all the help you can get. > > That said you're right that there is still a small niche > where idle=poll makes sense, but it's definitely nothing > that should be encouraged to be used regularly like that > original patch would. Agreed. 
Max ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 3:42 ` Steven Rostedt
2005-11-29 4:01 ` Andrew Morton
@ 2005-11-29 4:22 ` john stultz
2005-11-29 14:22 ` Steven Rostedt
1 sibling, 1 reply; 65+ messages in thread
From: john stultz @ 2005-11-29 4:22 UTC (permalink / raw)
To: Steven Rostedt
Cc: Andrew Morton, mingo, acpi-devel, len.brown, nando, rlrevell,
linux-kernel, paulmck, kr, tglx, pluto, john.cooper, bene, dwalker,
trini, george

On Mon, 2005-11-28 at 22:42 -0500, Steven Rostedt wrote:
> On Mon, 2005-11-28 at 19:02 -0800, Andrew Morton wrote:
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > This patch creates a directory in /sys/kernel called idle.
> > >
> >
> > At no point do you appear to explain _why_ the kernel needs this feature?
>
> Sorry about that. This originally came up when we had problems with the
> AMD64 x2 in the -rt patch. It was noted that the TSCs would get very
> far out of sync and cause problems. The way to solve this was to set
> idle=poll. The original patch I sent was to allow the user to change to
> idle=poll dynamically. This way they could switch to the poll_idle and
> run there tests (requiring tsc not to drift) and then switch back to the
> default idle to save on electricity.

The problem with this is that this must be a one way transition. That
is, once the TSCs have become unsynchronized, there is no use going back
to using the polling idle unless you add some code to re-sync the TSCs
which would be ugly to do after the system has booted.

Using idle=poll (for anything other than debugging) is really a worst
case workaround for systems that do not have alternative clocksources
like ACPI PM or HPET.

It's an interesting bit of code, but I'm not really sure I understand its
usefulness.

thanks
-john

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2] 2005-11-29 4:22 ` john stultz @ 2005-11-29 14:22 ` Steven Rostedt 0 siblings, 0 replies; 65+ messages in thread From: Steven Rostedt @ 2005-11-29 14:22 UTC (permalink / raw) To: john stultz Cc: Andrew Morton, mingo, acpi-devel, len.brown, nando, rlrevell, linux-kernel, paulmck, kr, tglx, pluto, john.cooper, bene, dwalker, trini, george On Mon, 2005-11-28 at 20:22 -0800, john stultz wrote: > On Mon, 2005-11-28 at 22:42 -0500, Steven Rostedt wrote: > > On Mon, 2005-11-28 at 19:02 -0800, Andrew Morton wrote: > > > Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > > > This patch creates a directory in /sys/kernel called idle. > > > > > > > > > > At no point do you appear to explain _why_ the kernel needs this feature? > > > > Sorry about that. This originally came up when we had problems with the > > AMD64 x2 in the -rt patch. It was noted that the TSCs would get very > > far out of sync and cause problems. The way to solve this was to set > > idle=poll. The original patch I sent was to allow the user to change to > > idle=poll dynamically. This way they could switch to the poll_idle and > > run there tests (requiring tsc not to drift) and then switch back to the > > default idle to save on electricity. > > The problem with this is that this must be a one way transition. That > is, once the TSCs have become unsynchronized, there is no use going back > to using the polling idle unless you add some code to re-sync the TSCs > which would be ugly to do after the system has booted. > I've thought about that too. But this patch does allow you to start with idle=poll and then switch back. Also, if you do lock to a cpu, you don't need to worry about the tsc from slipping if you switch to idle=poll. -- Steve > Using idle=poll (for anything other then debugging) is really a worst > case workaround for systems that do not have alternative clocksources > like ACPI PM or HPET. 
> > Its an interesting bit of code, but I'm not really sure I understand its > usefulness. > > thanks > -john > > > ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 2:48 ` [RFC][PATCH] Runtime switching of the idle function [take 2] Steven Rostedt
2005-11-29 3:02 ` Andrew Morton
@ 2005-11-29 13:08 ` Pavel Machek
2005-12-18 15:26 ` Steven Rostedt
1 sibling, 1 reply; 65+ messages in thread
From: Pavel Machek @ 2005-11-29 13:08 UTC (permalink / raw)
To: Steven Rostedt
Cc: Ingo Molnar, acpi-devel, len.brown, Andrew Morton,
Fernando Lopez-Lezcano, Lee Revell, linux-kernel, Paul E. McKenney,
K.R. Foley, Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

Hi!

> Description:
>
> This patch creates a directory in /sys/kernel called idle. This
> directory contains two files: idle_ctrl and idle_methods. Reading
> idle_ctrl will show the function that is currently being used for idle,
> and idle_methods shows the available methods for the user to send write
> into idle_ctrl to change which function to use for idle.

Pretty ugly interface, I'd say... is listing function really necessary?

Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2] 2005-11-29 13:08 ` Pavel Machek @ 2005-12-18 15:26 ` Steven Rostedt 0 siblings, 0 replies; 65+ messages in thread From: Steven Rostedt @ 2005-12-18 15:26 UTC (permalink / raw) To: Pavel Machek Cc: Ingo Molnar, acpi-devel, len.brown, Andrew Morton, Fernando Lopez-Lezcano, Lee Revell, linux-kernel, Paul E. McKenney, K.R. Foley, Thomas Gleixner, pluto, john cooper, Benedikt Spranger, Daniel Walker, Tom Rini, George Anzinger On Tue, 29 Nov 2005, Pavel Machek wrote: > Hi! > > > Description: > > > > This patch creates a directory in /sys/kernel called idle. This > > directory contains two files: idle_ctrl and idle_methods. Reading > > idle_ctrl will show the function that is currently being used for idle, > > and idle_methods shows the available methods for the user to send write > > into idle_ctrl to change which function to use for idle. > > Pretty ugly interface, I'd say... is listing function really neccessary? > What interface would you prefer? And the listing was a feature request made by Ingo. But this is pretty much moot, since the patch is not going any further than the RT patch. And even then, it probably is only temporary, if it is even still in there (I haven't checked). --Steve ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: 2.6.14-rt13 2005-11-18 22:05 ` 2.6.14-rt13 Fernando Lopez-Lezcano 2005-11-18 22:07 ` 2.6.14-rt13 Ingo Molnar @ 2005-11-18 22:13 ` Lee Revell 2005-11-18 22:32 ` 2.6.14-rt13 Vojtech Pavlik 1 sibling, 1 reply; 65+ messages in thread From: Lee Revell @ 2005-11-18 22:13 UTC (permalink / raw) To: Fernando Lopez-Lezcano Cc: Ingo Molnar, linux-kernel, Paul E. McKenney, K.R. Foley, Steven Rostedt, Thomas Gleixner, pluto, john cooper, Benedikt Spranger, Daniel Walker, Tom Rini, George Anzinger On Fri, 2005-11-18 at 14:05 -0800, Fernando Lopez-Lezcano wrote: > On Fri, 2005-11-18 at 16:54 -0500, Lee Revell wrote: > > On Fri, 2005-11-18 at 10:02 -0800, Fernando Lopez-Lezcano wrote: > > > You mentioned before that the TSC's from both cpus could drift from > > > each other over time. Assuming that is the source of timing (I have no > > > idea) that could explain the behavior of Jack, it gets a reference > > > time from one of the cpus and then compares that with what it gets > > > from either cpu depending on where it is running at a given time. If > > > it is the same cpu all is fine, if it is the other and it has drifted > > > then the warning is printed. > > > > Yes, JACK uses rdtsc() for microsecond resolution timing and assumes > > that the TSCs are in sync. > > > > I've asked on this list what a better time source could be and didn't > > get any useful responses, people just told me "use gettimeofday()" which > > is WAY too slow. > > Arghhh, at least I take this as a confirmation that the TSCs do drift > and there is no workaround. It currently makes the -rt/Jack combination > not very useful, at least in my tests. > > Is there a way to resync the TSCs? I don't think so. A better question is what mechanism have the hardware vendors provided to replace the apparently-no-longer-reliable TSC for cheap high res timing on modern machines. Unfortunately I suspect the answer at this point is "nothing, you're screwed". 
I've read that gettimeofday() does not have to enter the kernel on x86-64, maybe it's fast enough, though almost certainly orders of magnitude slower than rdtsc(). It seems like a huge step backwards for any apps with high res timing requirements. Lee ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: 2.6.14-rt13
2005-11-18 22:13 ` 2.6.14-rt13 Lee Revell
@ 2005-11-18 22:32 ` Vojtech Pavlik
2005-11-19 2:28 ` 2.6.14-rt13 George Anzinger
0 siblings, 1 reply; 65+ messages in thread
From: Vojtech Pavlik @ 2005-11-18 22:32 UTC (permalink / raw)
To: Lee Revell
Cc: Fernando Lopez-Lezcano, Ingo Molnar, linux-kernel, Paul E. McKenney,
K.R. Foley, Steven Rostedt, Thomas Gleixner, pluto, john cooper,
Benedikt Spranger, Daniel Walker, Tom Rini, George Anzinger

On Fri, Nov 18, 2005 at 05:13:03PM -0500, Lee Revell wrote:
> On Fri, 2005-11-18 at 14:05 -0800, Fernando Lopez-Lezcano wrote:
> > On Fri, 2005-11-18 at 16:54 -0500, Lee Revell wrote:
> > > On Fri, 2005-11-18 at 10:02 -0800, Fernando Lopez-Lezcano wrote:
> > > > You mentioned before that the TSC's from both cpus could drift from
> > > > each other over time. Assuming that is the source of timing (I have no
> > > > idea) that could explain the behavior of Jack, it gets a reference
> > > > time from one of the cpus and then compares that with what it gets
> > > > from either cpu depending on where it is running at a given time. If
> > > > it is the same cpu all is fine, if it is the other and it has drifted
> > > > then the warning is printed.
> > >
> > > Yes, JACK uses rdtsc() for microsecond resolution timing and assumes
> > > that the TSCs are in sync.
> > >
> > > I've asked on this list what a better time source could be and didn't
> > > get any useful responses, people just told me "use gettimeofday()" which
> > > is WAY too slow.
> >
> > Arghhh, at least I take this as a confirmation that the TSCs do drift
> > and there is no workaround. It currently makes the -rt/Jack combination
> > not very useful, at least in my tests.
> >
> > Is there a way to resync the TSCs?
>
> I don't think so. A better question is what mechanism have the hardware
> vendors provided to replace the apparently-no-longer-reliable TSC for
> cheap high res timing on modern machines. Unfortunately I suspect the
> answer at this point is "nothing, you're screwed".

There are many mechanisms to keep time:

1) RTC: 0.5 sec resolution, interrupts
2) PIT: takes ages to read, overflows at each timer interrupt
3) PMTMR: takes ages to read, overflows in approx 4 seconds, no interrupt
4) HPET: slow to read, overflows in 5 minutes. Nice, but usually not present.
5) TSC: fast, completely unreliable. Frequency changes, CPUs diverge over time.
6) LAPIC: reasonably fast, unreliable, per-cpu

Userspace can only use 1), 4) and 5). mplayer uses the RTC to
synchronize, using it as a 1 kHz interrupt source.

The kernel does quite a lot of magic and jumps through many hoops to
make a reliable and fast time source combining these.

> I've read that gettimeofday() does not have to enter the kernel on
> x86-64, maybe it's fast enough, though almost certainly orders of
> magnitude slower than rdtsc().

It depends on the hardware config, and kernel version. With my latest
patch it takes approximately 175 ns on a reasonably fast CPU to do
gettimeofday() from userspace. And much better results will be possible
(~5x better) when RDTSCP enabled CPUs become available.

This patch still has problems, and as such I'll still have to rewrite
significant portions before I release it.

Anyway, current gettimeofday() on SMP AMD x86-64 can be as bad as 1500ns.

> It seems like a huge step backwards for
> any apps with high res timing requirements.

gettimeofday() is the only guaranteed working mechanism. And it's as
fast as the hardware allows.

--
Vojtech Pavlik
SuSE Labs, SuSE CR

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: 2.6.14-rt13 2005-11-18 22:32 ` 2.6.14-rt13 Vojtech Pavlik @ 2005-11-19 2:28 ` George Anzinger 2005-11-19 7:45 ` 2.6.14-rt13 Vojtech Pavlik 0 siblings, 1 reply; 65+ messages in thread From: George Anzinger @ 2005-11-19 2:28 UTC (permalink / raw) To: Vojtech Pavlik Cc: Lee Revell, Fernando Lopez-Lezcano, Ingo Molnar, linux-kernel, Paul E. McKenney, K.R. Foley, Steven Rostedt, Thomas Gleixner, pluto, john cooper, Benedikt Spranger, Daniel Walker, Tom Rini Vojtech Pavlik wrote: > On Fri, Nov 18, 2005 at 05:13:03PM -0500, Lee Revell wrote: > >>On Fri, 2005-11-18 at 14:05 -0800, Fernando Lopez-Lezcano wrote: >> >>>On Fri, 2005-11-18 at 16:54 -0500, Lee Revell wrote: >>> >>>>On Fri, 2005-11-18 at 10:02 -0800, Fernando Lopez-Lezcano wrote: >>>> >>>>>You mentioned before that the TSC's from both cpus could drift from >>>>>each other over time. Assuming that is the source of timing (I have no >>>>>idea) that could explain the behavior of Jack, it gets a reference >>>>>time from one of the cpus and then compares that with what it gets >>>>>from either cpu depending on where it is running at a given time. If >>>>>it is the same cpu all is fine, if it is the other and it has drifted >>>>>then the warning is printed. >>>> >>>>Yes, JACK uses rdtsc() for microsecond resolution timing and assumes >>>>that the TSCs are in sync. >>>> >>>>I've asked on this list what a better time source could be and didn't >>>>get any useful responses, people just told me "use gettimeofday()" which >>>>is WAY too slow. >>> >>>Arghhh, at least I take this as a confirmation that the TSCs do drift >>>and there is no workaround. It currently makes the -rt/Jack combination >>>not very useful, at least in my tests. >>> >>>Is there a way to resync the TSCs? >> >>I don't think so. A better question is what mechanism have the hardware >>vendors provided to replace the apparently-no-longer-reliable TSC for >>cheap high res timing on modern machines. 
Unfortunately I suspect the >>answer at this point is "nothing, you're screwed". > > > There are many mechanisms to keep time: > > 1) RTC: 0.5 sec resolution, interrupts > 2) PIT: takes ages to read, overflows at each timer interrupt > 3) PMTMR: takes ages to read, overflows in approx 4 seconds, no interrupt The PMTMR can be read from user space (if you can find it). See the "iopl" man page. It is an I/O access and so is slow, but you can read it. Finding it is another matter. It does not have a fixed address (i.e. it differs from machine to machine, but is constant on any given machine). The boot code roots it out of an info block put in memory by the BIOS. I suppose one could put a printk in the boot code to disclose it... George -- > 4) HPET: slow to read, overflows in 5 minutes. Nice, but usually not present. > 5) TSC: fast, completely unreliable. Frequency changes, CPUs diverge over time. > 6) LAPIC: reasonably fast, unreliable, per-cpu > > Userspace can only use 1), 4) and 5). mplayer uses the RTC to > synchronize, using it as a 1 kHz interrupt source. > > The kernel does quite a lot of magic and jumps through many hoops to > make a reliable and fast time source combining these. > > >>I've read that gettimeofday() does not have to enter the kernel on >>x86-64, maybe it's fast enough, though almost certainly orders of >>magnitude slower than rdtsc(). > > > It depends on the hardware config, and kernel version. With my latest > patch it takes approximately 175 ns on a reasonably fast CPU to do > gettimeofday() from userspace. And much better results will be possible > (~5x better) when RDTSCP enabled CPUs become available. > > This patch still has problems, and as such I'll still have to rewrite > significant portions before I release it. > > Anyway, current gettimeofday() on SMP AMD x86-64 can be as bad as 1500ns. > > >>It seems like a huge step backwards for >>any apps with high res timing requirements. 
> > > gettimeofday() is the only guaranteed working mechanism. And it's as > fast as the hardware allows. > -- George Anzinger george@mvista.com HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/ ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: 2.6.14-rt13
2005-11-19 2:28 ` 2.6.14-rt13 George Anzinger
@ 2005-11-19 7:45 ` Vojtech Pavlik
2005-11-19 18:27 ` 2.6.14-rt13 Lee Revell
0 siblings, 1 reply; 65+ messages in thread
From: Vojtech Pavlik @ 2005-11-19 7:45 UTC (permalink / raw)
To: George Anzinger
Cc: Lee Revell, Fernando Lopez-Lezcano, Ingo Molnar, linux-kernel,
Paul E. McKenney, K.R. Foley, Steven Rostedt, Thomas Gleixner,
pluto, john cooper, Benedikt Spranger, Daniel Walker, Tom Rini

On Fri, Nov 18, 2005 at 06:28:24PM -0800, George Anzinger wrote:

> >There are many mechanisms to keep time:
> >
> >1) RTC: 0.5 sec resolution, interrupts
> >2) PIT: takes ages to read, overflows at each timer interrupt
> >3) PMTMR: takes ages to read, overflows in approx 4 seconds, no interrupt
>
> The PMTMR can be read from user space (if you can find it). See the
> "iopl" man page. It is an I/O access and so is slow, but you can read
> it.

Yes, however this must be limited to a small number of privileged
applications - iopl() is only available to CAP_SYS_RAWIO IIRC, and thus
it's not suitable for general use.

> Finding it is another matter. It does not have a fixed address (i.e.
> it differs from machine to machine, but is constant on any given
> machine). The boot code roots it out of an info block put in memory
> by the BIOS. I suppose one could put a printk in the boot code to
> disclose it...

There is really no reason to do that, since the time to read it (~1200
ns) is much more than the time to enter the kernel (less than 200 ns),
so gettimeofday() is definitely easier to use and also doesn't overflow.

--
Vojtech Pavlik
SuSE Labs, SuSE CR

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: 2.6.14-rt13
2005-11-19 7:45 ` 2.6.14-rt13 Vojtech Pavlik
@ 2005-11-19 18:27 ` Lee Revell
0 siblings, 0 replies; 65+ messages in thread
From: Lee Revell @ 2005-11-19 18:27 UTC (permalink / raw)
To: Vojtech Pavlik
Cc: George Anzinger, Fernando Lopez-Lezcano, Ingo Molnar, linux-kernel,
Paul E. McKenney, K.R. Foley, Steven Rostedt, Thomas Gleixner,
pluto, john cooper, Benedikt Spranger, Daniel Walker, Tom Rini

On Sat, 2005-11-19 at 08:45 +0100, Vojtech Pavlik wrote:
> On Fri, Nov 18, 2005 at 06:28:24PM -0800, George Anzinger wrote:
> > Finding it is another matter. It does not have a fixed address (i.e.
> > it differs from machine to machine, but is constant on any given
> > machine). The boot code roots it out of an info block put in memory
> > by the BIOS. I suppose one could put a printk in the boot code to
> > disclose it...
>
> There is really no reason to do that, since the time to read it (~1200
> ns) is much more than the time to enter the kernel (less than 200 ns),
> so gettimeofday() is definitely easier to use and also doesn't overflow.
>

Thanks very much, you have answered my question. We would prefer
gettimeofday() anyway for portability, so if the plan is to make it
faster then we can deal with losing the TSC.

Lee

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: 2.6.14-rt13
2005-11-15 9:08 2.6.14-rt13 Ingo Molnar
` (2 preceding siblings ...)
2005-11-18 18:02 ` 2.6.14-rt13 Fernando Lopez-Lezcano
@ 2005-11-21 21:32 ` Fernando Lopez-Lezcano
2005-11-21 21:41 ` 2.6.14-rt13 john stultz
` (2 more replies)
3 siblings, 3 replies; 65+ messages in thread
From: Fernando Lopez-Lezcano @ 2005-11-21 21:32 UTC (permalink / raw)
To: Ingo Molnar
Cc: nando, linux-kernel, Paul E. McKenney, K.R. Foley, Steven Rostedt,
Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

On Tue, 2005-11-15 at 10:08 +0100, Ingo Molnar wrote:
> i have released the 2.6.14-rt13 tree, which can be downloaded from the
> usual place:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> lots of fixes in this release affecting all supported architectures, all
> across the board. Big MIPS update from John Cooper.

Can someone tell me if 2.6.14-rt13 is supposed to be fixed re: the
problems I was having with random screensaver triggering and keyboard
repeats?

It is apparently not fixed.

I just had a short burst of key repeats and saw one random screen blank.
Right now everything seems normal but I was not hallucinating :-)

-- Fernando
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: 2.6.14-rt13
2005-11-21 21:32 ` 2.6.14-rt13 Fernando Lopez-Lezcano
@ 2005-11-21 21:41 ` john stultz
[not found] ` <20051121221511.GA7255@elte.hu>
2005-11-22 11:19 ` 2.6.14-rt13 Ingo Molnar
2 siblings, 0 replies; 65+ messages in thread
From: john stultz @ 2005-11-21 21:41 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Ingo Molnar, linux-kernel, Paul E. McKenney, K.R. Foley,
Steven Rostedt, Thomas Gleixner, pluto, john cooper,
Benedikt Spranger, Daniel Walker, Tom Rini, George Anzinger

On Mon, 2005-11-21 at 13:32 -0800, Fernando Lopez-Lezcano wrote:
> On Tue, 2005-11-15 at 10:08 +0100, Ingo Molnar wrote:
> > i have released the 2.6.14-rt13 tree, which can be downloaded from the
> > usual place:
> >
> > http://redhat.com/~mingo/realtime-preempt/
> >
> > lots of fixes in this release affecting all supported architectures, all
> > across the board. Big MIPS update from John Cooper.
>
> Can someone tell me if 2.6.14-rt13 is supposed to be fixed re: the
> problems I was having with random screensaver triggering and keyboard
> repeats?
>
> It is apparently not fixed.
>
> I just had a short burst of key repeats and saw one random screen blank.
> Right now everything seems normal but I was not hallucinating :-)

Hmm. Sounds like timekeeping issues, could you send me dmesg output?

thanks
-john
^ permalink raw reply [flat|nested] 65+ messages in thread
[parent not found: <20051121221511.GA7255@elte.hu>]
* test time-warps [was: Re: 2.6.14-rt13]
[not found] ` <20051121221511.GA7255@elte.hu>
@ 2005-11-21 22:19 ` Ingo Molnar
2005-11-21 23:08 ` Fernando Lopez-Lezcano
` (3 more replies)
0 siblings, 4 replies; 65+ messages in thread
From: Ingo Molnar @ 2005-11-21 22:19 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: linux-kernel, Paul E. McKenney, K.R. Foley, Steven Rostedt,
Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:

> On Tue, 2005-11-15 at 10:08 +0100, Ingo Molnar wrote:
> > i have released the 2.6.14-rt13 tree, which can be downloaded from the
> > usual place:
> >
> > http://redhat.com/~mingo/realtime-preempt/
> >
> > lots of fixes in this release affecting all supported architectures, all
> > across the board. Big MIPS update from John Cooper.
>
> Can someone tell me if 2.6.14-rt13 is supposed to be fixed re: the
> problems I was having with random screensaver triggering and keyboard
> repeats?
>
> It is apparently not fixed.
>
> I just had a short burst of key repeats and saw one random screen
> blank. Right now everything seems normal but I was not hallucinating
> :-)

is this on the dual-core X2 box, running 32-bit code? Did it happen with
idle=poll? Without idle=poll the TSCs run apart and a number of
artifacts may happen. With idle=poll specified the TSC _should_ be fully
synchronized.

To make sure could you run the attached time-warp-test utility i wrote
today? Compile it with:

gcc -Wall -O2 -o time-warp-test time-warp-test.c

it detects and reports time-warps (and does a maximum search for them
over time, that way you can see systematic drifts too). (It auto-detects
the # of CPUs and runs the appropriate number of tasks.)

running this tool on a X2 with idle=poll and an -rt kernel should give a
silent test-output.

running a vanilla kernel should give TSC level time warps:

#CPUs: 2
running 2 tasks to check for time-warps.
warp .. -1 cycles, ... 00000277ed9520c6 -> 00000277ed9520c5 ?
warp .. -18 cycles, ... 00000277ed97ac77 -> 00000277ed97ac65 ?
warp .. -19 cycles, ... 00000277edaedd54 -> 00000277edaedd41 ?
warp .. -84 cycles, ... 00000277ede0558a -> 00000277ede05536 ?
warp .. -97 cycles, ... 00000278035328a5 -> 0000027803532844 ?
warp .. -224 cycles, ... 000002781ed2db04 -> 000002781ed2da24 ?

(because the vanilla kernel doesn't do TSC synchronization accurately)

running it without idle=poll should give some really big time warps:

neptune:~> ./time-warp-test
#CPUs: 2
running 2 tasks to check for time-warps.
warp .. -435934 cycles, ... 00000101a2db4a8f -> 00000101a2d4a3b1 ?
WARP .. -123 usecs, .... 0003e96c2f3bb579 -> 0003e96c2f3bb4fe ?
WARP .. -198 usecs, .... 0003e96c2f3bb625 -> 0003e96c2f3bb55f ?
WARP .. -199 usecs, .... 0003e96c2f3bb659 -> 0003e96c2f3bb592 ?
warp .. -436117 cycles, ... 00000101a2e5aaf0 -> 00000101a2df035b ?
warp .. -437143 cycles, ... 00000101a2e84590 -> 00000101a2e199f9 ?
warp .. -437314 cycles, ... 00000101a2ead1b1 -> 00000101a2e4256f ?
warp .. -437363 cycles, ... 00000101a2ed9b19 -> 00000101a2e6eea6 ?
WARP .. -1951680 usecs, .... 0003e96c2f597f70 -> 0003e96c2f3bb7b0 ?
WARP .. -1951879 usecs, .... 0003e96c2f598016 -> 0003e96c2f3bb78f ?
WARP .. -1951681 usecs, .... 0003e96c2f598014 -> 0003e96c2f3bb853 ?
warp .. -437365 cycles, ... 00000101a4c5be7b -> 00000101a4bf1206 ?
warp .. -437366 cycles, ... 00000101a8f4af76 -> 00000101a8ee0300 ?
warp .. -437367 cycles, ... 00000101a968a34a -> 00000101a961f6d3 ?

these time warps will get worse over time - as the two cores drift
apart. (note that they won't drift during the test itself, because the
test makes all cores artificially busy and the X2 TSC drifting depends
on the core being idle)

but in any case, -rt13 should be silent and there should be no time
warps. If there are any then those could cause the keyboard repeat
problems.

Ingo

-------{ CUT HERE time-warp-test.c }-------------->
/*
 * Copyright (C) 2005, Ingo Molnar
 *
 * time-warp-test.c: check TSC synchronity on x86 CPUs. Also detects
 * gettimeofday()-level time warps.
 */
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <linux/unistd.h>
#include <unistd.h>
#include <string.h>
#include <pwd.h>
#include <grp.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <regex.h>
#include <fcntl.h>
#include <time.h>
#include <sys/mman.h>
#include <dlfcn.h>
#include <popt.h>
#include <sys/socket.h>
#include <ctype.h>
#include <assert.h>
#include <sched.h>

#define TEST_TSC
#define TEST_TOD

#define MAX_TASKS 128

#if DEBUG
# define Printf(x...) printf(x)
#else
# define Printf(x...) do { } while (0)
#endif

enum {
	SHARED_TSC = 0,
	SHARED_LOCK = 2,
	SHARED_TOD = 3,
	SHARED_WORST_TSC = 5,
	SHARED_WORST_TOD = 7,
	SHARED_LOCK2 = 200,
};

#define BUG_ON(c) assert(!(c))

typedef unsigned long long cycles_t;
typedef unsigned long long usecs_t;

#define rdtscll(val) \
	__asm__ __volatile__("rdtsc" : "=A" (val))

#define rdtod(val) \
do { \
	struct timeval tv; \
	\
	gettimeofday(&tv, NULL); \
	(val) = tv.tv_sec * 1000000LL + tv.tv_usec; \
} while (0)

#define mb() \
	__asm__ __volatile__("lock; addl $0, (%esp)")

static unsigned long *setup_shared_var(void)
{
	char zerobuff [4096] = { 0, };
	int ret, fd;
	unsigned long *buf;

	fd = creat(".tmp_mmap", 0700);
	BUG_ON(fd == -1);
	close(fd);

	fd = open(".tmp_mmap", O_RDWR|O_CREAT|O_TRUNC);
	BUG_ON(fd == -1);
	ret = write(fd, zerobuff, 4096);
	BUG_ON(ret != 4096);

	buf = (void *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	BUG_ON(buf == (void *)-1);

	close(fd);

	return buf;
}

#define LOOPS 1000000

static inline unsigned long
cmpxchg(volatile unsigned long *ptr, unsigned long old, unsigned long new)
{
	unsigned long prev;

	__asm__ __volatile__("lock; cmpxchg %b1,%2"
			     : "=a"(prev)
			     : "q"(new), "m"(*(ptr)), "0"(old)
			     : "memory");
	return prev;
}

static inline void lock(unsigned long *flag)
{
	while (cmpxchg(flag, 0, 1) != 0)
		/* nothing */;
}

static inline void unlock(unsigned long *flag)
{
	*flag = 0;
	mb();
}

static void print_status(void)
{
	const char progress[] = "\\|/-";
	static usecs_t prev_tod;
	static int count;
	usecs_t tod;

	rdtod(tod);
	if (tod - prev_tod < 100000ULL)
		return;
	prev_tod = tod;
	count++;
	printf("%c\r", progress[count & 3]);
	fflush(stdout);
}

int main(int argc, char **argv)
{
	int i, parent, me;
	unsigned long *shared;
	unsigned long cpus, tasks;

	cpus = system("exit `grep processor /proc/cpuinfo | wc -l`");
	cpus = WEXITSTATUS(cpus);

	if (argc > 2) {
usage:
		fprintf(stderr, "usage: tsc-sync-test <threads>\n");
		exit(-1);
	}
	if (argc == 2) {
		tasks = atol(argv[1]);
		if (!tasks)
			goto usage;
	} else
		tasks = cpus;

	printf("#CPUs: %ld\n", cpus);
	printf("running %ld tasks to check for time-warps.\n", tasks);
	shared = setup_shared_var();

	parent = getpid();

	for (i = 1; i < tasks; i++)
		if (!fork())
			break;
	me = getpid();

	while (1) {
		cycles_t t0, t1;
		usecs_t T0, T1;
		long long delta;

#ifdef TEST_TSC
		lock(shared + SHARED_LOCK);
		rdtscll(t1);
		t0 = *(cycles_t *)(shared + SHARED_TSC);
		*(cycles_t *)(shared + SHARED_TSC) = t1;
		unlock(shared + SHARED_LOCK);

		delta = t1-t0;
		if (delta < *(long long *)(shared + SHARED_WORST_TSC)) {
			*(long long *)(shared + SHARED_WORST_TSC) = delta;
			printf("\rwarp .. %9Ld cycles, ... %016Lx -> %016Lx ?\n",
				delta, t0, t1);
		}
		// occasionally disturb things a bit
		if (!(t0 & 7)) {
			lock(shared + SHARED_LOCK2);
			unlock(shared + SHARED_LOCK2);
		}
#endif
#ifdef TEST_TOD
		lock(shared + SHARED_LOCK);
		rdtod(T1);
		T0 = *(usecs_t *)(shared + SHARED_TOD);
		*(usecs_t *)(shared + SHARED_TOD) = T1;
		unlock(shared + SHARED_LOCK);

		delta = T1-T0;
		if (delta < *(long long *)(shared + SHARED_WORST_TOD)) {
			*(long long *)(shared + SHARED_WORST_TOD) = delta;
			printf("\rWARP .. %9Ld usecs, .... %016Lx -> %016Lx ?\n",
				delta, T0, T1);
		}
		if (!(T0 & 7)) {
			lock(shared + SHARED_LOCK2);
			unlock(shared + SHARED_LOCK2);
		}
#endif
		if (me == parent)
			print_status();
	}

	return 0;
}
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: test time-warps [was: Re: 2.6.14-rt13]
2005-11-21 22:19 ` test time-warps [was: Re: 2.6.14-rt13] Ingo Molnar
@ 2005-11-21 23:08 ` Fernando Lopez-Lezcano
2005-11-21 23:38 ` Fernando Lopez-Lezcano
` (2 subsequent siblings)
3 siblings, 0 replies; 65+ messages in thread
From: Fernando Lopez-Lezcano @ 2005-11-21 23:08 UTC (permalink / raw)
To: Ingo Molnar
Cc: nando, linux-kernel, Paul E. McKenney, K.R. Foley, Steven Rostedt,
Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

On Mon, 2005-11-21 at 23:19 +0100, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > On Tue, 2005-11-15 at 10:08 +0100, Ingo Molnar wrote:
> > > i have released the 2.6.14-rt13 tree, which can be downloaded from the
> > > usual place:
> > >
> > > http://redhat.com/~mingo/realtime-preempt/
> > >
> > > lots of fixes in this release affecting all supported architectures, all
> > > across the board. Big MIPS update from John Cooper.
> >
> > Can someone tell me if 2.6.14-rt13 is supposed to be fixed re: the
> > problems I was having with random screensaver triggering and keyboard
> > repeats?
> >
> > It is apparently not fixed.
> >
> > I just had a short burst of key repeats and saw one random screen
> > blank. Right now everything seems normal but I was not hallucinating
> > :-)
>
> is this on the dual-core X2 box, running 32-bit code?

That's correct.

> Did it happen with idle=poll?

No, I'm not running with idle=poll right now.

> Without idle=poll the TSCs run apart and a number of
> artifacts may happen. With idle=poll specified the TSC _should_ be fully
> synchronized.

Well, I could try but it is not a solution I could use. It would turn
all my machines into space heaters 24x7, no sense in doing that :-)

I got an answer off the list from John (Stultz) in response to the dmesg
output I sent him and he suggested I try idle=poll (which I briefly did
last week) and also changing:
/sys/devices/system/clocksource/clocksource0/clocksource
to acpi_pm, which I just did.

It is too early to tell re: keyboard repeats and screensaver false
triggers, but it did fix the problems I was seeing with a hacked Jack
that is using gettimeofday instead of tsc reads. Meaning, Jack with
gettimeofday + tsc timing source has problems, Jack with gettimeofday +
acpi_pm does not. It would seem gettimeofday is not working correctly
with tsc.

> To make sure could you run the attached time-warp-test utility i wrote
> today?

I will and report back.
Thanks.
-- Fernando

> Compile it with:
>
> gcc -Wall -O2 -o time-warp-test time-warp-test.c
>
> it detects and reports time-warps (and does a maximum search for them
> over time, that way you can see systematic drifts too). (It auto-detects
> the # of CPUs and runs the appropriate number of tasks.)
>
> running this tool on a X2 with idle=poll and an -rt kernel should give a
> silent test-output.
>
> running a vanilla kernel should give TSC level time warps:
>
> #CPUs: 2
> running 2 tasks to check for time-warps.
> warp .. -1 cycles, ... 00000277ed9520c6 -> 00000277ed9520c5 ?
> warp .. -18 cycles, ... 00000277ed97ac77 -> 00000277ed97ac65 ?
> warp .. -19 cycles, ... 00000277edaedd54 -> 00000277edaedd41 ?
> warp .. -84 cycles, ... 00000277ede0558a -> 00000277ede05536 ?
> warp .. -97 cycles, ... 00000278035328a5 -> 0000027803532844 ?
> warp .. -224 cycles, ... 000002781ed2db04 -> 000002781ed2da24 ?
>
> (because the vanilla kernel doesn't do TSC synchronization accurately)
>
> running it without idle=poll should give some really big time warps:
>
> neptune:~> ./time-warp-test
> #CPUs: 2
> running 2 tasks to check for time-warps.
> warp .. -435934 cycles, ... 00000101a2db4a8f -> 00000101a2d4a3b1 ?
> WARP .. -123 usecs, .... 0003e96c2f3bb579 -> 0003e96c2f3bb4fe ?
> WARP .. -198 usecs, .... 0003e96c2f3bb625 -> 0003e96c2f3bb55f ?
> WARP .. -199 usecs, .... 0003e96c2f3bb659 -> 0003e96c2f3bb592 ?
> warp .. -436117 cycles, ... 00000101a2e5aaf0 -> 00000101a2df035b ?
> warp .. -437143 cycles, ... 00000101a2e84590 -> 00000101a2e199f9 ?
> warp .. -437314 cycles, ... 00000101a2ead1b1 -> 00000101a2e4256f ?
> warp .. -437363 cycles, ... 00000101a2ed9b19 -> 00000101a2e6eea6 ?
> WARP .. -1951680 usecs, .... 0003e96c2f597f70 -> 0003e96c2f3bb7b0 ?
> WARP .. -1951879 usecs, .... 0003e96c2f598016 -> 0003e96c2f3bb78f ?
> WARP .. -1951681 usecs, .... 0003e96c2f598014 -> 0003e96c2f3bb853 ?
> warp .. -437365 cycles, ... 00000101a4c5be7b -> 00000101a4bf1206 ?
> warp .. -437366 cycles, ... 00000101a8f4af76 -> 00000101a8ee0300 ?
> warp .. -437367 cycles, ... 00000101a968a34a -> 00000101a961f6d3 ?
>
> these time warps will get worse over time - as the two cores drift
> apart. (note that they won't drift during the test itself, because the
> test makes all cores artificially busy and the X2 TSC drifting depends
> on the core being idle)
>
> but in any case, -rt13 should be silent and there should be no time
> warps. If there are any then those could cause the keyboard repeat
> problems.
>
> Ingo
>
> -------{ CUT HERE time-warp-test.c }-------------->
[MUNCH]
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: test time-warps [was: Re: 2.6.14-rt13]
2005-11-21 22:19 ` test time-warps [was: Re: 2.6.14-rt13] Ingo Molnar
2005-11-21 23:08 ` Fernando Lopez-Lezcano
@ 2005-11-21 23:38 ` Fernando Lopez-Lezcano
2005-11-21 23:41 ` john stultz
2005-11-22 1:15 ` Steven Rostedt
3 siblings, 0 replies; 65+ messages in thread
From: Fernando Lopez-Lezcano @ 2005-11-21 23:38 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Paul E. McKenney, K.R. Foley, Steven Rostedt,
Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

On Mon, 2005-11-21 at 23:19 +0100, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> > I just had a short burst of key repeats and saw one random screen
> > blank. Right now everything seems normal but I was not hallucinating
> > :-)
>
> is this on the dual-core X2 box, running 32-bit code? Did it happen with
> idle=poll? Without idle=poll the TSCs run apart and a number of
> artifacts may happen. With idle=poll specified the TSC _should_ be fully
> synchronized.
>
> To make sure could you run the attached time-warp-test utility i wrote
> today? Compile it with:
>
> gcc -Wall -O2 -o time-warp-test time-warp-test.c
>
> it detects and reports time-warps (and does a maximum search for them
> over time, that way you can see systematic drifts too). (It auto-detects
> the # of CPUs and runs the appropriate number of tasks.)

Ok, here are some test runs:

Athlon X2, 2.6.14-rt13, __not__ booting idle=poll

cat /sys/devices/system/clocksource/clocksource0/clocksource
acpi_pm jiffies *tsc pit
[hacked Jack with gettimeofday fails with "delay exceeded..." messages]

# ./time-warp-test
#CPUs: 2
running 2 tasks to check for time-warps.
warp .. -2735313 cycles, ... 000014b9f770036f -> 000014b9f746469e ?
WARP .. -1224 usecs, .... 0004061b6acd7dc6 -> 0004061b6acd78fe ?
WARP .. -1237 usecs, .... 0004061b6acd7e07 -> 0004061b6acd7932 ?
warp .. -2735317 cycles, ... 000014b9f7773a97 -> 000014b9f74d7dc2 ?
WARP .. -1238 usecs, .... 0004061b6acd7e65 -> 0004061b6acd798f ?
warp .. -2736775 cycles, ... 000014b9f77a9bd0 -> 000014b9f750d949 ?
warp .. -2736848 cycles, ... 000014b9f77c83aa -> 000014b9f752c0da ?
warp .. -2736953 cycles, ... 000014b9f77e82a6 -> 000014b9f754bf6d ?
warp .. -2737060 cycles, ... 000014b9f7831875 -> 000014b9f75954d1 ?
warp .. -2737090 cycles, ... 000014b9f792d70b -> 000014b9f7691349 ?
warp .. -2737265 cycles, ... 000014b9f79c9509 -> 000014b9f772d098 ?
warp .. -2737387 cycles, ... 000014ba0129c8e7 -> 000014ba010003fc ?
warp .. -2737405 cycles, ... 000014ba0b696ad1 -> 000014ba0b3fa5d4 ?
WARP .. -4398045268 usecs, .... 0004061c70fdbd6e -> 0004061b6ad8e51a ?
WARP .. -4398045269 usecs, .... 0004061c70fdbe56 -> 0004061b6ad8e601 ?
warp .. -2737407 cycles, ... 000014c0f4960dfd -> 000014c0f46c48fe ?
warp .. -2737435 cycles, ... 000014c100f929b5 -> 000014c100cf649a ?
warp .. -2737450 cycles, ... 000014ef1eff0250 -> 000014ef1ed53d26 ?
warp .. -2737470 cycles, ... 000014ef2a976748 -> 000014ef2a6da20a ?
warp .. -2737472 cycles, ... 000014ef98ee8f62 -> 000014ef98c4ca22 ?
warp .. -2737494 cycles, ... 000014efac5b0d44 -> 000014efac3147ee ?
warp .. -2737506 cycles, ... 000014f42d48833f -> 000014f42d1ebddd ?
WARP .. -4398046507 usecs, .... 0004061c788544c5 -> 0004061b7260679a ?
warp .. -2737535 cycles, ... 000014ffb2b84ca9 -> 000014ffb28e872a ?
warp .. -2737678 cycles, ... 0000150b8cae9ad3 -> 0000150b8c84d4c5 ?
warp .. -2737847 cycles, ... 0000153e388bc05d -> 0000153e3861f9a6 ?
warp .. -2737851 cycles, ... 0000153e3b472185 -> 0000153e3b1d5aca ?
warp .. -2737871 cycles, ... 0000153e3b94270d -> 0000153e3b6a603e ?
warp .. -2737872 cycles, ... 0000153e3c3d4034 -> 0000153e3c137964 ?
warp .. -2737891 cycles, ... 0000153e51313527 -> 0000153e51076e44 ?
warp .. -2737935 cycles, ... 0000153e55df386a -> 0000153e55b5715b ?
warp .. -2737987 cycles, ... 0000153ec3280132 -> 0000153ec2fe39ef ?
warp .. -2738044 cycles, ... 00001542b6d5c7bd -> 00001542b6ac0041 ?
warp .. -2738056 cycles, ... 0000154332e5f8dd -> 0000154332bc3155 ?
warp .. -2738059 cycles, ... 000015433aa0e85b -> 000015433a7720d0 ?
warp .. -2738087 cycles, ... 0000154363eb9eb5 -> 0000154363c1d70e ?
warp .. -2738100 cycles, ... 00001547a3407554 -> 00001547a316ada0 ?
warp .. -2738101 cycles, ... 00001547a342315e -> 00001547a31869a9 ?
warp .. -2738131 cycles, ... 00001547a36dca74 -> 00001547a34402a1 ?
warp .. -2738251 cycles, ... 00001547a67672fd -> 00001547a64caab2 ?
warp .. -2738253 cycles, ... 0000154811d20a22 -> 0000154811a841d5 ?
warp .. -2738261 cycles, ... 00001548bd4fe888 -> 00001548bd262033 ?
warp .. -2738270 cycles, ... 00001549e8ba9459 -> 00001549e890cbfb ?
warp .. -2738284 cycles, ... 0000154bca42c59f -> 0000154bca18fd33 ?
warp .. -2738287 cycles, ... 0000154c15d10b04 -> 0000154c15a74295 ?
warp .. -2738393 cycles, ... 00001559054f8a3b -> 000015590525c162 ?
warp .. -2738445 cycles, ... 00001559055cd294 -> 0000155905330987 ?
warp .. -2738462 cycles, ... 00001559057d79e3 -> 000015590553b0c5 ?
warp .. -2738482 cycles, ... 00001559221f9b08 -> 0000155921f5d1d6 ?
warp .. -2738486 cycles, ... 000015593f6a2298 -> 000015593f405962 ?
warp .. -2738602 cycles, ... 000015594da97b42 -> 000015594d7fb198 ?
warp .. -2738607 cycles, ... 0000155a41e90e62 -> 0000155a41bf44b3 ?
warp .. -2738621 cycles, ... 0000155e0f15910d -> 0000155e0eebc750 ?
warp .. -2738650 cycles, ... 0000155f746123f6 -> 0000155f74375a1c ?
warp .. -2738653 cycles, ... 000015610cbc0276 -> 000015610c923899 ?
warp .. -2738655 cycles, ... 0000156241a4f73a -> 00001562417b2d5b ?

Now with

cat /sys/devices/system/clocksource/clocksource0/clocksource
*acpi_pm jiffies tsc pit
[hacked Jack with gettimeofday works fine]

# ./time-warp-test
#CPUs: 2
running 2 tasks to check for time-warps.
warp .. -2709892 cycles, ... 000015870e3c5333 -> 000015870e12f9af ?
warp .. -2709931 cycles, ... 000015870e611d33 -> 000015870e37c388 ?
warp .. -2714592 cycles, ... 000015871b20ef38 -> 000015871af78358 ?
warp .. -2714599 cycles, ... 0000158727b08141 -> 000015872787155a ?
warp .. -2714610 cycles, ... 00001587341f8c9c -> 0000158733f620aa ?
warp .. -2714611 cycles, ... 0000158740a746a4 -> 00001587407ddab1 ?
warp .. -2714632 cycles, ... 000015874d202559 -> 000015874cf6b951 ?
warp .. -2714672 cycles, ... 000015875aa36481 -> 000015875a79f851 ?
warp .. -2714674 cycles, ... 000015876eabae9b -> 000015876e824269 ?
warp .. -2714676 cycles, ... 0000158c00b9eec1 -> 0000158c0090828d ?
warp .. -2714851 cycles, ... 000015a87d87fdf7 -> 000015a87d5e9114 ?
warp .. -2714868 cycles, ... 000015a91f8611c6 -> 000015a91f5ca4d2 ?
warp .. -2714900 cycles, ... 000015d4abcac875 -> 000015d4aba15b61 ?
warp .. -2714932 cycles, ... 000016722ed1bafe -> 000016722ea84dca ?
warp .. -2714933 cycles, ... 000016722edb5d24 -> 000016722eb1efef ?
warp .. -2714960 cycles, ... 000016722edf16d0 -> 000016722eb5a980 ?
warp .. -2715093 cycles, ... 0000167241711403 -> 000016724147a62e ?
warp .. -2715369 cycles, ... 0000167254f44d20 -> 0000167254cade37 ?
warp .. -2715372 cycles, ... 000016727c056ff2 -> 000016727bdc0106 ?
warp .. -2715382 cycles, ... 0000167294580d33 -> 00001672942e9e3d ?
warp .. -2715386 cycles, ... 00001672acf231c5 -> 00001672acc8c2cb ?
warp .. -2715394 cycles, ... 00001672c5a30efc -> 00001672c5799ffa ?
warp .. -2715397 cycles, ... 00001672f3946ebc -> 00001672f36affb7 ?
warp .. -2715417 cycles, ... 000016733b4806b8 -> 000016733b1e979f ?
warp .. -2715464 cycles, ... 00001675810adae0 -> 0000167580e16b98 ?
warp .. -2715471 cycles, ... 0000174825657d7a -> 00001748253c0e2b ?

In both cases messages seem to come in bunches. I get 5 to 15 on startup
of the test no matter what. After that it is more sporadic.

-- Fernando
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: test time-warps [was: Re: 2.6.14-rt13]
2005-11-21 22:19 ` test time-warps [was: Re: 2.6.14-rt13] Ingo Molnar
2005-11-21 23:08 ` Fernando Lopez-Lezcano
2005-11-21 23:38 ` Fernando Lopez-Lezcano
@ 2005-11-21 23:41 ` john stultz
2005-11-22 1:31 ` Lee Revell
2005-11-22 1:15 ` Steven Rostedt
3 siblings, 1 reply; 65+ messages in thread
From: john stultz @ 2005-11-21 23:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Fernando Lopez-Lezcano, linux-kernel, Paul E. McKenney, K.R. Foley,
Steven Rostedt, Thomas Gleixner, pluto, john cooper,
Benedikt Spranger, Daniel Walker, Tom Rini, George Anzinger

On Mon, 2005-11-21 at 23:19 +0100, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > On Tue, 2005-11-15 at 10:08 +0100, Ingo Molnar wrote:
> > > i have released the 2.6.14-rt13 tree, which can be downloaded from the
> > > usual place:
> > >
> > > http://redhat.com/~mingo/realtime-preempt/
> > >
> > > lots of fixes in this release affecting all supported architectures, all
> > > across the board. Big MIPS update from John Cooper.
> >
> > Can someone tell me if 2.6.14-rt13 is supposed to be fixed re: the
> > problems I was having with random screensaver triggering and keyboard
> > repeats?
> >
> > It is apparently not fixed.
> >
> > I just had a short burst of key repeats and saw one random screen
> > blank. Right now everything seems normal but I was not hallucinating
> > :-)
>
> is this on the dual-core X2 box, running 32-bit code? Did it happen with
> idle=poll? Without idle=poll the TSCs run apart and a number of
> artifacts may happen. With idle=poll specified the TSC _should_ be fully
> synchronized.
>
> To make sure could you run the attached time-warp-test utility i wrote
> today? Compile it with:
>
> gcc -Wall -O2 -o time-warp-test time-warp-test.c
>
> it detects and reports time-warps (and does a maximum search for them
> over time, that way you can see systematic drifts too). (It auto-detects
> the # of CPUs and runs the appropriate number of tasks.)
>
> running this tool on a X2 with idle=poll and an -rt kernel should give a
> silent test-output.
>
> running a vanilla kernel should give TSC level time warps:
>
> #CPUs: 2
> running 2 tasks to check for time-warps.
> warp .. -1 cycles, ... 00000277ed9520c6 -> 00000277ed9520c5 ?
> warp .. -18 cycles, ... 00000277ed97ac77 -> 00000277ed97ac65 ?
> warp .. -19 cycles, ... 00000277edaedd54 -> 00000277edaedd41 ?
> warp .. -84 cycles, ... 00000277ede0558a -> 00000277ede05536 ?
> warp .. -97 cycles, ... 00000278035328a5 -> 0000027803532844 ?
> warp .. -224 cycles, ... 000002781ed2db04 -> 000002781ed2da24 ?
>
> (because the vanilla kernel doesn't do TSC synchronization accurately)
>
> running it without idle=poll should give some really big time warps:
>
> neptune:~> ./time-warp-test
> #CPUs: 2
> running 2 tasks to check for time-warps.
> warp .. -435934 cycles, ... 00000101a2db4a8f -> 00000101a2d4a3b1 ?
> WARP .. -123 usecs, .... 0003e96c2f3bb579 -> 0003e96c2f3bb4fe ?
> WARP .. -198 usecs, .... 0003e96c2f3bb625 -> 0003e96c2f3bb55f ?
> WARP .. -199 usecs, .... 0003e96c2f3bb659 -> 0003e96c2f3bb592 ?
> warp .. -436117 cycles, ... 00000101a2e5aaf0 -> 00000101a2df035b ?
> warp .. -437143 cycles, ... 00000101a2e84590 -> 00000101a2e199f9 ?
> warp .. -437314 cycles, ... 00000101a2ead1b1 -> 00000101a2e4256f ?
> warp .. -437363 cycles, ... 00000101a2ed9b19 -> 00000101a2e6eea6 ?
> WARP .. -1951680 usecs, .... 0003e96c2f597f70 -> 0003e96c2f3bb7b0 ?
> WARP .. -1951879 usecs, .... 0003e96c2f598016 -> 0003e96c2f3bb78f ?
> WARP .. -1951681 usecs, .... 0003e96c2f598014 -> 0003e96c2f3bb853 ?
> warp .. -437365 cycles, ... 00000101a4c5be7b -> 00000101a4bf1206 ?
> warp .. -437366 cycles, ... 00000101a8f4af76 -> 00000101a8ee0300 ?
> warp .. -437367 cycles, ... 00000101a968a34a -> 00000101a961f6d3 ?
>
> these time warps will get worse over time - as the two cores drift
> apart. (note that they won't drift during the test itself, because the
> test makes all cores artificially busy and the X2 TSC drifting depends
> on the core being idle)

I believe this is the same dual-core TSC drift that has been seen w/
x86-64. I have just added some similar logic to the TSC clocksource
that mimics what x86-64 does so an alternative clocksource will be
selected automatically.

I should be sending out another release later tonight with these
updates.

thanks
-john
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: test time-warps [was: Re: 2.6.14-rt13]
2005-11-21 23:41 ` john stultz
@ 2005-11-22 1:31 ` Lee Revell
0 siblings, 0 replies; 65+ messages in thread
From: Lee Revell @ 2005-11-22 1:31 UTC (permalink / raw)
To: john stultz
Cc: Ingo Molnar, Fernando Lopez-Lezcano, linux-kernel, Paul E. McKenney,
K.R. Foley, Steven Rostedt, Thomas Gleixner, pluto, john cooper,
Benedikt Spranger, Daniel Walker, Tom Rini, George Anzinger

On Mon, 2005-11-21 at 15:41 -0800, john stultz wrote:
> I believe this is the same dual-core TSC drift that has been seen w/
> x86-64. I have just added some similar logic to the TSC clocksource
> that mimics what x86-64 does so an alternative clocksource will be
> selected automatically.
>
> I should be sending out another release later tonight with these
> updates.
>

It is really unfortunate that the TSC cannot be used for timekeeping on
these machines. I wrote a simple benchmark that shows rdtsc on
Fernando's box to be insanely fast - 10000 iterations in 68
microseconds. This was an order of magnitude faster than any other
machine we tested.

Why would they bother making it so fast if it's useless for timekeeping?

Lee
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: test time-warps [was: Re: 2.6.14-rt13]
2005-11-21 22:19 ` test time-warps [was: Re: 2.6.14-rt13] Ingo Molnar
` (2 preceding siblings ...)
2005-11-21 23:41 ` john stultz
@ 2005-11-22 1:15 ` Steven Rostedt
2005-11-22 11:16 ` Ingo Molnar
3 siblings, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2005-11-22 1:15 UTC (permalink / raw)
To: Ingo Molnar
Cc: Fernando Lopez-Lezcano, linux-kernel, Paul E. McKenney, K.R. Foley,
Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

On Mon, 21 Nov 2005, Ingo Molnar wrote:
>
> but in any case, -rt13 should be silent and there should be no time
> warps. If there are any then those could cause the keyboard repeat
> problems.
>

Hi Ingo,

I'm running -rt13 with the following command line:

root=/dev/md0 ro console=ttyS0,115200 console=tty0 nmi_watchdog=2 lapic
earlyprintk=ttyS0,115200 idle=poll

I just got the following output:

$ ./time-warp-test
#CPUs: 2
running 2 tasks to check for time-warps.
warp .. -5 cycles, ... 0000004fc2ab2b7f -> 0000004fc2ab2b7a ?
warp .. -12 cycles, ... 000000506d1d558c -> 000000506d1d5580 ?
warp .. -97 cycles, ... 000000536c8868d3 -> 000000536c886872 ?
warp .. -99 cycles, ... 00000059ae9d49a1 -> 00000059ae9d493e ?
warp .. -110 cycles, ... 00000059ed0f05d6 -> 00000059ed0f0568 ?
warp .. -118 cycles, ... 0000007392963142 -> 00000073929630cc ?
warp .. -122 cycles, ... 0000007d6a94bc76 -> 0000007d6a94bbfc ?
warp .. -346 cycles, ... 0000008acf28a18e -> 0000008acf28a034 ?
warp .. -390 cycles, ... 0000008b2fc61fef -> 0000008b2fc61e69 ?

-- Steve
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: test time-warps [was: Re: 2.6.14-rt13]
2005-11-22 1:15 ` Steven Rostedt
@ 2005-11-22 11:16 ` Ingo Molnar
2005-11-22 17:49 ` Fernando Lopez-Lezcano
0 siblings, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2005-11-22 11:16 UTC (permalink / raw)
To: Steven Rostedt
Cc: Fernando Lopez-Lezcano, linux-kernel, Paul E. McKenney, K.R. Foley,
Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

* Steven Rostedt <rostedt@goodmis.org> wrote:

> Hi Ingo,
>
> I'm running -rt13 with the following command line:
>
> root=/dev/md0 ro console=ttyS0,115200 console=tty0 nmi_watchdog=2 lapic
> earlyprintk=ttyS0,115200 idle=poll
>
> I just got the following output:
>
> $ ./time-warp-test
> #CPUs: 2
> running 2 tasks to check for time-warps.
> warp .. -5 cycles, ... 0000004fc2ab2b7f -> 0000004fc2ab2b7a ?
> warp .. -12 cycles, ... 000000506d1d558c -> 000000506d1d5580 ?
> warp .. -97 cycles, ... 000000536c8868d3 -> 000000536c886872 ?
> warp .. -99 cycles, ... 00000059ae9d49a1 -> 00000059ae9d493e ?
> warp .. -110 cycles, ... 00000059ed0f05d6 -> 00000059ed0f0568 ?
> warp .. -118 cycles, ... 0000007392963142 -> 00000073929630cc ?
> warp .. -122 cycles, ... 0000007d6a94bc76 -> 0000007d6a94bbfc ?
> warp .. -346 cycles, ... 0000008acf28a18e -> 0000008acf28a034 ?
> warp .. -390 cycles, ... 0000008b2fc61fef -> 0000008b2fc61e69 ?

i've attached an updated utility below. But i too can see similar output
on an X2. A TSC-warp of 390 cycles _might_ be OK, but there are no
guarantees. It won't show up as a usecs-level (i.e. gettimeofday())
warp, because 390 cycles is still much lower than the ~2000 cycles one
microsecond takes, but it could cause problems for other TSC users.

Basically if there is an observable and provable warp in the TSC output
then it must not be used for any purpose that is not strictly
per-CPU-ified (such as userspace threads bound to a single CPU, and the
TSC never used between threads).

Ingo

---------{ time-warp-test.c }--------->
/*
 * Copyright (C) 2005, Ingo Molnar
 *
 * time-warp-test.c: check TSC synchronity on x86 CPUs. Also detects
 * gettimeofday()-level time warps.
 */
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <linux/unistd.h>
#include <unistd.h>
#include <string.h>
#include <pwd.h>
#include <grp.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <regex.h>
#include <fcntl.h>
#include <time.h>
#include <sys/mman.h>
#include <dlfcn.h>
#include <popt.h>
#include <sys/socket.h>
#include <ctype.h>
#include <assert.h>
#include <sched.h>

#define TEST_TSC 1
#define TEST_TOD 1

#if !TEST_TSC && !TEST_TOD
# error this makes no sense ...
#endif

#if DEBUG
# define Printf(x...) printf(x)
#else
# define Printf(x...) do { } while (0)
#endif

/*
 * Shared locks and variables between the test tasks:
 */
enum {
	SHARED_TSC = 0,
	SHARED_LOCK = 2,
	SHARED_TOD = 3,
	SHARED_WORST_TSC = 5,
	SHARED_WORST_TOD = 7,
	SHARED_NR_TSC_WARPS = 9,
	SHARED_NR_TOD_WARPS = 10,
};

#define SHARED(x) (*(shared + SHARED_##x))
#define SHARED_LL(x) (*(long long *)(shared + SHARED_##x))

#define BUG_ON(c) assert(!(c))

typedef unsigned long long cycles_t;
typedef unsigned long long usecs_t;

#define rdtscll(val) \
do { \
	__asm__ __volatile__("rdtsc" : "=A" (val)); \
} while (0)

#define rdtod(val) \
do { \
	struct timeval tv; \
 \
	gettimeofday(&tv, NULL); \
	(val) = tv.tv_sec * 1000000LL + tv.tv_usec; \
} while (0)

static unsigned long *setup_shared_var(void)
{
	char zerobuff [4096] = { 0, };
	int ret, fd;
	unsigned long *buf;

	fd = creat(".tmp_mmap", 0700);
	BUG_ON(fd == -1);
	close(fd);

	fd = open(".tmp_mmap", O_RDWR|O_CREAT|O_TRUNC);
	BUG_ON(fd == -1);
	ret = write(fd, zerobuff, 4096);
	BUG_ON(ret != 4096);

	buf = (void *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	BUG_ON(buf == (void *)-1);

	close(fd);

	return buf;
}

#define LOOPS 1000000

static inline unsigned long
cmpxchg(volatile unsigned long *ptr, unsigned long old, unsigned long new)
{
	unsigned long prev;

	__asm__ __volatile__("lock; cmpxchg %b1,%2"
			     : "=a"(prev)
			     : "q"(new), "m"(*(ptr)), "0"(old)
			     : "memory");
	return prev;
}

static inline void lock(unsigned long *flag)
{
	while (cmpxchg(flag, 0, 1) != 0)
		/* nothing */;
}

static inline void unlock(unsigned long *flag)
{
	*flag = 0;
}

static void print_status(unsigned long *shared)
{
	const char progress[] = "\\|/-";
	static usecs_t prev_tod;
	static int count1, count2;
	usecs_t tod;

	count1++;
	if (count1 < 1000)
		return;
	count1 = 0;

	rdtod(tod);
	if (tod - prev_tod < 100000ULL)
		return;
	prev_tod = tod;
	count2++;

	if (TEST_TSC)
		printf("| # of TSC-warps:%ld", SHARED(NR_TSC_WARPS));
	if (TEST_TOD)
		printf(" | # of TOD-warps:%ld", SHARED(NR_TOD_WARPS));
	printf(" %c\r", progress[count2 & 3]);
	fflush(stdout);
}

static inline void test_TSC(unsigned long *shared)
{
#if TEST_TSC
	cycles_t t0, t1;
	long long delta;

	lock(&SHARED(LOCK));
	rdtscll(t1);
	t0 = SHARED_LL(TSC);
	SHARED_LL(TSC) = t1;

	delta = t1-t0;
	if (delta < 0) {
		SHARED(NR_TSC_WARPS)++;
		if (delta < SHARED_LL(WORST_TSC)) {
			SHARED_LL(WORST_TSC) = delta;
			fprintf(stderr, "\rnew TSC-warp maximum: %9Ld cycles, %016Lx -> %016Lx\n",
				delta, t0, t1);
		}
	}
	unlock(&SHARED(LOCK));
#endif
}

static inline void test_TOD(unsigned long *shared)
{
#if TEST_TOD
	usecs_t T0, T1;
	long long delta;

	lock(&SHARED(LOCK));
	rdtod(T1);
	T0 = SHARED_LL(TOD);
	SHARED_LL(TOD) = T1;

	delta = T1-T0;
	if (delta < 0) {
		SHARED(NR_TOD_WARPS)++;
		if (delta < SHARED_LL(WORST_TOD)) {
			SHARED_LL(WORST_TOD) = delta;
			fprintf(stderr, "\rnew TOD-warp maximum: %9Ld usecs, %016Lx -> %016Lx\n",
				delta, T0, T1);
		}
	}
	unlock(&SHARED(LOCK));
#endif
}

int main(int argc, char **argv)
{
	int i, parent, me;
	unsigned long *shared;
	unsigned long cpus, tasks;

	cpus = system("exit `grep processor /proc/cpuinfo | wc -l`");
	cpus = WEXITSTATUS(cpus);

	if (argc > 2) {
usage:
		fprintf(stderr, "usage: tsc-sync-test <threads>\n");
		exit(-1);
	}
	if (argc == 2) {
		tasks = atol(argv[1]);
		if (!tasks)
			goto usage;
	} else
		tasks = cpus;

	printf("%ld CPUs, running %ld parallel test-tasks.\n", cpus, tasks);
	printf("checking for time-warps via:\n"
#if TEST_TSC
	"- read time stamp counter (RDTSC) instruction (cycle resolution)\n"
#endif
#if TEST_TOD
	"- gettimeofday (TOD) syscall (usec resolution)\n"
#endif
	"\n"
	);
	shared = setup_shared_var();

	parent = getpid();

	for (i = 1; i < tasks; i++)
		if (!fork())
			break;
	me = getpid();

	while (1) {
		test_TSC(shared);
		test_TOD(shared);
		if (me == parent)
			print_status(shared);
	}

	return 0;
}

^ permalink raw reply [flat|nested] 65+ messages in thread
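The arithmetic behind the warp reports in this thread can be sketched in a few lines. This is only an illustration, not part of the posted utility: the helper names tsc_delta() and warp_usecs() are made up here, and the cycles_per_usec parameter assumes the ~2000 cycles per microsecond figure quoted in the message, which depends on the CPU clock.

```c
#include <stdint.h>

/*
 * Signed distance between two TSC reads taken under a lock.
 * A negative result means the second read is "earlier" than the
 * first, i.e. a warp. (Hypothetical helper, not from the utility.)
 */
static int64_t tsc_delta(uint64_t prev, uint64_t now)
{
	return (int64_t)(now - prev);
}

/*
 * Whole microseconds covered by a warp of `cycles` cycles.
 * cycles_per_usec is an assumption supplied by the caller
 * (about 2000 on a 2 GHz CPU).
 */
static int64_t warp_usecs(int64_t cycles, int64_t cycles_per_usec)
{
	if (cycles < 0)
		cycles = -cycles;
	return cycles / cycles_per_usec;
}
```

With the values reported in the thread, tsc_delta(0x8b2fc61fef, 0x8b2fc61e69) gives -390, and warp_usecs(-390, 2000) gives 0, which is why such a warp would not be visible at gettimeofday() resolution while still being visible to direct TSC readers.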
* Re: test time-warps [was: Re: 2.6.14-rt13]
2005-11-22 11:16 ` Ingo Molnar
@ 2005-11-22 17:49 ` Fernando Lopez-Lezcano
2005-11-22 18:01 ` Christopher Friesen
0 siblings, 1 reply; 65+ messages in thread
From: Fernando Lopez-Lezcano @ 2005-11-22 17:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, linux-kernel, Paul E. McKenney, K.R. Foley,
Thomas Gleixner, pluto, john cooper, Benedikt Spranger, Daniel Walker,
Tom Rini, George Anzinger

On Tue, 2005-11-22 at 12:16 +0100, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > Hi Ingo,
> >
> > I'm running -rt13 with the following command line:
> >
> > root=/dev/md0 ro console=ttyS0,115200 console=tty0 nmi_watchdog=2 lapic
> > earlyprintk=ttyS0,115200 idle=poll
> >
> > I just got the following output:
> >
> > $ ./time-warp-test
> > #CPUs: 2
> > running 2 tasks to check for time-warps.
> > warp .. -5 cycles, ... 0000004fc2ab2b7f -> 0000004fc2ab2b7a ?
> > warp .. -12 cycles, ... 000000506d1d558c -> 000000506d1d5580 ?
> > warp .. -97 cycles, ... 000000536c8868d3 -> 000000536c886872 ?
> > warp .. -99 cycles, ... 00000059ae9d49a1 -> 00000059ae9d493e ?
> > warp .. -110 cycles, ... 00000059ed0f05d6 -> 00000059ed0f0568 ?
> > warp .. -118 cycles, ... 0000007392963142 -> 00000073929630cc ?
> > warp .. -122 cycles, ... 0000007d6a94bc76 -> 0000007d6a94bbfc ?
> > warp .. -346 cycles, ... 0000008acf28a18e -> 0000008acf28a034 ?
> > warp .. -390 cycles, ... 0000008b2fc61fef -> 0000008b2fc61e69 ?
>
> i've attached an updated utility below.

I'm adding a run with:
echo "tsc"> /sys/devices/system/clocksource/clocksource0/clocksource
_not_ booted with idle=poll at the end of this email.

> But i too can see similar output
> on an X2. A TSC-warp of 390 cycles _might_ be OK, but there are no
> guarantees.

In my experience the amount seems to be related to how long the system
has been up. Which is to be expected if the two TSCs drift, right?

> It wont show up as a usecs-level (i.e. gettimeofday()) warp,
> because 390 cycles is still much lower than the ~2000 cycles one
> microsecond takes, but it could cause problems for other TSC users.
>
> Basically if there is an observable and provable warp in the TSC output
> then it must not be used for any purpose that is not strictly
> per-CPU-ified (such as userspace threads bound to a single CPU, and the
> TSC never used between threads).

Apparently that's the case.

John Stultz just released a new version of his patch that takes care of
not using the TSC as a time source on X2's. Hopefully that will make its
way to the -rt patches soon :-)

This would take care of the key repeat / screensaver problems (I just
saw a post yesterday on linux-audio-user about someone else on an X2
processor having the same problems), Jack will need a patch to use
gettimeofday in those cases.

Is /sys/devices/system/clocksource/clocksource0/clocksource part of the
standard kernel tree? I was thinking on using that for the Jack patch to
decide whether to use tsc or not (ie: if it is good enough for the
kernel it should be good enough for Jack).

To all involved, a big _THANKS_ for helping track this very annoying
problem!

-- Fernando

# time ./time-warp
2 CPUs, running 2 parallel test-tasks.
checking for time-warps via:
- read time stamp counter (RDTSC) instruction (cycle resolution)
- gettimeofday (TOD) syscall (usec resolution)

new TOD-warp maximum: -4398046507 usecs, 0004062bea76af5b -> 0004062ae451d230
new TSC-warp maximum: -3122849 cycles, 00009a5f725821a3 -> 00009a5f72287b02
new TSC-warp maximum: -3123428 cycles, 00009a5f725b26a8 -> 00009a5f722b7dc4
new TSC-warp maximum: -3123690 cycles, 00009a60ccc01765 -> 00009a60cc906d7b
new TSC-warp maximum: -3123793 cycles, 00009a61a5897c78 -> 00009a61a559d227
new TSC-warp maximum: -3123965 cycles, 00009a68b7481924 -> 00009a68b7186e27
new TSC-warp maximum: -3123966 cycles, 00009a68b754b37b -> 00009a68b725087d
new TSC-warp maximum: -3124141 cycles, 00009a68c003e8ee -> 00009a68bfd43d41
new TSC-warp maximum: -3124253 cycles, 00009a68c8b511d9 -> 00009a68c88565bc
new TSC-warp maximum: -3124268 cycles, 00009a68d2bcaaad -> 00009a68d28cfe81
new TSC-warp maximum: -3124269 cycles, 00009a68eedd440e -> 00009a68eead97e1
new TSC-warp maximum: -3124280 cycles, 00009a68eefefe95 -> 00009a68eecf525d
new TSC-warp maximum: -3124342 cycles, 00009a6907369ac7 -> 00009a690706ee51
new TSC-warp maximum: -3124592 cycles, 00009a69147b7019 -> 00009a69144bc2a9
new TSC-warp maximum: -3124609 cycles, 00009a69aa0dd745 -> 00009a69a9de29c4
new TSC-warp maximum: -3124637 cycles, 00009a69df64a2ff -> 00009a69df34f562
new TSC-warp maximum: -3124652 cycles, 00009a6a649d4a10 -> 00009a6a646d9c64
new TSC-warp maximum: -3124663 cycles, 00009a6ad73c29e2 -> 00009a6ad70c7c2b
new TSC-warp maximum: -3124699 cycles, 00009af351a28fbb -> 00009af35172e1e0
| # of TSC-warps:185478076 | # of TOD-warps:185477650 \

real 6m58.633s
user 5m17.436s
sys 1m27.135s

^ permalink raw reply [flat|nested] 65+ messages in thread
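The idea of letting an application follow the kernel's choice of clocksource could be sketched roughly as below. This is a minimal sketch under assumptions: clocksource_is_tsc() is a made-up helper name, the file is assumed to contain the active clocksource name as its first word, and the path is passed in by the caller because it differs between kernels (the message above uses .../clocksource0/clocksource from the patch under discussion; later mainline kernels expose the same information as .../clocksource0/current_clocksource).

```c
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if the first word in the file at `path` is "tsc",
 * 0 if it names some other clocksource, and -1 if the file
 * cannot be opened or read. (Hypothetical helper.)
 */
static int clocksource_is_tsc(const char *path)
{
	char name[64];
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	/* Read one whitespace-delimited word, bounded to the buffer size. */
	ok = (fscanf(f, "%63s", name) == 1);
	fclose(f);
	if (!ok)
		return -1;
	return strcmp(name, "tsc") == 0;
}
```

A caller following the reasoning above would read the TSC directly only when this returns 1, and fall back to gettimeofday() on 0 or -1.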
* Re: test time-warps [was: Re: 2.6.14-rt13]
2005-11-22 17:49 ` Fernando Lopez-Lezcano
@ 2005-11-22 18:01 ` Christopher Friesen
2005-11-22 18:22 ` Steven Rostedt
0 siblings, 1 reply; 65+ messages in thread
From: Christopher Friesen @ 2005-11-22 18:01 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Ingo Molnar, Steven Rostedt, linux-kernel, Paul E. McKenney,
K.R. Foley, Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

Fernando Lopez-Lezcano wrote:

>>Basically if there is an observable and provable warp in the TSC output
>>then it must not be used for any purpose that is not strictly
>>per-CPU-ified (such as userspace threads bound to a single CPU, and the
>>TSC never used between threads).

> Apparently that's the case.

What about periodically re-syncing the TSCs on the cpus? Are they
writeable?

Chris

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: test time-warps [was: Re: 2.6.14-rt13]
2005-11-22 18:01 ` Christopher Friesen
@ 2005-11-22 18:22 ` Steven Rostedt
2005-11-22 20:52 ` Ingo Molnar
0 siblings, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2005-11-22 18:22 UTC (permalink / raw)
To: Christopher Friesen
Cc: Fernando Lopez-Lezcano, Ingo Molnar, linux-kernel, Paul E. McKenney,
K.R. Foley, Thomas Gleixner, pluto, john cooper, Benedikt Spranger,
Daniel Walker, Tom Rini, George Anzinger

On Tue, 22 Nov 2005, Christopher Friesen wrote:

> Fernando Lopez-Lezcano wrote:
>
> >>Basically if there is an observable and provable warp in the TSC output
> >>then it must not be used for any purpose that is not strictly
> >>per-CPU-ified (such as userspace threads bound to a single CPU, and the
> >>TSC never used between threads).
>
> > Apparently that's the case.
>
> What about periodically re-syncing the TSCs on the cpus? Are they
> writeable?
>

I believe you can reset them to zero, but I don't think you can set them
to anything else. I had to do something similar a few years ago, and I
don't have the specs in front of me, so this is coming straight from
memory.

Even if you could reset them, it would be very difficult to make all
CPUs have the same counter. Not to mention that this would also screw up
all timings elsewhere when the sync happens. Remember, this would have
to work not just on 2 cpus, but 4, 8 and beyond.

-- Steve

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: test time-warps [was: Re: 2.6.14-rt13]
2005-11-22 18:22 ` Steven Rostedt
@ 2005-11-22 20:52 ` Ingo Molnar
0 siblings, 0 replies; 65+ messages in thread
From: Ingo Molnar @ 2005-11-22 20:52 UTC (permalink / raw)
To: Steven Rostedt
Cc: Christopher Friesen, Fernando Lopez-Lezcano, linux-kernel,
Paul E. McKenney, K.R. Foley, Thomas Gleixner, pluto, john cooper,
Benedikt Spranger, Daniel Walker, Tom Rini, George Anzinger,
john stultz

* Steven Rostedt <rostedt@goodmis.org> wrote:

> > > Apparently that's the case.
> >
> > What about periodically re-syncing the TSCs on the cpus? Are they
> > writeable?
>
> I believe you can reset them to zero, but I don't think you can set
> them to anything else. I had to do something similar a few years ago,
> and I don't have the specs in front of me, so this is coming straight
> from memory.

on a reasonably new CPU you ought to be able to set the 64-bit value -
but that doesnt change the fundamental fact: we have no idea how much
time has passed while we were in HLT. Especially with things like
dyntick/noidle we could spend _alot_ of time in HLT, and the TSC could
drift significantly. How do we know how much that is?

Ingo

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: 2.6.14-rt13
2005-11-21 21:32 ` 2.6.14-rt13 Fernando Lopez-Lezcano
2005-11-21 21:41 ` 2.6.14-rt13 john stultz
[not found] ` <20051121221511.GA7255@elte.hu>
@ 2005-11-22 11:19 ` Ingo Molnar
2 siblings, 0 replies; 65+ messages in thread
From: Ingo Molnar @ 2005-11-22 11:19 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: linux-kernel, Paul E. McKenney, K.R. Foley, Steven Rostedt,
Thomas Gleixner, pluto, john cooper, Benedikt Spranger, Daniel Walker,
Tom Rini, George Anzinger

* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:

> I just had a short burst of key repeats and saw one random screen
> blank. Right now everything seems normal but I was not hallucinating
> :-)

btw., today i have experienced a 'key repeat' event with the stock FC4
SMP kernel too, on an X2 athlon. That kernel didnt have idle=poll
specified, so gettimeofday() could time-warp in substantial ways.

so i'd say the 'key repeat' problem is almost certainly caused by TSC
"time warps" on X2's.

Ingo

^ permalink raw reply [flat|nested] 65+ messages in thread
* RE: [RFC][PATCH] Runtime switching of the idle function [take 2]
@ 2005-11-29 19:37 Brown, Len
2005-11-29 19:53 ` Andi Kleen
0 siblings, 1 reply; 65+ messages in thread
From: Brown, Len @ 2005-11-29 19:37 UTC (permalink / raw)
To: Nick Piggin, Ingo Molnar, Steven Rostedt, Andi Kleen
Cc: Andrew Morton, acpi-devel, nando, rlrevell, linux-kernel, paulmck,
kr, tglx, pluto, john.cooper, bene, dwalker, trini, george
idle=poll is a really bad way to go from a power perspective.
While it is diminishing returns to get into deeper C-states,
getting into at least C1 (HALT or MONITOR/MWAIT) is very important
on many processors.
Note that if the issue at hand is the TSC stopping in deep
ACPI C-states, that there is a flag already available to limit
how deep the C-states go. eg.
processor.max_cstate=2 will disable C3, C4 etc
You can do this at run-time by writing to
/sys/module/processor/parameters/max_cstate
I agree with Andi that we have some work to do to address
the issue directly, which is that the TSC is not reliable
under all conditions on all processors. I think we need
some modes for TSC to detect and handle the cases where it either
stops in C3 or changes speeds, vs the systems where it actually
works the way we want it to -- constant rate that never stops.
>Why not just slightly cleanup and extend (eg. to ACPI) the
>hlt_counter thingy that many architectures already have?
Hmmm, I see the floppy driver invoking hlt_counter,
but it isn't clear what the general semantics and general
users are supposed to be. Can you clue me in?
thanks,
-Len
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 19:37 [RFC][PATCH] Runtime switching of the idle function [take 2] Brown, Len
@ 2005-11-29 19:53 ` Andi Kleen
2005-11-29 20:35 ` Lee Revell
0 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2005-11-29 19:53 UTC (permalink / raw)
To: Brown, Len
Cc: Nick Piggin, Ingo Molnar, Steven Rostedt, Andi Kleen, Andrew Morton,
acpi-devel, nando, rlrevell, linux-kernel, paulmck, kr, tglx, pluto,
john.cooper, bene, dwalker, trini, george

On Tue, Nov 29, 2005 at 02:37:53PM -0500, Brown, Len wrote:
> idle=poll is a really bad way to go from a power perspective.
> While it is diminishing returns to get into deeper C-states,
> getting into at least C1 (HALT or MONITOR/MWAIT) is very important
> on many processors.
>
> Note that if the issue at hand is the TSC stopping in deep
> ACPI C-states, that there is a flag already available to limit
> how deep the C-states go. eg.

No i think they tried to work around the fact that it's not synchronized
on AMD systems - in particular it drifts slightly even on single socket
dual core A64 X2s and disabling C1 works around that.

But idle=poll is too big a hammer for this. Vojtech is working on a
solution anyways that should address this better.

> processor.max_cstate=2 will disable C3, C4 etc
> You can do this at run-time by writing to
> /sys/module/processor/parameters/max_cstate

In this case it's already C1 that's the problem, so that won't help
them.

> I agree with Andi that we have some work to do to address
> the issue directly, which is that the TSC is not reliable
> under all conditions on all processors. I think we need

We're mostly addressing it - there are problems left, but
overall it's looking good. The remaining problem is
an education issue of users to not use RDTSC directly,
but use gettimeofday/clock_gettime

One remaining use is measurements, but for that it is already dubious
(e.g. due to ticking at a possible different frequency than the CPU).
For that I want to establish the RDPMC 0 convention.

Probably need better documentation for all of this though...

> some modes for TSC to detect and handle the cases where it either
> stops in C3 or changes speeds, vs the systems where it actually
> works the way we want it to -- constant rate that never stops.
>
> >Why not just slightly cleanup and extend (eg. to ACPI) the
> >hlt_counter thingy that many architectures already have?
>
> Hmmm, I see the floppy driver invoking hlt_counter,
> but it isn't clear what the general semantics and general
> users are supposed to be. Can you clue me in?

It's an ancient hack for an ancient machine chipset bug, but AFAIK not
used/needed on anything modern. Should probably remove it from x86-64
too.

-Andi

^ permalink raw reply [flat|nested] 65+ messages in thread
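As a rough illustration of the recommendation to go through the kernel's clock interfaces rather than RDTSC, interval timing with clock_gettime() might look like the sketch below. The choice of CLOCK_MONOTONIC and the helper names timespec_diff_ns() and monotonic_now_ns() are assumptions made for this example; the message only names gettimeofday/clock_gettime in general.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Nanoseconds from *start to *end (end - start). Hypothetical helper. */
static int64_t timespec_diff_ns(const struct timespec *start,
				const struct timespec *end)
{
	return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
		+ (end->tv_nsec - start->tv_nsec);
}

/*
 * Current CLOCK_MONOTONIC reading in nanoseconds, or -1 on failure.
 * The kernel picks the underlying time source, so the caller does
 * not need to know whether the TSC is usable on this machine.
 */
static int64_t monotonic_now_ns(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
		return -1;
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

An interval is then measured as the difference of two monotonic_now_ns() readings taken around the code of interest. On older glibc versions clock_gettime() may require linking with -lrt.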
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 19:53 ` Andi Kleen
@ 2005-11-29 20:35 ` Lee Revell
2005-11-29 20:51 ` Andi Kleen
0 siblings, 1 reply; 65+ messages in thread
From: Lee Revell @ 2005-11-29 20:35 UTC (permalink / raw)
To: Andi Kleen
Cc: Brown, Len, Nick Piggin, Ingo Molnar, Steven Rostedt, Andrew Morton,
acpi-devel, nando, linux-kernel, paulmck, kr, tglx, pluto,
john.cooper, bene, dwalker, trini, george

On Tue, 2005-11-29 at 20:53 +0100, Andi Kleen wrote:
> We're mostly addressing it - there are problems left, but
> overall it's looking good. The remaining problem is
> an education issue of users to not use RDTSC directly,
> but use gettimeofday/clock_gettime

No the issue is to make gettimeofday fast enough that the people who
currently have to use the TSC can use it. Right now it's 1500-3000 nsec
or so, Vojtech mentioned that he has a patch that could reduce that to
150-300 nsec.

Lee

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 20:35 ` Lee Revell
@ 2005-11-29 20:51 ` Andi Kleen
2005-11-29 23:55 ` Lee Revell
0 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2005-11-29 20:51 UTC (permalink / raw)
To: Lee Revell
Cc: Andi Kleen, Brown, Len, Nick Piggin, Ingo Molnar, Steven Rostedt,
Andrew Morton, acpi-devel, nando, linux-kernel, paulmck, kr, tglx,
pluto, john.cooper, bene, dwalker, trini, george

On Tue, Nov 29, 2005 at 03:35:39PM -0500, Lee Revell wrote:
> On Tue, 2005-11-29 at 20:53 +0100, Andi Kleen wrote:
> > We're mostly addressing it - there are problems left, but
> > overall it's looking good. The remaining problem is
> > an education issue of users to not use RDTSC directly,
> > but use gettimeofday/clock_gettime
>
> No the issue is to make gettimeofday fast enough that the people who
> currently have to use the TSC can use it. Right now it's 1500-3000 nsec
> or so, Vojtech mentioned that he has a patch that could reduce that to

It's only that slow if the hardware can't do better.

And the kernel makes it only slow when using RDTSC directly
is unsafe - so if you use it directly thinking the kernel cheats
you for your cycles you're just shooting yourself in the own foot.

> 150-300 nsec.

If you have capable hardware it can already do much better.

-Andi

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 20:51 ` Andi Kleen
@ 2005-11-29 23:55 ` Lee Revell
2005-11-30 1:06 ` Andi Kleen
0 siblings, 1 reply; 65+ messages in thread
From: Lee Revell @ 2005-11-29 23:55 UTC (permalink / raw)
To: Andi Kleen
Cc: Brown, Len, Nick Piggin, Ingo Molnar, Steven Rostedt, Andrew Morton,
acpi-devel, nando, linux-kernel, paulmck, kr, tglx, pluto,
john.cooper, bene, dwalker, trini, george

On Tue, 2005-11-29 at 21:51 +0100, Andi Kleen wrote:
> On Tue, Nov 29, 2005 at 03:35:39PM -0500, Lee Revell wrote:
> > On Tue, 2005-11-29 at 20:53 +0100, Andi Kleen wrote:
> > > We're mostly addressing it - there are problems left, but
> > > overall it's looking good. The remaining problem is
> > > an education issue of users to not use RDTSC directly,
> > > but use gettimeofday/clock_gettime
> >
> > No the issue is to make gettimeofday fast enough that the people who
> > currently have to use the TSC can use it. Right now it's 1500-3000 nsec
> > or so, Vojtech mentioned that he has a patch that could reduce that to
>
> It's only that slow if the hardware can't do better.
>
> And the kernel makes it only slow when using RDTSC directly
> is unsafe - so if you use it directly thinking the kernel cheats
> you for your cycles you're just shooting yourself in the own foot.
>
> > 150-300 nsec.
>
> If you have capable hardware it can already do much better.
>

But on my system gettimeofday uses the TSC and it's still ~35x slower
than RDTSC:

rlrevell@mindpipe:~$ ./timetest
rdtsc: 10000 calls in 1079 usecs
gettimeofday: 10000 calls in 36628 usecs

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

typedef unsigned long long cycles_t;

#define rdtscll(val) \
	__asm__ __volatile__("rdtsc" : "=A" (val))

static inline cycles_t get_cycles_tsc (void)
{
	unsigned long long ret;

	rdtscll(ret);
	return ret;
}

static inline cycles_t get_cycles_gtod (void)
{
	struct timeval tv;

	gettimeofday (&tv, NULL);
	return tv.tv_usec;
}

int main (void)
{
	int i;
	cycles_t start_time;

	start_time= get_cycles_gtod();
	for (i = 0; i < 10000; i++) {
		get_cycles_tsc();
	}
	printf("rdtsc: %i calls in %llu usecs\n", i,
		get_cycles_gtod() - start_time);

	start_time = get_cycles_gtod();
	for (i = 0; i < 10000; i++) {
		get_cycles_gtod();
	}
	printf("gettimeofday: %i calls in %llu usecs\n", i,
		get_cycles_gtod() - start_time);

	return 0;
}

^ permalink raw reply [flat|nested] 65+ messages in thread
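One caveat with the benchmark above: get_cycles_gtod() returns only tv_usec, so a measurement that happens to cross a one-second boundary would produce a wrapped, meaningless difference. A small sketch of a timestamp that folds in tv_sec, plus the per-call arithmetic for the reported figures, is given below; timeval_to_usecs() and per_call_nsecs() are names invented for this example and are not part of the posted test.

```c
#include <stdint.h>
#include <sys/time.h>

/* Full microsecond timestamp: seconds and microseconds combined. */
static uint64_t timeval_to_usecs(const struct timeval *tv)
{
	return (uint64_t)tv->tv_sec * 1000000ULL + (uint64_t)tv->tv_usec;
}

/* Average cost of one call in nanoseconds (integer division). */
static uint64_t per_call_nsecs(uint64_t total_usecs, uint64_t calls)
{
	return total_usecs * 1000ULL / calls;
}
```

Applied to the numbers reported above, 36628 usecs over 10000 calls is about 3662 nsec per gettimeofday() call, against about 107 nsec per RDTSC read.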
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-29 23:55 ` Lee Revell
@ 2005-11-30 1:06 ` Andi Kleen
2005-11-30 1:22 ` Lee Revell
0 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2005-11-30 1:06 UTC (permalink / raw)
To: Lee Revell
Cc: Andi Kleen, Brown, Len, Nick Piggin, Ingo Molnar, Steven Rostedt,
Andrew Morton, acpi-devel, nando, linux-kernel, paulmck, kr, tglx,
pluto, john.cooper, bene, dwalker, trini, george

> But on my system gettimeofday uses the TSC and it's still ~35x slower
> than RDTSC:
>
> rlrevell@mindpipe:~$ ./timetest
> rdtsc: 10000 calls in 1079 usecs
> gettimeofday: 10000 calls in 36628 usecs

First if you run this on an Athlon 64 the measurement is likely
wrong because RDTSC can be speculated around. To get accurate
data you need to add synchronizing instructions.

Then you're likely running 32bit. It doesn't use vsyscall gettimeofday
yet, which makes it slower. 64bit would.

-Andi

^ permalink raw reply [flat|nested] 65+ messages in thread
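One common way to add a synchronizing instruction is to execute CPUID, which is architecturally serializing, immediately before RDTSC. The sketch below is an assumption about what such a read could look like with GCC-style inline assembly; read_tsc_serialized() is a made-up name, and the non-x86 branch exists only so the example still builds elsewhere. It is not taken from any patch in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

static uint64_t read_tsc_serialized(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	uint32_t eax = 0, ebx, ecx = 0, edx, lo, hi;

	/*
	 * CPUID is a serializing instruction: all earlier instructions
	 * complete before it retires, so RDTSC cannot be executed ahead
	 * of the code that precedes this call.
	 */
	__asm__ __volatile__("cpuid"
			     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
			     :
			     : "memory");
	/* RDTSC returns the low 32 bits in EAX and the high 32 bits in EDX. */
	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	(void)ebx;
	(void)edx;
	return ((uint64_t)hi << 32) | lo;
#else
	/* Not x86: fall back to a monotonic clock reading in nanoseconds. */
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
}
```

The serialization adds its own overhead to every read, so a benchmark built on it measures CPUID plus RDTSC rather than RDTSC alone; that trade-off is part of why a raw RDTSC loop looks so cheap.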
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-30 1:06 ` Andi Kleen
@ 2005-11-30 1:22 ` Lee Revell
2005-11-30 1:58 ` Andi Kleen
0 siblings, 1 reply; 65+ messages in thread
From: Lee Revell @ 2005-11-30 1:22 UTC (permalink / raw)
To: Andi Kleen
Cc: Brown, Len, Nick Piggin, Ingo Molnar, Steven Rostedt, Andrew Morton,
acpi-devel, nando, linux-kernel, paulmck, kr, tglx, pluto,
john.cooper, bene, dwalker, trini, george, Vojtech Pavlik

On Wed, 2005-11-30 at 02:06 +0100, Andi Kleen wrote:
> > But on my system gettimeofday uses the TSC and it's still ~35x slower
> > than RDTSC:
> >
> > rlrevell@mindpipe:~$ ./timetest
> > rdtsc: 10000 calls in 1079 usecs
> > gettimeofday: 10000 calls in 36628 usecs
>
> First if you run this on an Athlon 64 the measurement is likely
> wrong because RDTSC can be speculated around. To get accurate
> data you need to add synchronizing instructions.
>

OK. Just for reference here's what people on the JACK list reported:

2.6.14-rt13, PREEMPT_RT, Athlon X2 4400+ (dual core)
rdtsc: 10000 calls in 68 usecs
gettimeofday: 10000 calls in 5170 usecs

P4@3.3Ghz/HT (OpenSUSE 10.0 2.6.13-15-smp):
rdtsc: 10000 calls in 253 usecs
gettimeofday: 10000 calls in 26547 usecs

> Then you're likely running 32bit. It doesn't use vsyscall gettimeofday
> yet, which makes it slower. 64bit would.

Yes, I am. So it sounds like vsyscall gettimeofday for i386 is in the
works?

Lee

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-30 1:22 ` Lee Revell
@ 2005-11-30 1:58 ` Andi Kleen
2005-11-30 2:19 ` john stultz
0 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2005-11-30 1:58 UTC (permalink / raw)
To: Lee Revell
Cc: Andi Kleen, Brown, Len, Nick Piggin, Ingo Molnar, Steven Rostedt,
Andrew Morton, acpi-devel, nando, linux-kernel, paulmck, kr, tglx,
pluto, john.cooper, bene, dwalker, trini, george, Vojtech Pavlik,
johnstul

> > Then you're likely running 32bit. It doesn't use vsyscall gettimeofday
> > yet, which makes it slower. 64bit would.
>
> Yes, I am. So it sounds like vsyscall gettimeofday for i386 is in the
> works?

John Stultz used to have patches for it, but for some reason he never
pushed them into mainline. On i386 it unfortunately needs adding
a test and branch to the syscall path to be 100% ABI compatible, but I
doubt that was the reason he dropped it.

-Andi

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Runtime switching of the idle function [take 2]
2005-11-30 1:58 ` Andi Kleen
@ 2005-11-30 2:19 ` john stultz
0 siblings, 0 replies; 65+ messages in thread
From: john stultz @ 2005-11-30 2:19 UTC (permalink / raw)
To: Andi Kleen
Cc: Lee Revell, Brown, Len, Nick Piggin, Ingo Molnar, Steven Rostedt,
Andrew Morton, acpi-devel, nando, linux-kernel, paulmck, kr, tglx,
pluto, john.cooper, bene, dwalker, trini, george, Vojtech Pavlik

On Wed, 2005-11-30 at 02:58 +0100, Andi Kleen wrote:
> > > Then you're likely running 32bit. It doesn't use vsyscall gettimeofday
> > > yet, which makes it slower. 64bit would.
> >
> > Yes, I am. So it sounds like vsyscall gettimeofday for i386 is in the
> > works?
>
> John Stultz used to have patches for it, but for some reason he never
> pushed them into mainline.

Unfortunately it was a pretty ugly patch. Correctness issues with the
existing code have kept me focused on my timekeeping rework, however I
have kept it in mind, and I do have an i386 vsyscall gtod patch that
applies on top of my tod work. I've been maintaining it on the side
while I focus on the core code, but it is much cleaner now. For fun I'll
try to remember to send it out with the next release.

> On i386 it unfortunately needs adding
> a test and branch to the syscall path to be 100% ABI compatible, but I
> doubt that was the reason he dropped it.

Yea, I didn't know enough about the VDSO/unwind bits to get it to do the
right thing w/ glibc, so that bit was pretty hackish. I'll still need
some help on this bit to make it really something that could be
included.

thanks
-john

^ permalink raw reply [flat|nested] 65+ messages in thread
Thread overview: 65+ messages
2005-11-15 9:08 2.6.14-rt13 Ingo Molnar
2005-11-15 16:36 ` 2.6.14-rt13 Mark Knecht
2005-11-15 19:57 ` 2.6.14-rt13 Paul E. McKenney
2005-11-16 3:48 ` 2.6.14-rt13 K.R. Foley
2005-11-16 8:40 ` 2.6.14-rt13 Ingo Molnar
2005-11-16 17:02 ` 2.6.14-rt13 Paul E. McKenney
2005-11-18 18:02 ` 2.6.14-rt13 Fernando Lopez-Lezcano
2005-11-18 21:54 ` 2.6.14-rt13 Lee Revell
2005-11-18 22:05 ` 2.6.14-rt13 Fernando Lopez-Lezcano
2005-11-18 22:07 ` 2.6.14-rt13 Ingo Molnar
2005-11-18 22:15 ` 2.6.14-rt13 Lee Revell
2005-11-18 22:25 ` 2.6.14-rt13 Steven Rostedt
2005-11-18 23:36 ` 2.6.14-rt13 Fernando Lopez-Lezcano
2005-11-18 23:57 ` 2.6.14-rt13 Steven Rostedt
2005-11-18 22:41 ` 2.6.14-rt13 Fernando Lopez-Lezcano
2005-11-19 2:39 ` 2.6.14-rt13 Steven Rostedt
2005-11-24 15:07 ` 2.6.14-rt13 Ingo Molnar
2005-11-24 15:21 ` 2.6.14-rt13 Steven Rostedt
2005-11-25 20:56 ` [RFC][PATCH] Runtime switching to idle_poll (was: Re: 2.6.14-rt13) Steven Rostedt
2005-11-26 13:05 ` Ingo Molnar
2005-11-29 2:48 ` [RFC][PATCH] Runtime switching of the idle function [take 2] Steven Rostedt
2005-11-29 3:02 ` Andrew Morton
2005-11-29 3:42 ` Steven Rostedt
2005-11-29 4:01 ` Andrew Morton
2005-11-29 6:44 ` Ingo Molnar
2005-11-29 6:55 ` Nick Piggin
2005-11-29 18:05 ` Andi Kleen
2005-11-29 14:19 ` Steven Rostedt
2005-11-29 14:50 ` Andi Kleen
2005-11-29 15:42 ` Steven Rostedt
2005-12-02 1:27 ` Max Krasnyansky
2005-12-02 1:45 ` Andi Kleen
2005-12-03 2:17 ` Max Krasnyansky
2005-11-29 4:22 ` john stultz
2005-11-29 14:22 ` Steven Rostedt
2005-11-29 13:08 ` Pavel Machek
2005-12-18 15:26 ` Steven Rostedt
2005-11-18 22:13 ` 2.6.14-rt13 Lee Revell
2005-11-18 22:32 ` 2.6.14-rt13 Vojtech Pavlik
2005-11-19 2:28 ` 2.6.14-rt13 George Anzinger
2005-11-19 7:45 ` 2.6.14-rt13 Vojtech Pavlik
2005-11-19 18:27 ` 2.6.14-rt13 Lee Revell
2005-11-21 21:32 ` 2.6.14-rt13 Fernando Lopez-Lezcano
2005-11-21 21:41 ` 2.6.14-rt13 john stultz
[not found] ` <20051121221511.GA7255@elte.hu>
2005-11-21 22:19 ` test time-warps [was: Re: 2.6.14-rt13] Ingo Molnar
2005-11-21 23:08 ` Fernando Lopez-Lezcano
2005-11-21 23:38 ` Fernando Lopez-Lezcano
2005-11-21 23:41 ` john stultz
2005-11-22 1:31 ` Lee Revell
2005-11-22 1:15 ` Steven Rostedt
2005-11-22 11:16 ` Ingo Molnar
2005-11-22 17:49 ` Fernando Lopez-Lezcano
2005-11-22 18:01 ` Christopher Friesen
2005-11-22 18:22 ` Steven Rostedt
2005-11-22 20:52 ` Ingo Molnar
2005-11-22 11:19 ` 2.6.14-rt13 Ingo Molnar
-- strict thread matches above, loose matches on Subject: below --
2005-11-29 19:37 [RFC][PATCH] Runtime switching of the idle function [take 2] Brown, Len
2005-11-29 19:53 ` Andi Kleen
2005-11-29 20:35 ` Lee Revell
2005-11-29 20:51 ` Andi Kleen
2005-11-29 23:55 ` Lee Revell
2005-11-30 1:06 ` Andi Kleen
2005-11-30 1:22 ` Lee Revell
2005-11-30 1:58 ` Andi Kleen
2005-11-30 2:19 ` john stultz