* Regression in 2.6.27 caused by commit bfc0f59
@ 2008-08-31 22:54 Larry Finger
2008-09-01 11:14 ` Thomas Gleixner
0 siblings, 1 reply; 56+ messages in thread
From: Larry Finger @ 2008-08-31 22:54 UTC (permalink / raw)
To: LKML, Rafael J. Wysocki, Linus Torvalds, Alok Kataria; +Cc: Michael Buesch
An ancient laptop of mine started throwing errors from b43legacy when
I started using 2.6.27 on it. This has been bisected to commit bfc0f59
"x86: merge tsc calibration". Incidentally, I was appalled at the
number of build errors that were found while bisecting in this region.
Every commit from 2.6.26-rc9-00715 to at least -00719 had include
errors that had to be fixed before it would compile. Those fixes are
the only reason that the builds below are "dirty".
The CPU in this computer is an AMD-K6 at stepping 0c and running at
450 MHz.
The critical differences in the dmesg output between the "good" and
"bad" results indicate a factor of 2 difference in the clock speed,
and are shown below:
============================================================
finger@larrylap:/ide1/mtech/wireless-testing> diff -u /ide1/dmesg.good
/ide1/dmesg.bad
--- /ide1/dmesg.good 2008-08-31 16:23:32.000000000 -0500
+++ /ide1/dmesg.bad 2008-08-31 16:26:25.000000000 -0500
@@ -1,4 +1,4 @@
-Linux version 2.6.26-rc9-test-00715-g0ef9553-dirty (finger@larrylap)
(gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision
135036] (SUSE Linux) ) #29 Sun Aug 31 15:37:05 ACT 2008
+Linux version 2.6.26-rc9-test-00716-gbfc0f59-dirty (finger@larrylap)
(gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision
135036] (SUSE Linux) ) #31 Sun Aug 31 16:52:23 ACT 2008
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
@@ -54,7 +54,8 @@
Kernel command line: root=/dev/sda3 vga=0x314 resume=/dev/sda2
splash=silent
Initializing CPU#0
PID hash table entries: 1024 (order: 10, 4096 bytes)
-Detected 428.846 MHz processor.
+TSC calibrated against PM_TIMER
+Detected 214.401 MHz processor.
Console: colour dummy device 80x25
console [tty0] enabled
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
@@ -69,7 +70,7 @@
.text : 0xc0100000 - 0xc02638ac (1422 kB)
Checking if this processor honours the WP bit even in supervisor
mode...Ok.
CPA: page pool initialized 1 of 1 pages preallocated
-Calibrating delay loop (skipped), value calculated using timer
frequency.. <6>857.69 BogoMIPS (lpj=1715384)
+Calibrating delay loop (skipped), value calculated using timer
frequency.. <6>428.80 BogoMIPS (lpj=857604)
Security Framework initialized
Capability LSM initialized
Mount-cache hash table entries: 512
@@ -140,6 +141,7 @@
TCP reno registered
NET: Registered protocol family 1
checking if image is initramfs... it is
+Clocksource tsc unstable (delta = 500037272 ns)
Freeing initrd memory: 4071k freed
ACPI: EC: non-query interrupt received, switching to interrupt mode
ACPI: EC: GPE storm detected, disabling EC GPE
@@ -174,11 +176,10 @@
TCP cubic registered
Using IPI Shortcut mode
Freeing unused kernel memory: 184k freed
-Clocksource tsc unstable (delta = 83950402 ns)
ACPI: ACPI0007:00 is registered as cooling_device0
ACPI: Processor [CPU0] (supports 8 throttling states)
ACPI: LNXTHERM:01 is registered as thermal_zone0
-ACPI: Thermal Zone [THRM] (50 C)
+ACPI: Thermal Zone [THRM] (51 C)
No dock devices found.
SCSI subsystem initialized
libata version 3.00 loaded.
@@ -306,7 +307,10 @@
b43legacy-phy0: Loading firmware version 0x127, patch level 14
(2005-04-18 02:36:27)
b43legacy-phy0 debug: Chip initialized
b43legacy-phy0 debug: 30-bit DMA initialized
+b43legacy-phy0 ERROR: PHY transmission error
b43legacy-phy0 debug: Wireless interface started
+b43legacy-phy0 ERROR: PHY transmission error
+b43legacy-phy0 ERROR: PHY transmission error
b43legacy-phy0 debug: Adding Interface type 2
NET: Registered protocol family 17
eth1: Initial auth_alg=0
=======================================================
I am eager to test trial patches.
Thanks,
Larry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-08-31 22:54 Regression in 2.6.27 caused by commit bfc0f59 Larry Finger
@ 2008-09-01 11:14 ` Thomas Gleixner
2008-09-01 15:37 ` Larry Finger
2008-09-01 17:44 ` Larry Finger
0 siblings, 2 replies; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-01 11:14 UTC (permalink / raw)
To: Larry Finger
Cc: LKML, Rafael J. Wysocki, Linus Torvalds, Alok Kataria,
Michael Buesch
On Sun, 31 Aug 2008, Larry Finger wrote:
> An ancient laptop of mine started throwing errors from b43legacy when I
> started using 2.6.27 on it. This has been bisected to commit bfc0f59 "x86:
> merge tsc calibration". Incidentally, I was appalled at the number of build
> errors that were found while bisecting in this region. Every commit from
> 2.6.26-rc9-00715 to at least -00719 had include errors that had to be fixed
> before it would compile. Those fixes are the only reason that the builds below
> are "dirty".
>
> The CPU in this computer is an AMD-K6 at stepping 0c and running at 450 MHz.
>
> The critical differences in the dmesg output between the "good" and "bad"
> results indicate a factor of 2 difference in the clock speed, and are shown
> below:
> +Clocksource tsc unstable (delta = 500037272 ns)
> -Clocksource tsc unstable (delta = 83950402 ns)
In both cases the TSC is ahead of the pm_timer, which looks like the
pm_timer is behaving strange.
Can you please disable the pm_timer (in the kernel config,
unfortunately there is no command line option for that) for a test and
provide the relevant output of demsg ?
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 11:14 ` Thomas Gleixner
@ 2008-09-01 15:37 ` Larry Finger
2008-09-01 17:49 ` Thomas Gleixner
2008-09-01 17:44 ` Larry Finger
1 sibling, 1 reply; 56+ messages in thread
From: Larry Finger @ 2008-09-01 15:37 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Rafael J. Wysocki, Linus Torvalds, Alok Kataria,
Michael Buesch
Thomas Gleixner wrote:
>
>> +Clocksource tsc unstable (delta = 500037272 ns)
>
>> -Clocksource tsc unstable (delta = 83950402 ns)
>
> In both cases the TSC is ahead of the pm_timer, which looks like the
> pm_timer is behaving strange.
>
> Can you please disable the pm_timer (in the kernel config,
> unfortunately there is no command line option for that) for a test and
> provide the relevant output of demsg ?
I manually deleted CONFIG_X86_PM_TIMER from .config, but it returns
silently. The only way I see to turn off the pm_timer is to disable
power management. Is that what I need to do?
Larry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 11:14 ` Thomas Gleixner
2008-09-01 15:37 ` Larry Finger
@ 2008-09-01 17:44 ` Larry Finger
2008-09-01 18:31 ` Thomas Gleixner
2008-09-01 18:42 ` Linus Torvalds
1 sibling, 2 replies; 56+ messages in thread
From: Larry Finger @ 2008-09-01 17:44 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Rafael J. Wysocki, Linus Torvalds, Alok Kataria,
Michael Buesch
Thomas Gleixner wrote:
> On Sun, 31 Aug 2008, Larry Finger wrote:
>
>> An ancient laptop of mine started throwing errors from b43legacy when I
>> started using 2.6.27 on it. This has been bisected to commit bfc0f59 "x86:
>> merge tsc calibration". Incidentally, I was appalled at the number of build
>> errors that were found while bisecting in this region. Every commit from
>> 2.6.26-rc9-00715 to at least -00719 had include errors that had to be fixed
>> before it would compile. Those fixes are the only reason that the builds below
>> are "dirty".
>>
>> The CPU in this computer is an AMD-K6 at stepping 0c and running at 450 MHz.
>>
>> The critical differences in the dmesg output between the "good" and "bad"
>> results indicate a factor of 2 difference in the clock speed, and are shown
>> below:
>
>> +Clocksource tsc unstable (delta = 500037272 ns)
>
>> -Clocksource tsc unstable (delta = 83950402 ns)
>
> In both cases the TSC is ahead of the pm_timer, which looks like the
> pm_timer is behaving strange.
>
> Can you please disable the pm_timer (in the kernel config,
> unfortunately there is no command line option for that) for a test and
> provide the relevant output of demsg ?
It took a while to figure out how to kill the pm_timer. I finally did
it by changing the default to no rather than yes. I also reset the
bisection and compiled a full -rc4 kernel.
What I hope is the relevant output of dmesg is below. The clock rate
is correctly determined, and the b43legacy errors are gone.
Thanks - Larry
Linux version 2.6.27-rc4-wl-15747-g2e3bbe3-dirty (finger@larrylap)
(gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision
135036] (SUSE Linux) ) #36 Mon Sep 1 12:19:13 ACT 2008
--snip--
Kernel command line: root=/dev/sda3 vga=0x314 resume=/dev/sda2
splash=silent
Initializing CPU#0
PID hash table entries: 1024 (order: 10, 4096 bytes)
TSC calibrated against PIT
Detected 428.823 MHz processor.
Console: colour dummy device 80x25
console [tty0] enabled
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Memory: 252692k/262080k available (1442k kernel code, 8836k reserved,
613k data, 188k init, 0k highmem)
virtual kernel memory layout:
fixmap : 0xffffa000 - 0xfffff000 ( 20 kB)
vmalloc : 0xd0800000 - 0xffff8000 ( 759 MB)
lowmem : 0xc0000000 - 0xcfff0000 ( 255 MB)
.init : 0xc0306000 - 0xc0335000 ( 188 kB)
.data : 0xc0268b54 - 0xc0302208 ( 613 kB)
.text : 0xc0100000 - 0xc0268b54 (1442 kB)
Checking if this processor honours the WP bit even in supervisor
mode...Ok.
CPA: page pool initialized 1 of 1 pages preallocated
Calibrating delay loop (skipped), value calculated using timer
frequency.. 857.64 BogoMIPS (lpj=1715292)
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 15:37 ` Larry Finger
@ 2008-09-01 17:49 ` Thomas Gleixner
0 siblings, 0 replies; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-01 17:49 UTC (permalink / raw)
To: Larry Finger
Cc: LKML, Rafael J. Wysocki, Linus Torvalds, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Larry Finger wrote:
> Thomas Gleixner wrote:
> >
> > > +Clocksource tsc unstable (delta = 500037272 ns)
> >
> > > -Clocksource tsc unstable (delta = 83950402 ns)
> >
> > In both cases the TSC is ahead of the pm_timer, which looks like the
> > pm_timer is behaving strange.
> >
> > Can you please disable the pm_timer (in the kernel config,
> > unfortunately there is no command line option for that) for a test and
> > provide the relevant output of demsg ?
>
> I manually deleted CONFIG_X86_PM_TIMER from .config, but it returns silently.
> The only way I see to turn off the pm_timer is to disable power management. Is
> that what I need to do?
The switch depends on CONFIG_EMBEDDED. Just turn that on in "General Setup"
[*] Configure standard kernel features (for small systems) --->
Then you get an extra switch in:
"Powermanagement Options"
ACPI (Advanced Configuration and Power Interface) Support --->
[*] Power Management Timer Support
Turn that one off.
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 17:44 ` Larry Finger
@ 2008-09-01 18:31 ` Thomas Gleixner
2008-09-01 19:10 ` Linus Torvalds
2008-09-01 19:36 ` Larry Finger
2008-09-01 18:42 ` Linus Torvalds
1 sibling, 2 replies; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-01 18:31 UTC (permalink / raw)
To: Larry Finger
Cc: LKML, Rafael J. Wysocki, Linus Torvalds, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Larry Finger wrote:
> Thomas Gleixner wrote:
> > > The critical differences in the dmesg output between the "good" and "bad"
> > > results indicate a factor of 2 difference in the clock speed, and are
> > > shown
> > > below:
> >
> > > +Clocksource tsc unstable (delta = 500037272 ns)
> >
> > > -Clocksource tsc unstable (delta = 83950402 ns)
> >
> > In both cases the TSC is ahead of the pm_timer, which looks like the
> > pm_timer is behaving strange.
> >
> > Can you please disable the pm_timer (in the kernel config,
> > unfortunately there is no command line option for that) for a test and
> > provide the relevant output of demsg ?
>
> It took a while to figure out how to kill the pm_timer. I finally did it by
> changing the default to no rather than yes. I also reset the bisection and
> compiled a full -rc4 kernel.
>
> What I hope is the relevant output of dmesg is below. The clock rate is
> correctly determined, and the b43legacy errors are gone.
Hmm. Haven't seen that before, but if confirms what I guessed from
your previous dmesg information. I wonder why you did not observe
strange behaviour with older kernel versions. I don't mean the
b43legacy errors, which might be caused by the wrong calibrated TSC,
but those even should show strange behaviour vs. time.
Can you please provide the output of an older "working" kernel version
from:
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
after the TSC was set to unstable. It should say acpi_pm.
If that's the case please run
# time sleep 60
on a shell and provide the output and verify it against a knwn to be
halfways correct stopwatch.
Then do the same on the current mainline with pm_timer
disabled. current clocksource should be either jiffies or tsc.
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 17:44 ` Larry Finger
2008-09-01 18:31 ` Thomas Gleixner
@ 2008-09-01 18:42 ` Linus Torvalds
2008-09-01 19:08 ` Thomas Gleixner
1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2008-09-01 18:42 UTC (permalink / raw)
To: Larry Finger
Cc: Thomas Gleixner, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Larry Finger wrote:
>
> TSC calibrated against PIT
> Detected 428.823 MHz processor.
Ok, Thomas, that means that the PIT is reliable (not surprising), and the
PM_TIMER isn't (again, I'm not horribly surprised). And HPET isn't
available, of course.
The old x86-32 code never even bothered with the PM_TIMER for calibration.
I don't understand why the x86-64 code bothers with it either. Why not
just drop that whole broken thing, and just depend on the PIT if there is
no HPET?
I would also like to point out that the 32-bit code actually had a much
nicer PIT setup, using the much better documented mach_prepare_counter()
and mach_countup() helper functions. I'm unhappy to note that the new
"common" code uses what appears to be the inferior code.
Also, note that this is _not_ a new issue. See "verify_pmtmr_rate()" in
drivers/clocksource/acpi_pm.c, along with all the code to check that the
reads are stable in "init_acpi_pm_clocksource()".
IOW, the PM_TIMER has been found to be broken before. Depending on it for
calibration is broken.
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 18:42 ` Linus Torvalds
@ 2008-09-01 19:08 ` Thomas Gleixner
0 siblings, 0 replies; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-01 19:08 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Linus Torvalds wrote:
> On Mon, 1 Sep 2008, Larry Finger wrote:
> >
> > TSC calibrated against PIT
> > Detected 428.823 MHz processor.
>
> Ok, Thomas, that means that the PIT is reliable (not surprising), and the
> PM_TIMER isn't (again, I'm not horribly surprised). And HPET isn't
> available, of course.
>
> The old x86-32 code never even bothered with the PM_TIMER for calibration.
> I don't understand why the x86-64 code bothers with it either. Why not
> just drop that whole broken thing, and just depend on the PIT if there is
> no HPET?
We had horrible results on some machines with the PIT especially in
the local APIC timer calibration code. One of my own laptops behaved
stupid until I verified the APIC timer versus the PM timer. Until
recently it also had occasional stupid TSC calibration values.
This was reported by others as well and is definitely caused by SMM
taking the CPU away and blocking the PIT interrupt.
> I would also like to point out that the 32-bit code actually had a much
> nicer PIT setup, using the much better documented mach_prepare_counter()
> and mach_countup() helper functions. I'm unhappy to note that the new
> "common" code uses what appears to be the inferior code.
>
> Also, note that this is _not_ a new issue. See "verify_pmtmr_rate()" in
> drivers/clocksource/acpi_pm.c, along with all the code to check that the
> reads are stable in "init_acpi_pm_clocksource()".
Yeah, I really forgot about this one.
> IOW, the PM_TIMER has been found to be broken before. Depending on it for
> calibration is broken.
While on older machines, which might have pmtimer wreckage (I just
googled that some AMD K6 are known to have such issues), the SMM/SMI
wreckage is likely to be a non issue, while on modern systems the
SMM/SMI wreckage will bite us. So we have the choice between evil and
evil.
One way out would be to check on which CPU type we run and ignore the
pmtimer for older systems.
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 18:31 ` Thomas Gleixner
@ 2008-09-01 19:10 ` Linus Torvalds
2008-09-01 20:07 ` Thomas Gleixner
2008-09-01 19:36 ` Larry Finger
1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2008-09-01 19:10 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Thomas Gleixner wrote:
>
> Hmm. Haven't seen that before, but if confirms what I guessed from
> your previous dmesg information. I wonder why you did not observe
> strange behaviour with older kernel versions.
x86-32 never used the PM_TIMER for frequency estimation, it only ever used
the PIT. See the old "native_calculate_cpu_khz()" in tsc_32.c that you
deleted in favor of the (imho inferior) x86-64 version.
How about:
- taking the old 32-bit code, and using it to initially _just_ estimate
the TSC speed. That code was stable and pretty much guaranteed to work
reasonably well on all machines. It retries the timings three times,
and picks the best one.
- Then, _after_ you already have a pretty good estimation for TSC, you
can use _that_ to then get the HPET and/or PM_TIMER version (and not
use the PIT at all for those calibrations)
- and if the PM_TIMER one is too far off, just throw it away. We know the
PIT is a lot more trustworthy than the PM_TIMER.
Hmm?
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 18:31 ` Thomas Gleixner
2008-09-01 19:10 ` Linus Torvalds
@ 2008-09-01 19:36 ` Larry Finger
2008-09-01 20:09 ` Thomas Gleixner
1 sibling, 1 reply; 56+ messages in thread
From: Larry Finger @ 2008-09-01 19:36 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Rafael J. Wysocki, Linus Torvalds, Alok Kataria,
Michael Buesch
Thomas Gleixner wrote:
> On Mon, 1 Sep 2008, Larry Finger wrote:
>> Thomas Gleixner wrote:
>>>> The critical differences in the dmesg output between the "good" and "bad"
>>>> results indicate a factor of 2 difference in the clock speed, and are
>>>> shown
>>>> below:
>>>> +Clocksource tsc unstable (delta = 500037272 ns)
>>>> -Clocksource tsc unstable (delta = 83950402 ns)
>>> In both cases the TSC is ahead of the pm_timer, which looks like the
>>> pm_timer is behaving strange.
>>>
>>> Can you please disable the pm_timer (in the kernel config,
>>> unfortunately there is no command line option for that) for a test and
>>> provide the relevant output of demsg ?
>> It took a while to figure out how to kill the pm_timer. I finally did it by
>> changing the default to no rather than yes. I also reset the bisection and
>> compiled a full -rc4 kernel.
>>
>> What I hope is the relevant output of dmesg is below. The clock rate is
>> correctly determined, and the b43legacy errors are gone.
>
> Hmm. Haven't seen that before, but if confirms what I guessed from
> your previous dmesg information. I wonder why you did not observe
> strange behaviour with older kernel versions. I don't mean the
> b43legacy errors, which might be caused by the wrong calibrated TSC,
> but those even should show strange behaviour vs. time.
>
> Can you please provide the output of an older "working" kernel version
> from:
>
> # cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>
> after the TSC was set to unstable. It should say acpi_pm.
>
> If that's the case please run
>
> # time sleep 60
>
> on a shell and provide the output and verify it against a knwn to be
> halfways correct stopwatch.
>
> Then do the same on the current mainline with pm_timer
> disabled. current clocksource should be either jiffies or tsc.
Both the openSUSE 2.6.22 kernel and the one with the pm_timer disabled
return "pit". I don't think pm_timer had ever been used until the
commit in question.
The timed sleep is as accurate as I can measure.
I put in some test prints. The value of pm2 is zero when the else
branch of the "if (hpet)" is entered; however, pm1 is 15768471. When
we reach the do_div(tsc2, tsc1) statement, tsc2 is zero, which I think
means that the two calls to tsc_read_refs() are returning the same
junk value.
Larry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 19:10 ` Linus Torvalds
@ 2008-09-01 20:07 ` Thomas Gleixner
2008-09-01 21:30 ` Thomas Gleixner
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-01 20:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Linus Torvalds wrote:
> On Mon, 1 Sep 2008, Thomas Gleixner wrote:
> >
> > Hmm. Haven't seen that before, but if confirms what I guessed from
> > your previous dmesg information. I wonder why you did not observe
> > strange behaviour with older kernel versions.
>
> x86-32 never used the PM_TIMER for frequency estimation, it only ever used
> the PIT. See the old "native_calculate_cpu_khz()" in tsc_32.c that you
> deleted in favor of the (imho inferior) x86-64 version.
>
> How about:
>
> - taking the old 32-bit code, and using it to initially _just_ estimate
> the TSC speed. That code was stable and pretty much guaranteed to work
> reasonably well on all machines. It retries the timings three times,
> and picks the best one.
>
> - Then, _after_ you already have a pretty good estimation for TSC, you
> can use _that_ to then get the HPET and/or PM_TIMER version (and not
> use the PIT at all for those calibrations)
>
> - and if the PM_TIMER one is too far off, just throw it away. We know the
> PIT is a lot more trustworthy than the PM_TIMER.
Far off in which direction ?
If the PIT interrupts are delayed by SMM code, then I see That's on a
max. three years old 32bit Core Duo things like:
[ 0.000000] Detected 8340.258 MHz processor.
[ 13.782091] APIC calibration not consistent with PM Timer: 228ms instead of 100ms
This one is way off, while the next one is in a reasonable range
[ 0.000000] Detected 3240.001 MHz processor.
[ 13.792122] APIC calibration not consistent with PM Timer: 178ms instead of 100ms
while in reality the machine is @2GHZ and current mainline says:
[ 0.000000] Detected 2000.065 MHz processor.
The CPU calibration of < 2.6.27 is against PIT and does _NOT_ give me a
pretty good estimation for TSC.
I was pretty happy when Alok beat me to unify the TSC calibration code
as it solved one of my long standing todo items, which also filled my
buglist on a regular base.
I did debugged this thorougly using the tracer from preempt-rt to
check, what the box does during that time, and it definitely vanishes
for >100ms in a row in the black hole of the stupid BIOS.
So either way. Relying on PIT on newer machines is _BAD_, relying on
PM_TIMER on older machine is _BAD_ as well.
There is no given good estimate, when the TSC/PIT calibration is off
by factor 1.5 to 4. The consequence would be that I throw away a
perfect fine pmtimer and run a machine which advertises itself as the
fastest box on the planet. With your method I would disable nohz and I
would be back to 50% battery time.
I'm happy to discard the PIT on the 32bit machines again and then file
a bugreport for a regression between 2.6.27-rc1 and tomorrows git :)
This one is the first complaints, I've seen vs. a non working pmtimer
since quite a time. That's why I obviously forgot about the rate check
issue.
I just looked at drivers/clocksource/acpi_pm.c history and saw, that
John explicitely mentions AMD K6 in commit
562f9c574e0707f9159a729ea41faf53b221cd30
This patch re-adds the verify_pmtmr_rate functionality from 2.6.17 that
I dropped 2.6.18.
This resolves problems seen on older K6 ASUS boards where the ACPI PM
timer runs too fast.
Larry's box has: "an AMD-K6 at stepping 0c and running at 450 MHz."
The oracle of google only gave me hits for AMD-K6 in a quick survey
along with the slow access mode problem for older ICH4 chipsets.
So I think it's a reasonable thing to disable the PMTIMER based
calibration on AMD-K6 and older. I'm not sure about the exact cut line
we choose - it might be wrong as always, but it's definitely better
than adding a lot of magic into the calibration code.
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 19:36 ` Larry Finger
@ 2008-09-01 20:09 ` Thomas Gleixner
2008-09-01 20:23 ` Larry Finger
0 siblings, 1 reply; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-01 20:09 UTC (permalink / raw)
To: Larry Finger
Cc: LKML, Rafael J. Wysocki, Linus Torvalds, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Larry Finger wrote:
> The timed sleep is as accurate as I can measure.
>
> I put in some test prints. The value of pm2 is zero when the else branch of
> the "if (hpet)" is entered; however, pm1 is 15768471. When we reach the
> do_div(tsc2, tsc1) statement, tsc2 is zero, which I think means that the two
> calls to tsc_read_refs() are returning the same junk value.
Ok, so the pmtimer is probably detected later as unusable and disabled.
Please check your logs for:
"PM-Timer had inconsistent results:"
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 20:09 ` Thomas Gleixner
@ 2008-09-01 20:23 ` Larry Finger
2008-09-01 20:45 ` Thomas Gleixner
0 siblings, 1 reply; 56+ messages in thread
From: Larry Finger @ 2008-09-01 20:23 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Rafael J. Wysocki, Linus Torvalds, Alok Kataria,
Michael Buesch
Thomas Gleixner wrote:
> On Mon, 1 Sep 2008, Larry Finger wrote:
>> The timed sleep is as accurate as I can measure.
>>
>> I put in some test prints. The value of pm2 is zero when the else branch of
>> the "if (hpet)" is entered; however, pm1 is 15768471. When we reach the
>> do_div(tsc2, tsc1) statement, tsc2 is zero, which I think means that the two
>> calls to tsc_read_refs() are returning the same junk value.
>
> Ok, so the pmtimer is probably detected later as unusable and disabled.
> Please check your logs for:
> "PM-Timer had inconsistent results:"
Booting 2.6.26, the dmesg output has a line that says:
PM-Timer running at invalid rate: 200% of normal - aborting.
Amazing that it should be exactly 200%. Why is the CPU running at half
speed when the PM-Timer rate is measured?
Larry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 20:23 ` Larry Finger
@ 2008-09-01 20:45 ` Thomas Gleixner
0 siblings, 0 replies; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-01 20:45 UTC (permalink / raw)
To: Larry Finger
Cc: LKML, Rafael J. Wysocki, Linus Torvalds, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Larry Finger wrote:
> Thomas Gleixner wrote:
> > On Mon, 1 Sep 2008, Larry Finger wrote:
> > > The timed sleep is as accurate as I can measure.
> > >
> > > I put in some test prints. The value of pm2 is zero when the else branch
> > > of
> > > the "if (hpet)" is entered; however, pm1 is 15768471. When we reach the
> > > do_div(tsc2, tsc1) statement, tsc2 is zero, which I think means that the
> > > two
> > > calls to tsc_read_refs() are returning the same junk value.
> >
> > Ok, so the pmtimer is probably detected later as unusable and disabled.
> > Please check your logs for: "PM-Timer had inconsistent results:"
>
> Booting 2.6.26, the dmesg output has a line that says:
>
> PM-Timer running at invalid rate: 200% of normal - aborting.
>
> Amazing that it should be exactly 200%. Why is the CPU running at half speed
> when the PM-Timer rate is measured?
The kernel assumes that the PM timer frequency is normal, so it does:
read pm-timer start value, read TSC start value
wait for a some time
read pm-timer end value, read TSC end value
And the TSC frequency is calculated via:
TSC-End - TSC-Start
TSC-Frequency = -------------------- * PM-Frequency
PM-End - PM-Start
So if your PM-Timer runs at the double frequency for reasons only
known to the Chip Manufacturer the kernel miscalculates the TSC
frequency by factor 0.5.
Simple rule of three.
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 20:07 ` Thomas Gleixner
@ 2008-09-01 21:30 ` Thomas Gleixner
2008-09-01 22:02 ` Linus Torvalds
2008-09-01 22:16 ` Linus Torvalds
2 siblings, 0 replies; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-01 21:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Thomas Gleixner wrote:
>
> I'm happy to discard the PIT on the 32bit machines again and then file
ooops ---------------------^^^ wanted to say PMTIMER of course :)
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 20:07 ` Thomas Gleixner
2008-09-01 21:30 ` Thomas Gleixner
@ 2008-09-01 22:02 ` Linus Torvalds
2008-09-01 22:33 ` Thomas Gleixner
2008-09-01 22:16 ` Linus Torvalds
2 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2008-09-01 22:02 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Thomas Gleixner wrote:
>
> The CPU calibration of < 2.6.27 is against PIT and does _NOT_ give me a
> pretty good estimation for TSC.
Umm. Which one. The 32-bit or the 64-bit?
I already pointed out that the 64-bit PIT calibration is pure and utter
GARBAGE. And I pointed out that the 32-bit one is a lot smarter.
> I was pretty happy when Alok beat me to unify the TSC calibration code
> as it solved one of my long standing todo items, which also filled my
> buglist on a regular base.
The problem is that he unified it around the _worse_ code.
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 20:07 ` Thomas Gleixner
2008-09-01 21:30 ` Thomas Gleixner
2008-09-01 22:02 ` Linus Torvalds
@ 2008-09-01 22:16 ` Linus Torvalds
2008-09-01 23:16 ` Thomas Gleixner
2 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2008-09-01 22:16 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Thomas Gleixner wrote:
>
> If the PIT interrupts are delayed by SMM code
Btw, this sentence of yours just doesn't seem to make much sense.
The thing is, the calibration code doesn't even use interrupts. It just
reads the PIT timer value.
And the thing is, the 64-bit code really does things that the 32-bit code
does _not_ do.
Just as an example, the old 32-bit code (the thing that was deleted) tried
to actually round things correctly, while the 64-bit code does not. Look
at what the 64-bit code does:
outb((inb(0x61) & ~0x02) | 0x01, 0x61);
outb(0xb0, 0x43);
outb((CLOCK_TICK_RATE / (1000 / 50)) & 0xff, 0x42);
outb((CLOCK_TICK_RATE / (1000 / 50)) >> 8, 0x42);
tr1 = get_cycles();
while ((inb(0x61) & 0x20) == 0);
tr2 = get_cycles();
and notice three things:
- no comments (can you actually see what it does?)
- no rounding of the inevitable errors of trying to wait 1/50th of a
second
- not a single try to even account for the fact that there might be
things going on that perturb the value.
Now, look at what the 32-bit code _used_ to do. The good code. The code
that was _deleted_.
Really. I don't think you really even looked. It did:
/* run 3 times to ensure the cache is warm and to get an accurate reading */
for (i = 0; i < 3; i++) {
mach_prepare_counter();
rdtscll(start);
mach_countup(&count);
rdtscll(end);
.. ignore bad values ..
/*
* We want the minimum time of all runs in case one of them
* is inaccurate due to SMI or other delay
*/
delta64 = min(delta64, (end - start));
}
and if you actually look at those counter things, you'll see:
#define CALIBRATE_TIME_MSEC 30 /* 30 msecs */
#define CALIBRATE_LATCH \
((CLOCK_TICK_RATE * CALIBRATE_TIME_MSEC + 1000/2)/1000)
static inline void mach_prepare_counter(void)
{
/* Set the Gate high, disable speaker */
outb((inb(0x61) & ~0x02) | 0x01, 0x61);
/*
* Now let's take care of CTC channel 2
*
* Set the Gate high, program CTC channel 2 for mode 0,
* (interrupt on terminal count mode), binary count,
* load 5 * LATCH count, (LSB and MSB) to begin countdown.
*
* Some devices need a delay here.
*/
outb(0xb0, 0x43); /* binary, mode 0, LSB/MSB, Ch 2 */
outb_p(CALIBRATE_LATCH & 0xff, 0x42); /* LSB of count */
outb_p(CALIBRATE_LATCH >> 8, 0x42); /* MSB of count */
}
ie look how it actually tries to round to the nearest latch value, an how
it actually comments on what it is doing.
Now, which piece of code is better?
Honestly?
Have you tried the better version (for example, boot a 32-bit kernel
_before_ the unification on that machine to try).
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 22:02 ` Linus Torvalds
@ 2008-09-01 22:33 ` Thomas Gleixner
2008-09-01 22:56 ` Linus Torvalds
0 siblings, 1 reply; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-01 22:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Linus Torvalds wrote:
>
>
> On Mon, 1 Sep 2008, Thomas Gleixner wrote:
> >
> > The CPU calibration of < 2.6.27 is against PIT and does _NOT_ give me a
> > pretty good estimation for TSC.
>
> Umm. Which one. The 32-bit or the 64-bit?
The 32 bit version.
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 22:33 ` Thomas Gleixner
@ 2008-09-01 22:56 ` Linus Torvalds
2008-09-01 23:24 ` Thomas Gleixner
0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2008-09-01 22:56 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Tue, 2 Sep 2008, Thomas Gleixner wrote:
> >
> > Umm. Which one. The 32-bit or the 64-bit?
>
> The 32 bit version.
So then the PIT isn't going to work on that machine. Do you have HPET?
Because if so, the most obvious choice would be to say:
- on old machines, the PIT is likely more reliable than PM_TIMER
- on new machines, the HPET is likely more reliable than PIT
and there's actually a fairly obvious place to distinguish between old and
new: if ti has a HPET, consider it a new one.
But that just means that there's never any reason to use PM_TIMER anyway,
and the proper patch would probably be to just ignore it. No?
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 22:16 ` Linus Torvalds
@ 2008-09-01 23:16 ` Thomas Gleixner
2008-09-02 3:18 ` Linus Torvalds
0 siblings, 1 reply; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-01 23:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Linus Torvalds wrote:
>
>
> On Mon, 1 Sep 2008, Thomas Gleixner wrote:
> >
> > If the PIT interrupts are delayed by SMM code
>
> Btw, this sentence of yours just doesn't seem to make much sense.
>
> The thing is, the calibration code doesn't even use interrupts. It just
> reads the PIT timer value.
Sorry. I was wrong on the interrupts part. Too tired :(
> Now, look at what the 32-bit code _used_ to do. The good code. The code
> that was _deleted_.
The _good_ code which results in a 8GhZ TSC calibration on that very
_32_ bit box I have here. The CPU is 32bit only, so it never even
touched a 64 bit kernel remotely.
> Really. I don't think you really even looked. It did:
>
> /* run 3 times to ensure the cache is warm and to get an accurate reading */
> for (i = 0; i < 3; i++) {
> mach_prepare_counter();
> rdtscll(start);
> mach_countup(&count);
> rdtscll(end);
>
> .. ignore bad values ..
>
> /*
> * We want the minimum time of all runs in case one of them
> * is inaccurate due to SMI or other delay
> */
> delta64 = min(delta64, (end - start));
> }
I know that code.
> and if you actually look at those counter things, you'll see:
>
> #define CALIBRATE_TIME_MSEC 30 /* 30 msecs */
> #define CALIBRATE_LATCH \
> ((CLOCK_TICK_RATE * CALIBRATE_TIME_MSEC + 1000/2)/1000)
>
> static inline void mach_prepare_counter(void)
> {
> /* Set the Gate high, disable speaker */
> outb((inb(0x61) & ~0x02) | 0x01, 0x61);
>
> /*
> * Now let's take care of CTC channel 2
> *
> * Set the Gate high, program CTC channel 2 for mode 0,
> * (interrupt on terminal count mode), binary count,
> * load 5 * LATCH count, (LSB and MSB) to begin countdown.
> *
> * Some devices need a delay here.
> */
> outb(0xb0, 0x43); /* binary, mode 0, LSB/MSB, Ch 2 */
> outb_p(CALIBRATE_LATCH & 0xff, 0x42); /* LSB of count */
> outb_p(CALIBRATE_LATCH >> 8, 0x42); /* MSB of count */
> }
>
> ie look how it actually tries to round to the nearest latch value, an how
> it actually comments on what it is doing.
>
> Now, which piece of code is better?
>
> Honestly?
None.
start_pit_documented_magic()
read_tsc()
wait_until_pit_has_wrapped_documented_magic()
read_tsc()
is error prone versus SMI/SMM code simply due to the fact, that at any
given point between those functions the SMM/SMI can happen. Doing it
three times in a row and select the lowest one does not change much. I
tried it 10 times in a row with varying bogus results.
So at every boot I get significant different calibration values. See
below.
> Have you tried the better version (for example, boot a 32-bit kernel
> _before_ the unification on that machine to try).
The following is from a 32bit boot on that very 32bit Intel Core Duo
Laptop running 2.6.26:
[ 0.000000] Detected 8340.258 MHz processor.
next boot
[ 0.000000] Detected 3240.001 MHz processor.
next boot
[ 0.000000] Detected 2211.134 MHz processor.
I can print you the value for 100 loops if you want, but I bet that
the correctness rate will be pretty small.
Current mainline calibrated against pmtimer gives me:
[ 0.000000] Detected 2000.065 MHz processor.
next boot
[ 0.000000] Detected 2000.129 MHz processor.
next boot
[ 0.000000] Detected 1999.988 MHz processor.
which is about accurate:
[ 13.408342] CPU0: Intel Genuine Intel(R) CPU T2500 @ 2.00GHz stepping 08
We had the same problem versus the local APIC timer calibration, which
had basically the same algorithm as the TSC one and we changed it to
look at the PMTimer as well in the days where we debugged the initial
wreckage caused by the nohz/highres changes. I can dig up the archives
of LAPIC timers with 200Mhz clock frequency, which results in a 10GHz
bus frequency, if you want.
How do you prevent the SMM brain damage, when it hits 3 times in a row ?
You can not prevent it for a very simple reason: The PIT is not
necessary a PIT. It can be a fake SMM code replacement. We actually
have no idea anymore what's hardware and what's just emulated crapola
under the control of BIOS maniacs.
But we know pretty much, that the old K6 has a reliable PIT, a maybe
broken pmtimer and is pretty much unaffected from todays SMM code
disasters.
So excluding the documented breakage of K6 from using pmtimer and
keeping the pmtimer as a reference for todays SMM code wreckaged
systems is not a too bad idea. That way we can actually serve both
worlds.
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 22:56 ` Linus Torvalds
@ 2008-09-01 23:24 ` Thomas Gleixner
2008-09-02 6:37 ` Andi Kleen
0 siblings, 1 reply; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-01 23:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Linus Torvalds wrote:
> On Tue, 2 Sep 2008, Thomas Gleixner wrote:
> > >
> > > Umm. Which one. The 32-bit or the 64-bit?
> >
> > The 32 bit version.
>
> So then the PIT isn't going to work on that machine. Do you have HPET?
>
> Because if so, the most obvious choice would be to say:
>
> - on old machines, the PIT is likely more reliable than PM_TIMER
>
> - on new machines, the HPET is likely more reliable than PIT
>
> and there's actually a fairly obvious place to distinguish between old and
> new: if ti has a HPET, consider it a new one.
>
> But that just means that there's never any reason to use PM_TIMER anyway,
> and the proper patch would probably be to just ignore it. No?
There is a HPET timer on that machine, but it is only available by
enforcing the address via black magic because the BIOS does advertise
it on the wrong location and the newest BIOS version does not
advertise it at all.
I have not seen a single bug report against PM timer aside of those K6
area systems, but we have
- HPETs which are not advertised at all
- HPETs which are advertised at the wrong address
- HPETs which do not count at all
- HPETs which tell us bogus operating frequencies
- HPETs which have bogus interrupt routings
So why should we rely on the HPET ?
The whole ACPI code relies pretty much on PM_TIMER and as I said
before there are no PM-TIMER related bug reports aside of those
documented K6 ones, which Alok did probably not know about and I
forgot them as well. Yeah, that's my fault.
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 23:16 ` Thomas Gleixner
@ 2008-09-02 3:18 ` Linus Torvalds
2008-09-02 3:35 ` Linus Torvalds
2008-09-02 17:17 ` Bill Davidsen
0 siblings, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2008-09-02 3:18 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Tue, 2 Sep 2008, Thomas Gleixner wrote:
>
> We had the same problem versus the local APIC timer calibration, which
> had basically the same algorithm as the TSC one and we changed it to
> look at the PMTimer as well in the days where we debugged the initial
> wreckage caused by the nohz/highres changes.
Hmm.
So then how would you discover when it's reliable and when it's not? Just
hardcode it for certain machines?
One alternative might be to do the same "detect if it's SMM code by seeing
how long the read takes" for the PIT reads themselves. Right now the code
does it for the HPET timer read and for the PM_TIMER reads, but _not_ for
the PIT status register reads.
> How do you prevent the SMM brain damage, when it hits 3 times in a row ?
Well, the biggest problem is actually _detection_.
We have three different timers, and they all have their own problems. How
do you reliably detect which one to use? The PM_TIMER clearly is _not_
always the answer here, but the code just assumes it is!
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 3:18 ` Linus Torvalds
@ 2008-09-02 3:35 ` Linus Torvalds
2008-09-02 4:54 ` Larry Finger
` (2 more replies)
2008-09-02 17:17 ` Bill Davidsen
1 sibling, 3 replies; 56+ messages in thread
From: Linus Torvalds @ 2008-09-02 3:35 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Linus Torvalds wrote:
>
> Well, the biggest problem is actually _detection_.
>
> We have three different timers, and they all have their own problems. How
> do you reliably detect which one to use? The PM_TIMER clearly is _not_
> always the answer here, but the code just assumes it is!
On the machine you have trouble with the PIT on, does this thing trigger?
If it does, that could be a simple way to say whether you prefer PM_TIMER
over PIT.
For me, even on a modern machine, I get a pit_count of 46321, which
matches the "about one microsecond for an ISA/LPC read" timing pretty
well. What do you get?
Linus
---
arch/x86/kernel/tsc.c | 17 ++++++++++++++++-
1 files changed, 16 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 8e786b0..7cf7543 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -131,6 +131,7 @@ unsigned long native_calibrate_tsc(void)
u64 tsc1, tsc2, tr1, tr2, delta, pm1, pm2, hpet1, hpet2;
int hpet = is_hpet_enabled();
unsigned int tsc_khz_val = 0;
+ unsigned int pit_count;
local_irq_save(flags);
@@ -142,7 +143,9 @@ unsigned long native_calibrate_tsc(void)
outb((CLOCK_TICK_RATE / (1000 / 50)) & 0xff, 0x42);
outb((CLOCK_TICK_RATE / (1000 / 50)) >> 8, 0x42);
tr1 = get_cycles();
- while ((inb(0x61) & 0x20) == 0);
+ pit_count = 0;
+ while ((inb(0x61) & 0x20) == 0)
+ pit_count++;
tr2 = get_cycles();
tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
@@ -150,6 +153,18 @@ unsigned long native_calibrate_tsc(void)
local_irq_restore(flags);
/*
+ * We should have waited for about 50ms, and a real hardware
+ * PIT access _should_ take about a microsecond even if it's
+ * over an ISA bus (or something that tries to emulate timings).
+ *
+ * So pit_count should be 50,000+. If it's a lot lower, that
+ * implies it might be some emulated device rather than real
+ * hardware..
+ */
+ if (pit_count < 25000)
+ printk("Slow PIT device (%u) - emulated using SMI?\n", pit_count);
+
+ /*
* Preset the result with the raw and inaccurate PIT
* calibration value
*/
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 3:35 ` Linus Torvalds
@ 2008-09-02 4:54 ` Larry Finger
2008-09-02 9:17 ` Alan Cox
2008-09-02 12:15 ` Thomas Gleixner
2 siblings, 0 replies; 56+ messages in thread
From: Larry Finger @ 2008-09-02 4:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Thomas Gleixner, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
Linus Torvalds wrote:
>
> On the machine you have trouble with the PIT on, does this thing trigger?
>
> If it does, that could be a simple way to say whether you prefer PM_TIMER
> over PIT.
>
> For me, even on a modern machine, I get a pit_count of 46321, which
> matches the "about one microsecond for an ISA/LPC read" timing pretty
> well. What do you get?
On the problem AMD K6, the count is 29411. On my almost new AMD Turion
64 X2 that runs at 2.0 GHz, the value was 31401. A cutoff value of
25000 may be a bit close. That machine says that it calibrates against
the PM_TIMER, but cat /sys/devices/...../current_clocksource returns
hpet.
Larry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-01 23:24 ` Thomas Gleixner
@ 2008-09-02 6:37 ` Andi Kleen
2008-09-02 12:21 ` Thomas Gleixner
0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-09-02 6:37 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Linus Torvalds, Larry Finger, LKML, Rafael J. Wysocki,
Alok Kataria, Michael Buesch
Thomas Gleixner <tglx@linutronix.de> writes:
>
> The whole ACPI code relies pretty much on PM_TIMER and as I said
The Linux ACPI code does not rely on PM_TIMER in any special ways AFAIK
(except that it occasionally uses standard linux timer functions)
It merely discovers it for use by the other pmtimer users.
-Andi
--
ak@linux.intel.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 3:35 ` Linus Torvalds
2008-09-02 4:54 ` Larry Finger
@ 2008-09-02 9:17 ` Alan Cox
2008-09-02 12:15 ` Thomas Gleixner
2 siblings, 0 replies; 56+ messages in thread
From: Alan Cox @ 2008-09-02 9:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Thomas Gleixner, Larry Finger, LKML, Rafael J. Wysocki,
Alok Kataria, Michael Buesch
> On the machine you have trouble with the PIT on, does this thing trigger?
>
> If it does, that could be a simple way to say whether you prefer PM_TIMER
> over PIT.
You'll get funny answers on the Geode doing that. It will also randomly
mistrigger on some laptops - they don't use SMI for PIC but their power
handling SMIs will trigger that randomly.
PMTimer may also be an SMI too btw.
Alan
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 3:35 ` Linus Torvalds
2008-09-02 4:54 ` Larry Finger
2008-09-02 9:17 ` Alan Cox
@ 2008-09-02 12:15 ` Thomas Gleixner
2008-09-02 15:09 ` Linus Torvalds
2 siblings, 1 reply; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-02 12:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Mon, 1 Sep 2008, Linus Torvalds wrote:
> >
> > Well, the biggest problem is actually _detection_.
> >
> > We have three different timers, and they all have their own problems. How
> > do you reliably detect which one to use? The PM_TIMER clearly is _not_
> > always the answer here, but the code just assumes it is!
>
> On the machine you have trouble with the PIT on, does this thing trigger?
>
> If it does, that could be a simple way to say whether you prefer PM_TIMER
> over PIT.
>
> For me, even on a modern machine, I get a pit_count of 46321, which
> matches the "about one microsecond for an ISA/LPC read" timing pretty
> well. What do you get?
About anything between 0 and useful, but still I had cases where the
pit_count was way above the 25000 but the TSC was off by factor 2.
Went down the road and instrumented the code:
unsinged long tsc_deltas[50000];
start_pit();
tsc1 = tsc = read_tsc_start();
while (!pit_ready()) {
tsc2 = read_tsc();
tsc_deltas[pit_count++] = tsc2 - tsc;
tsc = tsc2;
}
Analysing the tsc_deltas gave interesting insight. On the affected
laptop I had several entries where the delta between two reads was
from 1msec up to 120msec maximum.
As the code does nothing else and has interrupts disabled there is
only one explanation: the SMM/SMI black hole.
If this high delta hits after the pit_count reached 25000 we still
believe that our calibration against pit is fine :(
So what I'm working on is an algorithm, which is similar to the checks
in the tsc_read_refs() function. That should allow us to detect
whether one of the reads is way off by doing a min/max detection. In
such a case we can either repeat the calibration or try to figure out
whether the pmtimer / hpet can provide us with some useful reference.
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 6:37 ` Andi Kleen
@ 2008-09-02 12:21 ` Thomas Gleixner
0 siblings, 0 replies; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-02 12:21 UTC (permalink / raw)
To: Andi Kleen
Cc: Linus Torvalds, Larry Finger, LKML, Rafael J. Wysocki,
Alok Kataria, Michael Buesch
On Tue, 2 Sep 2008, Andi Kleen wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> >
> > The whole ACPI code relies pretty much on PM_TIMER and as I said
>
> The Linux ACPI code does not rely on PM_TIMER in any special ways AFAIK
> (except that it occasionally uses standard linux timer functions)
> It merely discovers it for use by the other pmtimer users.
# grep -rc acpi_gbl_FADT.xpm_timer_block.address drivers/acpi/ | grep -v ":0"
drivers/acpi/processor_idle.c:12
I need to correct myself. It's only the processor_idle code :)
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 12:15 ` Thomas Gleixner
@ 2008-09-02 15:09 ` Linus Torvalds
2008-09-02 18:14 ` Thomas Gleixner
2008-09-05 13:45 ` Regression in 2.6.27 caused by commit bfc0f59 Mark Lord
0 siblings, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2008-09-02 15:09 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Tue, 2 Sep 2008, Thomas Gleixner wrote:
>
> Analysing the tsc_deltas gave interesting insight. On the affected
> laptop I had several entries where the delta between two reads was
> from 1msec up to 120msec maximum.
Ok, that sounds like a good approach to find if it's done by some
kind of emulation or not. Of course, any machine with SMM (even if it
doesn't emulate the PIT per se - maybe it just gets some event related to
overheating or other 'maintenance' stuff) can have occasional hickups, but
the '120msec' thing is, I think, the real clincher.
Why? Because we only try to wait for 50ms in the first place! Even if
emulation is 100% exact (or even none at all, and the PIT accesses are in
hardware), if we have a 120ms hickup while waiting for 50ms, then the end
result will obviously be total crap, and yes, that sure explains how you
can get >100% wrong values.
> So what I'm working on is an algorithm, which is similar to the checks
> in the tsc_read_refs() function. That should allow us to detect
> whether one of the reads is way off by doing a min/max detection. In
> such a case we can either repeat the calibration or try to figure out
> whether the pmtimer / hpet can provide us with some useful reference.
I think the most trivial approach would be to
- just keep track of the max TSC difference for each loop iteration.
- if the max TSC is bigger than 1% of the total TSC, then something is
already seriously wrong (either we had very few loops indeed, or some
of them were very expensive)
- perhaps loop over the calibration, and make the TSC calibration loop
increase the delay. Because even if there is a 120ms hickup, if we had
used a longer calibration delay, we'd probably not have noticed (well,
ok, 120ms is pretty damning and is probably just unfixable, but smaller
hickups are probably harmless)
Additionally doing a min/max comparison to see that the loop is very
_stable_ is of course also a way to validate things, but expecting _too_
much stability may be wrong too. As mentioned, SMM events can happen for
other reasons than emulation.
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 3:18 ` Linus Torvalds
2008-09-02 3:35 ` Linus Torvalds
@ 2008-09-02 17:17 ` Bill Davidsen
1 sibling, 0 replies; 56+ messages in thread
From: Bill Davidsen @ 2008-09-02 17:17 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Larry Finger, LKML, Rafael J. Wysocki,
Alok Kataria, Michael Buesch
Linus Torvalds wrote:
>
> On Tue, 2 Sep 2008, Thomas Gleixner wrote:
>> We had the same problem versus the local APIC timer calibration, which
>> had basically the same algorithm as the TSC one and we changed it to
>> look at the PMTimer as well in the days where we debugged the initial
>> wreckage caused by the nohz/highres changes.
>
> Hmm.
>
> So then how would you discover when it's reliable and when it's not? Just
> hardcode it for certain machines?
Looking at values for old K6 machines, I would suspect that doing the
test three times and checking the deviation would be enough. If the
timer is emulated the value will jump around and if it is stable it
could be used. Considering that this is one use code you could increase
the number of trials to five or so, keeping the high and low. If
changing values are part of the problem, make them part of the solution.
>
> One alternative might be to do the same "detect if it's SMM code by seeing
> how long the read takes" for the PIT reads themselves. Right now the code
> does it for the HPET timer read and for the PM_TIMER reads, but _not_ for
> the PIT status register reads.
>
>> How do you prevent the SMM brain damage, when it hits 3 times in a row ?
>
> Well, the biggest problem is actually _detection_.
>
> We have three different timers, and they all have their own problems. How
> do you reliably detect which one to use? The PM_TIMER clearly is _not_
> always the answer here, but the code just assumes it is!
>
> Linus
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 15:09 ` Linus Torvalds
@ 2008-09-02 18:14 ` Thomas Gleixner
2008-09-02 18:41 ` Alok Kataria
2008-09-02 18:42 ` Linus Torvalds
2008-09-05 13:45 ` Regression in 2.6.27 caused by commit bfc0f59 Mark Lord
1 sibling, 2 replies; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-02 18:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Tue, 2 Sep 2008, Linus Torvalds wrote:
> > So what I'm working on is an algorithm, which is similar to the checks
> > in the tsc_read_refs() function. That should allow us to detect
> > whether one of the reads is way off by doing a min/max detection. In
> > such a case we can either repeat the calibration or try to figure out
> > whether the pmtimer / hpet can provide us with some useful reference.
>
> I think the most trivial approach would be to
>
> - just keep track of the max TSC difference for each loop iteration.
>
> - if the max TSC is bigger than 1% of the total TSC, then something is
> already seriously wrong (either we had very few loops indeed, or some
> of them were very expensive)
I went for summing up the deltas and build an average at the
end. That's from a loop of 10 consecutive runs:
[ 0.000000] TSC min 2160 max 3732 avg 3266 pitcnt 30614
[ 0.000000] TSC min 2160 max 1036164 avg 3299 pitcnt 30310
[ 0.000000] TSC min 2160 max 1032360 avg 3303 pitcnt 30277
[ 0.000000] TSC min 2160 max 210453018 avg 69509 pitcnt 30260
Hit very late in the loop, as pitcnt is close to the others
[ 0.000000] TSC min 2160 max 3708 avg 3265 pitcnt 30624
[ 0.000000] TSC min 2160 max 3720 avg 3265 pitcnt 30622
[ 0.000000] TSC min 2160 max 1062252 avg 3301 pitcnt 30287
[ 0.000000] TSC min 2160 max 3756 avg 3267 pitcnt 30605
[ 0.000000] TSC min 2160 max 3732 avg 3267 pitcnt 30605
[ 0.000000] TSC min 2136 max 989292 avg 3297 pitcnt 30324
[ 0.000000] TSC min 2136 max 3744 avg 3266 pitcnt 30612
[ 0.000000] TSC min 2160 max 78042006 avg 78045 pitcnt 1001
This one hit early in the loop as pitcnt is pretty low.
The min value is pretty constant.
The max value for sane loops is in the range of 3708 - 3756, the
average is between 3266 and 3267.
For those which have a ~500us maximum the average is still in a sane
range. That seems to be a single glitch, which pushs the maximum, but
does not really influence the average result.
The outstanding one is the 100ms (210 453 018 ticks), where the average
is also off by factor 20.
I think that information is enough to give us a pretty precice idea
when to discard the result. I'm currently looking at the hpet/pmtimer
values for comparison and I should have a patch for testing ready
later tonight.
> - perhaps loop over the calibration, and make the TSC calibration loop
> increase the delay. Because even if there is a 120ms hickup, if we had
> used a longer calibration delay, we'd probably not have noticed (well,
> ok, 120ms is pretty damning and is probably just unfixable, but smaller
> hickups are probably harmless)
Increasing the delay is probably not a good idea as we just make the
window larger for the SMI to happen.
> Additionally doing a min/max comparison to see that the loop is very
> _stable_ is of course also a way to validate things, but expecting _too_
> much stability may be wrong too. As mentioned, SMM events can happen for
> other reasons than emulation.
Yeah, I know. One of the oddballs is the USB->PS2 keyboard emulator
which is active during early boot. We do the USB handoff definitely
after the TSC calibration. Found a box with similar (not that bad)
hickups which go away when I disable that in the BIOS settings.
Can't say anything about the laptop in that regard, because the BIOS
does not offer me a switch for that :(
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 18:14 ` Thomas Gleixner
@ 2008-09-02 18:41 ` Alok Kataria
2008-09-02 21:16 ` Thomas Gleixner
2008-09-02 18:42 ` Linus Torvalds
1 sibling, 1 reply; 56+ messages in thread
From: Alok Kataria @ 2008-09-02 18:41 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Linus Torvalds, Larry Finger, LKML, Rafael J. Wysocki,
Michael Buesch
On Tue, 2008-09-02 at 11:14 -0700, Thomas Gleixner wrote:
> On Tue, 2 Sep 2008, Linus Torvalds wrote:
> > > So what I'm working on is an algorithm, which is similar to the checks
> > > in the tsc_read_refs() function. That should allow us to detect
> > > whether one of the reads is way off by doing a min/max detection. In
> > > such a case we can either repeat the calibration or try to figure out
> > > whether the pmtimer / hpet can provide us with some useful reference.
> >
> > I think the most trivial approach would be to
> >
> > - just keep track of the max TSC difference for each loop iteration.
> >
> > - if the max TSC is bigger than 1% of the total TSC, then something is
> > already seriously wrong (either we had very few loops indeed, or some
> > of them were very expensive)
>
> I went for summing up the deltas and build an average at the
> end. That's from a loop of 10 consecutive runs:
>
> [ 0.000000] TSC min 2160 max 3732 avg 3266 pitcnt 30614
> [ 0.000000] TSC min 2160 max 1036164 avg 3299 pitcnt 30310
> [ 0.000000] TSC min 2160 max 1032360 avg 3303 pitcnt 30277
>
> [ 0.000000] TSC min 2160 max 210453018 avg 69509 pitcnt 30260
>
> Hit very late in the loop, as pitcnt is close to the others
>
> [ 0.000000] TSC min 2160 max 3708 avg 3265 pitcnt 30624
> [ 0.000000] TSC min 2160 max 3720 avg 3265 pitcnt 30622
> [ 0.000000] TSC min 2160 max 1062252 avg 3301 pitcnt 30287
> [ 0.000000] TSC min 2160 max 3756 avg 3267 pitcnt 30605
> [ 0.000000] TSC min 2160 max 3732 avg 3267 pitcnt 30605
> [ 0.000000] TSC min 2136 max 989292 avg 3297 pitcnt 30324
> [ 0.000000] TSC min 2136 max 3744 avg 3266 pitcnt 30612
>
> [ 0.000000] TSC min 2160 max 78042006 avg 78045 pitcnt 1001
>
> This one hit early in the loop as pitcnt is pretty low.
>
> The min value is pretty constant.
>
> The max value for sane loops is in the range of 3708 - 3756, the
> average is between 3266 and 3267.
>
> For those which have a ~500us maximum the average is still in a sane
> range. That seems to be a single glitch, which pushs the maximum, but
> does not really influence the average result.
>
> The outstanding one is the 100ms (210 453 018 ticks), where the average
> is also off by factor 20.
>
> I think that information is enough to give us a pretty precice idea
> when to discard the result. I'm currently looking at the hpet/pmtimer
> values for comparison and I should have a patch for testing ready
> later tonight.
>
Sorry for joining the party this late...am still going through all my
mails.
Ok, so from what I understand until now, we will calibrate TSC against
PIT as was done in 32bit code and use that as default. If that fails to
give any sane results we will fall back to calibrating against PM_timer
or HPET ?
Thomas has already explained the problem with 32bit calibration ( i.e.
just against PIT and no checks for SMI's and all) but would like to
point that this problem is lot more worse in virtualized environment,
because we may fail to get sane values even from multiple loops of
calibrating against PIT.
If we have a fall back mechanism to detect this SMI event, and then try
calibrating against PM timer or HPET we should be good.
Anyways I will wait to see the patch.
Thanks,
Alok
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 18:14 ` Thomas Gleixner
2008-09-02 18:41 ` Alok Kataria
@ 2008-09-02 18:42 ` Linus Torvalds
2008-09-02 21:13 ` Thomas Gleixner
2008-09-02 22:54 ` [PATCH] Fix TSC calibration issues Thomas Gleixner
1 sibling, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2008-09-02 18:42 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Tue, 2 Sep 2008, Thomas Gleixner wrote:
>
> Yeah, I know. One of the oddballs is the USB->PS2 keyboard emulator
> which is active during early boot. We do the USB handoff definitely
> after the TSC calibration. Found a box with similar (not that bad)
> hickups which go away when I disable that in the BIOS settings.
>
> Can't say anything about the laptop in that regard, because the BIOS
> does not offer me a switch for that :(
Dang.
We actually _used_ to do this fairly early for UHCI, since it was simple
and very useful to do, and not doing it caused serious problems with the
USB stack.
But then it got moved to the USB stack itself, and for some reason the
DECLARE_PCI_FIXUP_HEADER got turned into a DECLARE_PCI_FIXUP_FINAL instead
(probably because the USB layer expected things to be fairly set up),
which means that it now happens much much later.
But I guess even DECLARE_PCI_FIXUP_HEADER was is too late for TSC, so it
doesn't much matter. Even with the old setup, it would be too late.
You're definitely right that this could easily be the _real_ problem.
Especially as your TSC min value of 2160 is (a) pretty close to the
expected time of a microsecond and (b) so stable that I actually do not
believe that the PIT itself is at all emulated or the problem.
Btw - as to caring about the average value: that's pointless. If you only
look at the average time the PIT read takes place, then it is going to
approximate that "pit_count" thing in the end that I already did.
Why? Because the average value should essentially end up being "(end_tsc -
start_tsc) / pit_count". And if you just compare that to "min_tsc", then
that should always be about a microsecond (on normal machines where the
PIT is essentially on the old emulated internal "ISA" bus on the
southbridge). So you end up with what I already posted, and you already
dismissed.
So average TSC is not any more interesting than "pit_count".
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 18:42 ` Linus Torvalds
@ 2008-09-02 21:13 ` Thomas Gleixner
2008-09-02 22:21 ` Linus Torvalds
2008-09-02 22:54 ` [PATCH] Fix TSC calibration issues Thomas Gleixner
1 sibling, 1 reply; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-02 21:13 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Tue, 2 Sep 2008, Linus Torvalds wrote:
>
> You're definitely right that this could easily be the _real_ problem.
> Especially as your TSC min value of 2160 is (a) pretty close to the
> expected time of a microsecond and (b) so stable that I actually do not
> believe that the PIT itself is at all emulated or the problem.
On that box, the PIT is probably real hardware or a damned good
emulation. When you look at the 10 loop values you see that it does
50% perfectly fine calibration loops. The others are just SMI
interruptions caused by random unknown crap in the BIOS.
> Btw - as to caring about the average value: that's pointless. If you only
> look at the average time the PIT read takes place, then it is going to
> approximate that "pit_count" thing in the end that I already did.
>
> Why? Because the average value should essentially end up being "(end_tsc -
> start_tsc) / pit_count". And if you just compare that to "min_tsc", then
> that should always be about a microsecond (on normal machines where the
> PIT is essentially on the old emulated internal "ISA" bus on the
> southbridge). So you end up with what I already posted, and you already
> dismissed.
>
> So average TSC is not any more interesting than "pit_count".
Yeah, you're right. Math is hard :)
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 18:41 ` Alok Kataria
@ 2008-09-02 21:16 ` Thomas Gleixner
0 siblings, 0 replies; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-02 21:16 UTC (permalink / raw)
To: Alok Kataria
Cc: Linus Torvalds, Larry Finger, LKML, Rafael J. Wysocki,
Michael Buesch
On Tue, 2 Sep 2008, Alok Kataria wrote:
> On Tue, 2008-09-02 at 11:14 -0700, Thomas Gleixner wrote:
> > I think that information is enough to give us a pretty precice idea
> > when to discard the result. I'm currently looking at the hpet/pmtimer
> > values for comparison and I should have a patch for testing ready
> > later tonight.
> >
> Sorry for joining the party this late...am still going through all my
> mails.
>
> Ok, so from what I understand until now, we will calibrate TSC against
> PIT as was done in 32bit code and use that as default. If that fails to
> give any sane results we will fall back to calibrating against PM_timer
> or HPET ?
> Thomas has already explained the problem with 32bit calibration ( i.e.
> just against PIT and no checks for SMI's and all) but would like to
> point that this problem is lot more worse in virtualized environment,
> because we may fail to get sane values even from multiple loops of
> calibrating against PIT.
> If we have a fall back mechanism to detect this SMI event, and then try
> calibrating against PM timer or HPET we should be good.
I still keep the fallback against pmtimer/hpet alive.
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 21:13 ` Thomas Gleixner
@ 2008-09-02 22:21 ` Linus Torvalds
2008-09-02 23:10 ` Thomas Gleixner
0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2008-09-02 22:21 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Tue, 2 Sep 2008, Thomas Gleixner wrote:
>
> On that box, the PIT is probably real hardware or a damned good
> emulation. When you look at the 10 loop values you see that it does
> 50% perfectly fine calibration loops. The others are just SMI
> interruptions caused by random unknown crap in the BIOS.
Ok, so I actually think I know how to resolve the problem once and for
all.
The solution is actually fairly simple: we use the HPET algorithm. The
reason the HPET algorithm is so robust is that
- we can actually read the frequency from the HPET itself
- we also simply just read the counter values from the HPET, and so it
doesn't really matter how much time has passed between the two reads,
it only matters that _some_ time has passed, and that we pick _one_
stable read that we can associate with a TSC value.
But the thing is, the exact same thing is actually true of the old PIT
timer too - except we simply don't take advantage of it. The PIT timer has
a very well known frequency value (PIT_TICK_RATE: 1193180 Hz), and we can
trivially read the counter value too.
But the thing is, that for some forgotten reason, that's not actually what
we do. Instead of reading the counter value, we wait until it counts down
to zero, and read the output value instead. So instead of having a nice
and dependable counter that ticks down (16 bits of precision), we actually
use a _single_ bit of result, and depend on reading the TSC at the same
time.
That's kind of sad.
I'll try to whip up a test-patch to do this in a smarter way.
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH] Fix TSC calibration issues
2008-09-02 18:42 ` Linus Torvalds
2008-09-02 21:13 ` Thomas Gleixner
@ 2008-09-02 22:54 ` Thomas Gleixner
2008-09-03 2:14 ` Linus Torvalds
2008-09-03 2:51 ` [PATCH] Fix TSC calibration issues Larry Finger
1 sibling, 2 replies; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-02 22:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
Larry Finger reported at http://lkml.org/lkml/2008/9/1/90:
An ancient laptop of mine started throwing errors from b43legacy when
I started using 2.6.27 on it. This has been bisected to commit bfc0f59
"x86: merge tsc calibration".
The unification of the TSC code adopted mostly the 64bit code, which
prefers PMTIMER/HPET over the PIT calibration.
Larrys system has an AMD K6 CPU. Such systems are known to have
PMTIMER incarnations which run at double speed. This results in a
miscalibration of the TSC by factor 0.5. So the resulting calibrated
CPU/TSC speed is half of the real CPU speed, which means that the TSC
based delay loop will run half the time it should run. That might
explain why the b43legacy driver went berserk.
On the other hand we know about systems, where the PIT based
calibration results in random crap due to heavy SMI/SMM
disturbance. On those systems the PMTIMER/HPET based calibration logic
with SMI detection shows better results.
According to Alok also virtualized systems suffer from the PIT
calibration method.
The solution is to use a more wreckage aware aproach than the current
either/or decision.
1) reimplement the retry loop which was dropped from the 32bit code
during the merge. It repeats the calibration and selects the lowest
frequency value as this is probably the closest estimate to the real
frequency
2) Monitor the delta of the TSC values in the delay loop which waits
for the PIT counter to reach zero. If the maximum value is
significantly different from the minimum, then we have a pretty safe
indicator that the loop was disturbed by an SMI.
3) keep the pmtimer/hpet reference as a backup solution for systems
where the SMI disturbance is a permanent point of failure for PIT
based calibration
4) do the loop iteration for both methods, record the lowest value and
decide after all iterations finished.
5) Set a clear preference to PIT based calibration when the result
makes sense.
The implementation does the reference calibration based on
HPET/PMTIMER around the delay, which is necessary for the PIT anyway,
but keeps separate TSC values to ensure the "independency" of the
resulting calibration values.
Tested on various 32bit/64bit machines including Geode 266Mhz, AMD K6
(affected machine with a double speed pmtimer which I grabbed out of
the dump), Pentium class machines and AMD/Intel 64 bit boxen.
Bisected-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
arch/x86/kernel/tsc.c | 235 ++++++++++++++++++++++++++++++++++++++------------
1 file changed, 181 insertions(+), 54 deletions(-)
Index: linux-2.6/arch/x86/kernel/tsc.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/tsc.c
+++ linux-2.6/arch/x86/kernel/tsc.c
@@ -127,74 +127,201 @@ static u64 tsc_read_refs(u64 *pm, u64 *h
*/
unsigned long native_calibrate_tsc(void)
{
- unsigned long flags;
- u64 tsc1, tsc2, tr1, tr2, delta, pm1, pm2, hpet1, hpet2;
- int hpet = is_hpet_enabled();
- unsigned int tsc_khz_val = 0;
+ u64 tsc1, tsc2, tr1, tr2, tsc, delta, pm1, pm2, hpet1, hpet2;
+ unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX;
+ unsigned long flags, tscmin, tscmax;
+ int hpet = is_hpet_enabled(), pitcnt, i;
- local_irq_save(flags);
-
- tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
-
- outb((inb(0x61) & ~0x02) | 0x01, 0x61);
-
- outb(0xb0, 0x43);
- outb((CLOCK_TICK_RATE / (1000 / 50)) & 0xff, 0x42);
- outb((CLOCK_TICK_RATE / (1000 / 50)) >> 8, 0x42);
- tr1 = get_cycles();
- while ((inb(0x61) & 0x20) == 0);
- tr2 = get_cycles();
+ /*
+ * Run 5 calibration loops to get the lowest frequency value
+ * (the best estimate). We use two different calibration modes
+ * here:
+ *
+ * 1) PIT loop. We set the PIT Channel 2 to oneshot mode and
+ * load a timeout of 50ms. We read the time right after we
+ * started the timer and wait until the PIT count down reaches
+ * zero. In each wait loop iteration we read the TSC and check
+ * the delta to the previous read. We keep track of the min
+ * and max values of that delta. The delta is mostly defined
+ * by the IO time of the PIT access, so we can detect when a
+ * SMI/SMM disturbance happend between the two reads. If the
+ * maximum time is significantly larger than the minimum time,
+ * then we discard the result and have another try.
+ *
+ * 2) Reference counter. If available we use the HPET or the
+ * PMTIMER as a reference to check the sanity of that value.
+ * We use separate TSC readouts and check inside of the
+ * reference read for a SMI/SMM disturbance. We dicard
+ * disturbed values here as well. We do that around the PIT
+ * calibration delay loop as we have to wait for a certain
+ * amount of time anyway.
+ */
+ for (i = 0; i < 5; i++) {
- tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
+ tscmin = ULONG_MAX;
+ tscmax = 0;
+ pitcnt = 0;
+
+ local_irq_save(flags);
+
+ /*
+ * Read the start value and the reference count of
+ * hpet/pmtimer when available:
+ */
+ tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
+
+ /* Set the Gate high, disable speaker */
+ outb((inb(0x61) & ~0x02) | 0x01, 0x61);
+
+ /*
+ * Setup CTC channel 2* for mode 0, (interrupt on terminal
+ * count mode), binary count. Set the latch register to 50ms
+ * (LSB then MSB) to begin countdown.
+ *
+ * Some devices need a delay here.
+ */
+ outb(0xb0, 0x43);
+ outb((CLOCK_TICK_RATE / (1000 / 50)) & 0xff, 0x42);
+ outb((CLOCK_TICK_RATE / (1000 / 50)) >> 8, 0x42);
+
+ tsc = tr1 = tr2 = get_cycles();
+
+ while ((inb(0x61) & 0x20) == 0) {
+ tr2 = get_cycles();
+ delta = tr2 - tsc;
+ tsc = tr2;
+ if ((unsigned int) delta < tscmin)
+ tscmin = (unsigned int) delta;
+ if ((unsigned int) delta > tscmax)
+ tscmax = (unsigned int) delta;
+ pitcnt++;
+ }
+
+ /*
+ * We waited at least 50ms above. Now read
+ * pmtimer/hpet reference again
+ */
+ tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
+
+ local_irq_restore(flags);
+
+ /*
+ * Sanity checks:
+ *
+ * If we were not able to read the PIT more than 5000
+ * times, then we have been hit by a massive SMI
+ *
+ * If the maximum is 10 times larger than the minimum,
+ * then we got hit by an SMI as well.
+ */
+ if (pitcnt > 5000 && tscmax < 10 * tscmin) {
+
+ /* Calculate the PIT value */
+ delta = tr2 - tr1;
+ do_div(delta, 50);
+
+ /* We take the smallest value into account */
+ tsc_pit_min = min(tsc_pit_min, (unsigned long) delta);
+ }
+
+ /* hpet or pmtimer available ? */
+ if (!hpet && !pm1 && !pm2)
+ continue;
+
+ /* Check, whether the sampling was disturbed by an SMI */
+ if (tsc1 == ULLONG_MAX || tsc2 == ULLONG_MAX)
+ continue;
+
+ tsc2 = (tsc2 - tsc1) * 1000000LL;
+
+ if (hpet) {
+ if (hpet2 < hpet1)
+ hpet2 += 0x100000000ULL;
+ hpet2 -= hpet1;
+ tsc1 = ((u64)hpet2 * hpet_readl(HPET_PERIOD));
+ do_div(tsc1, 1000000);
+ } else {
+ if (pm2 < pm1)
+ pm2 += (u64)ACPI_PM_OVRRUN;
+ pm2 -= pm1;
+ tsc1 = pm2 * 1000000000LL;
+ do_div(tsc1, PMTMR_TICKS_PER_SEC);
+ }
- local_irq_restore(flags);
+ do_div(tsc2, tsc1);
+ tsc_ref_min = min(tsc_ref_min, (unsigned long) tsc2);
+ }
/*
- * Preset the result with the raw and inaccurate PIT
- * calibration value
+ * Now check the results.
*/
- delta = (tr2 - tr1);
- do_div(delta, 50);
- tsc_khz_val = delta;
+ if (tsc_pit_min == ULONG_MAX) {
+ /* PIT gave no useful value */
+ printk(KERN_WARNING "TSC: PIT calibration failed due to "
+ "SMI disturbance.\n");
+
+ /* We don't have an alternative source, disable TSC */
+ if (!hpet && !pm1 && !pm2) {
+ printk("TSC: No reference (HPET/PMTIMER) available\n");
+ return 0;
+ }
+
+ /* The alternative source failed as well, disable TSC */
+ if (tsc_ref_min == ULONG_MAX) {
+ printk(KERN_WARNING "TSC: HPET/PMTIMER calibration "
+ "failed due to SMI disturbance.\n");
+ return 0;
+ }
+
+ /* Use the alternative source */
+ printk(KERN_INFO "TSC: using %s reference calibration\n",
+ hpet ? "HPET" : "PMTIMER");
+
+ return tsc_ref_min;
+ }
- /* hpet or pmtimer available ? */
+ /* We don't have an alternative source, use the PIT calibration value */
if (!hpet && !pm1 && !pm2) {
- printk(KERN_INFO "TSC calibrated against PIT\n");
- goto out;
+ printk(KERN_INFO "TSC: Using PIT calibration value\n");
+ return tsc_pit_min;
}
- /* Check, whether the sampling was disturbed by an SMI */
- if (tsc1 == ULLONG_MAX || tsc2 == ULLONG_MAX) {
- printk(KERN_WARNING "TSC calibration disturbed by SMI, "
- "using PIT calibration result\n");
- goto out;
- }
-
- tsc2 = (tsc2 - tsc1) * 1000000LL;
-
- if (hpet) {
- printk(KERN_INFO "TSC calibrated against HPET\n");
- if (hpet2 < hpet1)
- hpet2 += 0x100000000ULL;
- hpet2 -= hpet1;
- tsc1 = ((u64)hpet2 * hpet_readl(HPET_PERIOD));
- do_div(tsc1, 1000000);
- } else {
- printk(KERN_INFO "TSC calibrated against PM_TIMER\n");
- if (pm2 < pm1)
- pm2 += (u64)ACPI_PM_OVRRUN;
- pm2 -= pm1;
- tsc1 = pm2 * 1000000000LL;
- do_div(tsc1, PMTMR_TICKS_PER_SEC);
+ /* The alternative source failed, use the PIT calibration value */
+ if (tsc_ref_min == ULONG_MAX) {
+ printk(KERN_WARNING "TSC: HPET/PMTIMER calibration failed due "
+ "to SMI disturbance. Using PIT calibration\n");
+ return tsc_pit_min;
}
- do_div(tsc2, tsc1);
- tsc_khz_val = tsc2;
+ /* Check the reference deviation */
+ delta = ((u64) tsc_pit_min) * 100;
+ do_div(delta, tsc_ref_min);
-out:
- return tsc_khz_val;
-}
+ /*
+ * If both calibration results are inside a 5% window, the we
+ * use the lower frequency of those as it is probably the
+ * closest estimate.
+ */
+ if (delta >= 95 && delta <= 105) {
+ printk(KERN_INFO "TSC: PIT calibration confirmed by %s.\n",
+ hpet ? "HPET" : "PMTIMER");
+ printk(KERN_INFO "TSC: using %s calibration value\n",
+ tsc_pit_min <= tsc_ref_min ? "PIT" :
+ hpet ? "HPET" : "PMTIMER");
+ return tsc_pit_min <= tsc_ref_min ? tsc_pit_min : tsc_ref_min;
+ }
+ printk(KERN_WARNING "TSC: PIT calibration deviates from %s: %lu %lu.\n",
+ hpet ? "HPET" : "PMTIMER", tsc_pit_min, tsc_ref_min);
+
+ /*
+ * The calibration values differ too much. In doubt, we use
+ * the PIT value as we know that there are PMTIMERs around
+ * running at double speed.
+ */
+ printk(KERN_INFO "TSC: Using PIT calibration value\n");
+ return tsc_pit_min;
+}
#ifdef CONFIG_X86_32
/* Only called from the Powernow K7 cpu freq driver */
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 22:21 ` Linus Torvalds
@ 2008-09-02 23:10 ` Thomas Gleixner
2008-09-03 1:49 ` Linus Torvalds
0 siblings, 1 reply; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-02 23:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Tue, 2 Sep 2008, Linus Torvalds wrote:
> On Tue, 2 Sep 2008, Thomas Gleixner wrote:
> >
> > On that box, the PIT is probably real hardware or a damned good
> > emulation. When you look at the 10 loop values you see that it does
> > 50% perfectly fine calibration loops. The others are just SMI
> > interruptions caused by random unknown crap in the BIOS.
>
> Ok, so I actually think I know how to resolve the problem once and for
> all.
>
> The solution is actually fairly simple: we use the HPET algorithm. The
> reason the HPET algorithm is so robust is that
>
> - we can actually read the frequency from the HPET itself
>
> - we also simply just read the counter values from the HPET, and so it
> doesn't really matter how much time has passed between the two reads,
> it only matters that _some_ time has passed, and that we pick _one_
> stable read that we can associate with a TSC value.
>
> But the thing is, the exact same thing is actually true of the old PIT
> timer too - except we simply don't take advantage of it. The PIT timer has
> a very well known frequency value (PIT_TICK_RATE: 1193180 Hz), and we can
> trivially read the counter value too.
Except for the couple of exceptions, where the readout of the old PIT
timer is broken. See arch/x86/kernel/i8253.c:pit_read()
Thought about that already and discarded it as it is basically the
same problem as we have versus _ONE_ AMD K6 family pmtimer
incarnation.
> But the thing is, that for some forgotten reason, that's not actually what
> we do. Instead of reading the counter value, we wait until it counts down
> to zero, and read the output value instead. So instead of having a nice
> and dependable counter that ticks down (16 bits of precision), we actually
> use a _single_ bit of result, and depend on reading the TSC at the same
> time.
>
> That's kind of sad.
For a reason.
> I'll try to whip up a test-patch to do this in a smarter way.
I'm fine with either solution as long it works on _ALL_ kind of broken
hardware. Believe me or not, but since I work on the whole timer
issue, I have not seen anything really reliable in the x86 world.
That's the really sad part, that the hardware dudes did not learn
anything about the importance of timing and timekeeping within 20
years.
I'm tearing my hair on a regular base when I try to get down to the
root cause of timer related wreckage in several (including todays)
generations of CPU technology in x86 land. Other architectures have
their odds and ends as well, but the vast majority of implementations
is pretty straight forward and usable without restrictions.
x86 is definitely the ultimate winner of the all time "timer ignorance
award".
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 23:10 ` Thomas Gleixner
@ 2008-09-03 1:49 ` Linus Torvalds
0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2008-09-03 1:49 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Wed, 3 Sep 2008, Thomas Gleixner wrote:
>
> Except for the couple of exceptions, where the readout of the old PIT
> timer is broken. See arch/x86/kernel/i8253.c:pit_read()
Well, the VIA problem, for example, is not an issue if you just make it do
the maximum timeout (ie count down from zero), which is what you want
_anyway_ in order to get the maximum range.
So that one is simply avoided by just programming the timer to count the
maximum possible range.
However, the counter latching itself does seem to stop the counting until
it is read, which is rather irritating. It means that you can't just latch
and read in a tight loop. An while I actually have a patch that works very
well for me, it depends on not latching, which is not strictly correct
either.
So yeah, I suspect using the current code and trying to just see when it
isn't reliable (and the "max time" seems good for that) is probably the
best option.
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-02 22:54 ` [PATCH] Fix TSC calibration issues Thomas Gleixner
@ 2008-09-03 2:14 ` Linus Torvalds
2008-09-03 9:11 ` Thomas Gleixner
2008-09-03 2:51 ` [PATCH] Fix TSC calibration issues Larry Finger
1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2008-09-03 2:14 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Wed, 3 Sep 2008, Thomas Gleixner wrote:
>
> + /*
> + * Run 5 calibration loops to get the lowest frequency value
> + * (the best estimate). We use two different calibration modes
> + * here:
> + *
> + * 1) PIT loop. We set the PIT Channel 2 to oneshot mode and
> + * load a timeout of 50ms. We read the time right after we
> + * started the timer and wait until the PIT count down reaches
> + * zero. In each wait loop iteration we read the TSC and check
> + * the delta to the previous read. We keep track of the min
> + * and max values of that delta. The delta is mostly defined
> + * by the IO time of the PIT access, so we can detect when a
> + * SMI/SMM disturbance happend between the two reads. If the
> + * maximum time is significantly larger than the minimum time,
> + * then we discard the result and have another try.
> + *
> + * 2) Reference counter. If available we use the HPET or the
> + * PMTIMER as a reference to check the sanity of that value.
> + * We use separate TSC readouts and check inside of the
> + * reference read for a SMI/SMM disturbance. We dicard
> + * disturbed values here as well. We do that around the PIT
> + * calibration delay loop as we have to wait for a certain
> + * amount of time anyway.
> + */
> + for (i = 0; i < 5; i++) {
> + tscmin = ULONG_MAX;
> + tscmax = 0;
> + pitcnt = 0;
> +
> + local_irq_save(flags);
> +
> + /*
> + * Read the start value and the reference count of
> + * hpet/pmtimer when available:
> + */
> + tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
This is "wrongish".
You really should do the
tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
...
tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
around the whole loop, because they get more exact with more time inside,
and they don't improve from looping around.
Also, that code is already _too_ unreadable. How about starting by just
moving the PIT calibration into a function of its own, like the appended
patch. And then as a separate patch, improving the heuristics for just the
PIT calibration.
The others are independent of this issue..
Linus
---
arch/x86/kernel/tsc.c | 37 ++++++++++++++++++++-----------------
1 files changed, 20 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 8e786b0..bf7ff5a 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -122,19 +122,9 @@ static u64 tsc_read_refs(u64 *pm, u64 *hpet)
return ULLONG_MAX;
}
-/**
- * native_calibrate_tsc - calibrate the tsc on boot
- */
-unsigned long native_calibrate_tsc(void)
+static unsigned long PIT_calibrate_tsc(void)
{
- unsigned long flags;
- u64 tsc1, tsc2, tr1, tr2, delta, pm1, pm2, hpet1, hpet2;
- int hpet = is_hpet_enabled();
- unsigned int tsc_khz_val = 0;
-
- local_irq_save(flags);
-
- tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
+ u64 tr1, tr2, delta;
outb((inb(0x61) & ~0x02) | 0x01, 0x61);
@@ -145,17 +135,30 @@ unsigned long native_calibrate_tsc(void)
while ((inb(0x61) & 0x20) == 0);
tr2 = get_cycles();
- tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
-
- local_irq_restore(flags);
-
/*
* Preset the result with the raw and inaccurate PIT
* calibration value
*/
delta = (tr2 - tr1);
do_div(delta, 50);
- tsc_khz_val = delta;
+ return delta;
+}
+
+/**
+ * native_calibrate_tsc - calibrate the tsc on boot
+ */
+unsigned long native_calibrate_tsc(void)
+{
+ unsigned long flags;
+ u64 tsc1, tsc2, pm1, pm2, hpet1, hpet2;
+ int hpet = is_hpet_enabled();
+ unsigned int tsc_khz_val;
+
+ local_irq_save(flags);
+ tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
+ tsc_khz_val = PIT_calibrate_tsc();
+ tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
+ local_irq_restore(flags);
/* hpet or pmtimer available ? */
if (!hpet && !pm1 && !pm2) {
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-02 22:54 ` [PATCH] Fix TSC calibration issues Thomas Gleixner
2008-09-03 2:14 ` Linus Torvalds
@ 2008-09-03 2:51 ` Larry Finger
2008-09-03 4:00 ` Linus Torvalds
1 sibling, 1 reply; 56+ messages in thread
From: Larry Finger @ 2008-09-03 2:51 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Linus Torvalds, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
Thomas Gleixner wrote:
> Larry Finger reported at http://lkml.org/lkml/2008/9/1/90:
> An ancient laptop of mine started throwing errors from b43legacy when
> I started using 2.6.27 on it. This has been bisected to commit bfc0f59
> "x86: merge tsc calibration".
>
> The unification of the TSC code adopted mostly the 64bit code, which
> prefers PMTIMER/HPET over the PIT calibration.
>
> Larrys system has an AMD K6 CPU. Such systems are known to have
> PMTIMER incarnations which run at double speed. This results in a
> miscalibration of the TSC by factor 0.5. So the resulting calibrated
> CPU/TSC speed is half of the real CPU speed, which means that the TSC
> based delay loop will run half the time it should run. That might
> explain why the b43legacy driver went berserk.
>
> On the other hand we know about systems, where the PIT based
> calibration results in random crap due to heavy SMI/SMM
> disturbance. On those systems the PMTIMER/HPET based calibration logic
> with SMI detection shows better results.
>
> According to Alok also virtualized systems suffer from the PIT
> calibration method.
>
> The solution is to use a more wreckage aware aproach than the current
> either/or decision.
>
> 1) reimplement the retry loop which was dropped from the 32bit code
> during the merge. It repeats the calibration and selects the lowest
> frequency value as this is probably the closest estimate to the real
> frequency
>
> 2) Monitor the delta of the TSC values in the delay loop which waits
> for the PIT counter to reach zero. If the maximum value is
> significantly different from the minimum, then we have a pretty safe
> indicator that the loop was disturbed by an SMI.
>
> 3) keep the pmtimer/hpet reference as a backup solution for systems
> where the SMI disturbance is a permanent point of failure for PIT
> based calibration
>
> 4) do the loop iteration for both methods, record the lowest value and
> decide after all iterations finished.
>
> 5) Set a clear preference to PIT based calibration when the result
> makes sense.
>
> The implementation does the reference calibration based on
> HPET/PMTIMER around the delay, which is necessary for the PIT anyway,
> but keeps separate TSC values to ensure the "independency" of the
> resulting calibration values.
>
> Tested on various 32bit/64bit machines including Geode 266Mhz, AMD K6
> (affected machine with a double speed pmtimer which I grabbed out of
> the dump), Pentium class machines and AMD/Intel 64 bit boxen.
>
> Bisected-by: Larry Finger <Larry.Finger@lwfinger.net>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
I know that Linus has some problems with this patch, but FWIW it
worked on my K6. The dmesg output is
TSC: PIT calibration deviates from PMTIMER: 428809 214401.
TSC: Using PIT calibration value
Detected 428.809 MHz processor.
Larry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-03 2:51 ` [PATCH] Fix TSC calibration issues Larry Finger
@ 2008-09-03 4:00 ` Linus Torvalds
2008-09-03 4:34 ` Larry Finger
0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2008-09-03 4:00 UTC (permalink / raw)
To: Larry Finger
Cc: Thomas Gleixner, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Tue, 2 Sep 2008, Larry Finger wrote:
>
> I know that Linus has some problems with this patch, but FWIW it worked on my
> K6. The dmesg output is
>
> TSC: PIT calibration deviates from PMTIMER: 428809 214401.
> TSC: Using PIT calibration value
> Detected 428.809 MHz processor.
I don't have a problem with what the patch _does_ (I think all the added
sanity checking is good).
I just don't like how it makes the already rather complex and confusing
function about ten times _more_ complex and confusing. It needs splitting
up.
But we can split it up after-the-fact as well as before-the-fact.
Does this cleanup-patch (on _top_ of Thomas' patch) also work for you? It
should do exactly the same thing, except it splits up the TSC PIT
calibration into a function of its own.
But I may obviously have introduced some silly bug when cleaning it up,
so testing necessary..
Linus
---
arch/x86/kernel/tsc.c | 132 ++++++++++++++++++++++++++----------------------
1 files changed, 71 insertions(+), 61 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index ac79bd1..346cae5 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -122,15 +122,75 @@ static u64 tsc_read_refs(u64 *pm, u64 *hpet)
return ULLONG_MAX;
}
+/*
+ * Try to calibrate the TSC against the Programmable
+ * Interrupt Timer and return the frequency of the TSC
+ * in kHz.
+ *
+ * Return ULONG_MAX on failure to calibrate.
+ */
+static unsigned long pit_calibrate_tsc(void)
+{
+ u64 tsc, t1, t2, delta;
+ unsigned long tscmin, tscmax;
+ int pitcnt;
+
+ /* Set the Gate high, disable speaker */
+ outb((inb(0x61) & ~0x02) | 0x01, 0x61);
+
+ /*
+ * Setup CTC channel 2* for mode 0, (interrupt on terminal
+ * count mode), binary count. Set the latch register to 50ms
+ * (LSB then MSB) to begin countdown.
+ */
+ outb(0xb0, 0x43);
+ outb((CLOCK_TICK_RATE / (1000 / 50)) & 0xff, 0x42);
+ outb((CLOCK_TICK_RATE / (1000 / 50)) >> 8, 0x42);
+
+ tsc = t1 = t2 = get_cycles();
+
+ pitcnt = 0;
+ tscmax = 0;
+ tscmin = ULONG_MAX;
+ while ((inb(0x61) & 0x20) == 0) {
+ t2 = get_cycles();
+ delta = t2 - tsc;
+ tsc = t2;
+ if ((unsigned long) delta < tscmin)
+ tscmin = (unsigned int) delta;
+ if ((unsigned long) delta > tscmax)
+ tscmax = (unsigned int) delta;
+ pitcnt++;
+ }
+
+ /*
+ * Sanity checks:
+ *
+ * If we were not able to read the PIT more than 5000
+ * times, then we have been hit by a massive SMI
+ *
+ * If the maximum is 10 times larger than the minimum,
+ * then we got hit by an SMI as well.
+ */
+ if (pitcnt < 5000 || tscmax > 10 * tscmin)
+ return ULONG_MAX;
+
+ /* Calculate the PIT value */
+ delta = t2 - t1;
+ do_div(delta, 50);
+ return delta;
+}
+
+
/**
* native_calibrate_tsc - calibrate the tsc on boot
*/
unsigned long native_calibrate_tsc(void)
{
- u64 tsc1, tsc2, tr1, tr2, tsc, delta, pm1, pm2, hpet1, hpet2;
+ u64 tsc1, tsc2, delta, pm1, pm2, hpet1, hpet2;
unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX;
- unsigned long flags, tscmin, tscmax;
- int hpet = is_hpet_enabled(), pitcnt, i;
+ unsigned long flags;
+ int hpet = is_hpet_enabled(), i;
/*
* Run 5 calibration loops to get the lowest frequency value
@@ -157,72 +217,22 @@ unsigned long native_calibrate_tsc(void)
* amount of time anyway.
*/
for (i = 0; i < 5; i++) {
-
- tscmin = ULONG_MAX;
- tscmax = 0;
- pitcnt = 0;
-
- local_irq_save(flags);
+ unsigned long tsc_pit_khz;
/*
* Read the start value and the reference count of
- * hpet/pmtimer when available:
+ * hpet/pmtimer when available. Then do the PIT
+ * calibration, which will take at least 50ms, and
+ * read the end value.
*/
+ local_irq_save(flags);
tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
-
- /* Set the Gate high, disable speaker */
- outb((inb(0x61) & ~0x02) | 0x01, 0x61);
-
- /*
- * Setup CTC channel 2* for mode 0, (interrupt on terminal
- * count mode), binary count. Set the latch register to 50ms
- * (LSB then MSB) to begin countdown.
- *
- * Some devices need a delay here.
- */
- outb(0xb0, 0x43);
- outb((CLOCK_TICK_RATE / (1000 / 50)) & 0xff, 0x42);
- outb((CLOCK_TICK_RATE / (1000 / 50)) >> 8, 0x42);
-
- tsc = tr1 = tr2 = get_cycles();
-
- while ((inb(0x61) & 0x20) == 0) {
- tr2 = get_cycles();
- delta = tr2 - tsc;
- tsc = tr2;
- if ((unsigned int) delta < tscmin)
- tscmin = (unsigned int) delta;
- if ((unsigned int) delta > tscmax)
- tscmax = (unsigned int) delta;
- pitcnt++;
- }
-
- /*
- * We waited at least 50ms above. Now read
- * pmtimer/hpet reference again
- */
+ tsc_pit_khz = pit_calibrate_tsc();
tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
-
local_irq_restore(flags);
- /*
- * Sanity checks:
- *
- * If we were not able to read the PIT more than 5000
- * times, then we have been hit by a massive SMI
- *
- * If the maximum is 10 times larger than the minimum,
- * then we got hit by an SMI as well.
- */
- if (pitcnt > 5000 && tscmax < 10 * tscmin) {
-
- /* Calculate the PIT value */
- delta = tr2 - tr1;
- do_div(delta, 50);
-
- /* We take the smallest value into account */
- tsc_pit_min = min(tsc_pit_min, (unsigned long) delta);
- }
+ /* Pick the lowest PIT TSC calibration so far */
+ tsc_pit_min = min(tsc_pit_min, tsc_pit_khz);
/* hpet or pmtimer available ? */
if (!hpet && !pm1 && !pm2)
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-03 4:00 ` Linus Torvalds
@ 2008-09-03 4:34 ` Larry Finger
0 siblings, 0 replies; 56+ messages in thread
From: Larry Finger @ 2008-09-03 4:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Thomas Gleixner, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
Linus Torvalds wrote:
>
> On Tue, 2 Sep 2008, Larry Finger wrote:
>> I know that Linus has some problems with this patch, but FWIW it worked on my
>> K6. The dmesg output is
>>
>> TSC: PIT calibration deviates from PMTIMER: 428809 214401.
>> TSC: Using PIT calibration value
>> Detected 428.809 MHz processor.
>
> I don't have a problem with what the patch _does_ (I think all the added
> sanity checking is good).
>
> I just don't like how it makes the already rather complex and confusing
> function about ten times _more_ complex and confusing. It needs splitting
> up.
>
> But we can split it up after-the-fact as well as before-the-fact.
>
> Does this cleanup-patch (on _top_ of Thomas' patch) also work for you? It
> should do exactly the same thing, except it splits up the TSC PIT
> calibration into a function of its own.
>
> But I may obviously have introduced some silly bug when cleaning it up,
> so testing necessary..
Same result as Thomas' version.
Larry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-03 2:14 ` Linus Torvalds
@ 2008-09-03 9:11 ` Thomas Gleixner
2008-09-04 1:14 ` Alok Kataria
2008-09-04 1:18 ` [PATCH] Change warning message in TSC calibration Alok Kataria
0 siblings, 2 replies; 56+ messages in thread
From: Thomas Gleixner @ 2008-09-03 9:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry Finger, LKML, Rafael J. Wysocki, Alok Kataria,
Michael Buesch
On Tue, 2 Sep 2008, Linus Torvalds wrote:
> This is "wrongish".
>
> You really should do the
>
> tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
> ...
> tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
>
> around the whole loop, because they get more exact with more time inside,
> and they don't improve from looping around.
True. Just kept them at the place where my debug patches had left them.
> Also, that code is already _too_ unreadable. How about starting by just
> moving the PIT calibration into a function of its own, like the appended
> patch. And then as a separate patch, improving the heuristics for just the
> PIT calibration.
Yeah, should have done that. But I was too tired to touch anything in
the code more complex than adding a few comments.
Thanks,
tglx
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-03 9:11 ` Thomas Gleixner
@ 2008-09-04 1:14 ` Alok Kataria
2008-09-04 2:56 ` Linus Torvalds
2008-09-04 1:18 ` [PATCH] Change warning message in TSC calibration Alok Kataria
1 sibling, 1 reply; 56+ messages in thread
From: Alok Kataria @ 2008-09-04 1:14 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Linus Torvalds, Larry Finger, LKML, Rafael J. Wysocki,
Michael Buesch, Dan Hecht
On Wed, 2008-09-03 at 02:11 -0700, Thomas Gleixner wrote:
> On Tue, 2 Sep 2008, Linus Torvalds wrote:
> > This is "wrongish".
> >
> > You really should do the
> >
> > tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
> > ...
> > tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
> >
> > around the whole loop, because they get more exact with more time inside,
> > and they don't improve from looping around.
>
> True. Just kept them at the place where my debug patches had left them.
Hi Thomas,
I agree with Linus that we should move the tsc_read_refs call outside of
the loop. I did those changes and ran some boot-halt tests at my end.
The frequency calibration against the pmtimer/hpet was surely more fine
tuned with this change, the variance that i see now in repeated reboots
is very minimal.
Please have a look at the patch below.
--
x86: Fine tune TSC calibration.
From: Alok N Kataria <akataria@vmware.com>
As Linus suggested, we should be moving the tsc_read_refs outside of the
loop, this gives us more accurate TSC calibration when calibrating against
hpet/pmtimer, since we are now calibrating over a period of 250ms.
With SMI_THRESHOLD equals to 50000, in the worst case, tsc values read by
tsc_read_refs, could be off by 50000 ticks. On a 2Ghz processor this would
mean an error of 25us. The tsc frequency is calibrated over a period of 250ms
with this patch, hence the worst case error would be around 100ppm down from
500ppm previously without the patch.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
---
arch/x86/kernel/tsc.c | 54 ++++++++++++++++++++++++++-----------------------
1 files changed, 29 insertions(+), 25 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 346cae5..bc2c1a2 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -188,7 +188,7 @@ static unsigned long pit_calibrate_tsc(void)
unsigned long native_calibrate_tsc(void)
{
u64 tsc1, tsc2, delta, pm1, pm2, hpet1, hpet2;
- unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX;
+ unsigned long tsc_pit_min = ULONG_MAX, tsc_ref = ULONG_MAX;
unsigned long flags;
int hpet = is_hpet_enabled(), i;
@@ -216,31 +216,35 @@ unsigned long native_calibrate_tsc(void)
* calibration delay loop as we have to wait for a certain
* amount of time anyway.
*/
+
+ local_irq_save(flags);
+
+ /*
+ * Read the start value and the reference count of
+ * hpet/pmtimer when available. Then do the PIT
+ * calibration iteratively, which will take at least 250ms,
+ * and read the end value.
+ */
+ tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
+
for (i = 0; i < 5; i++) {
unsigned long tsc_pit_khz;
- /*
- * Read the start value and the reference count of
- * hpet/pmtimer when available. Then do the PIT
- * calibration, which will take at least 50ms, and
- * read the end value.
- */
- local_irq_save(flags);
- tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
tsc_pit_khz = pit_calibrate_tsc();
- tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
- local_irq_restore(flags);
/* Pick the lowest PIT TSC calibration so far */
tsc_pit_min = min(tsc_pit_min, tsc_pit_khz);
- /* hpet or pmtimer available ? */
- if (!hpet && !pm1 && !pm2)
- continue;
+ }
+
+ tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
+ local_irq_restore(flags);
- /* Check, whether the sampling was disturbed by an SMI */
- if (tsc1 == ULLONG_MAX || tsc2 == ULLONG_MAX)
- continue;
+ /*
+ * Check that, either HPET or PM timer is available and,
+ * the sampling was not disturbed by an SMI.
+ */
+ if ((hpet || pm1) && (tsc1 != ULLONG_MAX && tsc2 != ULLONG_MAX)) {
tsc2 = (tsc2 - tsc1) * 1000000LL;
@@ -259,7 +263,7 @@ unsigned long native_calibrate_tsc(void)
}
do_div(tsc2, tsc1);
- tsc_ref_min = min(tsc_ref_min, (unsigned long) tsc2);
+ tsc_ref = tsc2;
}
/*
@@ -277,7 +281,7 @@ unsigned long native_calibrate_tsc(void)
}
/* The alternative source failed as well, disable TSC */
- if (tsc_ref_min == ULONG_MAX) {
+ if (tsc_ref == ULONG_MAX) {
printk(KERN_WARNING "TSC: HPET/PMTIMER calibration "
"failed due to SMI disturbance.\n");
return 0;
@@ -287,7 +291,7 @@ unsigned long native_calibrate_tsc(void)
printk(KERN_INFO "TSC: using %s reference calibration\n",
hpet ? "HPET" : "PMTIMER");
- return tsc_ref_min;
+ return tsc_ref;
}
/* We don't have an alternative source, use the PIT calibration value */
@@ -297,7 +301,7 @@ unsigned long native_calibrate_tsc(void)
}
/* The alternative source failed, use the PIT calibration value */
- if (tsc_ref_min == ULONG_MAX) {
+ if (tsc_ref == ULONG_MAX) {
printk(KERN_WARNING "TSC: HPET/PMTIMER calibration failed due "
"to SMI disturbance. Using PIT calibration\n");
return tsc_pit_min;
@@ -305,7 +309,7 @@ unsigned long native_calibrate_tsc(void)
/* Check the reference deviation */
delta = ((u64) tsc_pit_min) * 100;
- do_div(delta, tsc_ref_min);
+ do_div(delta, tsc_ref);
/*
* If both calibration results are inside a 5% window, the we
@@ -316,13 +320,13 @@ unsigned long native_calibrate_tsc(void)
printk(KERN_INFO "TSC: PIT calibration confirmed by %s.\n",
hpet ? "HPET" : "PMTIMER");
printk(KERN_INFO "TSC: using %s calibration value\n",
- tsc_pit_min <= tsc_ref_min ? "PIT" :
+ tsc_pit_min <= tsc_ref ? "PIT" :
hpet ? "HPET" : "PMTIMER");
- return tsc_pit_min <= tsc_ref_min ? tsc_pit_min : tsc_ref_min;
+ return tsc_pit_min <= tsc_ref ? tsc_pit_min : tsc_ref;
}
printk(KERN_WARNING "TSC: PIT calibration deviates from %s: %lu %lu.\n",
- hpet ? "HPET" : "PMTIMER", tsc_pit_min, tsc_ref_min);
+ hpet ? "HPET" : "PMTIMER", tsc_pit_min, tsc_ref);
/*
* The calibration values differ too much. In doubt, we use
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH] Change warning message in TSC calibration.
2008-09-03 9:11 ` Thomas Gleixner
2008-09-04 1:14 ` Alok Kataria
@ 2008-09-04 1:18 ` Alok Kataria
1 sibling, 0 replies; 56+ messages in thread
From: Alok Kataria @ 2008-09-04 1:18 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Linus Torvalds, Larry Finger, LKML, Rafael J. Wysocki,
Michael Buesch, Dan Hecht
x86: Change warning message in TSC calibration.
From: Alok N Kataria <akataria@vmware.com>
When calibration against PIT fails, the warning that we print is misleading.
In a virtualized environment the VM may get descheduled while calibration
or, the check in PIT calibration may fail due to other virtualization
overheads.
The warning message explicitly assumes that calibration failed due to SMI's
which may not be the case. Change that to something proper.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
---
arch/x86/kernel/tsc.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index d74b3ef..9d4aeec 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -271,8 +271,7 @@ unsigned long native_calibrate_tsc(void)
*/
if (tsc_pit_min == ULONG_MAX) {
/* PIT gave no useful value */
- printk(KERN_WARNING "TSC: PIT calibration failed due to "
- "SMI disturbance.\n");
+ printk(KERN_WARNING "TSC: Unable to calibrate against PIT\n");
/* We don't have an alternative source, disable TSC */
if (!hpet && !pm1 && !pm2) {
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-04 1:14 ` Alok Kataria
@ 2008-09-04 2:56 ` Linus Torvalds
2008-09-04 3:16 ` Arjan van de Ven
0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2008-09-04 2:56 UTC (permalink / raw)
To: Alok Kataria
Cc: Thomas Gleixner, Larry Finger, LKML, Rafael J. Wysocki,
Michael Buesch, Dan Hecht
On Wed, 3 Sep 2008, Alok Kataria wrote:
>
> As Linus suggested, we should be moving the tsc_read_refs outside of the
> loop, this gives us more accurate TSC calibration when calibrating against
> hpet/pmtimer, since we are now calibrating over a period of 250ms.
Side note: I'd like to change that.
250ms is a _loong_ time. It's a really really long time. It's
human-noticeable. A quarter of a second at boot is just too long.
We have a couple of options:
- make the PIT calibration shorter
I suspect we could easily make the PIT calibration be just 5ms rather
than 50ms. Yes, yes, it will be less precise, but the PIT counts at
1.1MHz or something, so we're still talking about 5000+ ticks of the
clock.
So we'll get within a fraction of a percent. We could then decide that
_if_ the HPET/PMTIMER matches the PIT, we'll decide to use that as the
reference time, because it would possibly have higher accuracy.
- do other things while calibrating.
Especially for HPET/PM_TIMER, we could actually do some other stuff
while it is calibrating, since those calibrations only depend on the
end/beginning being "close enough". This seems like it would be pretty
hard to actually do sanely, though (there's still the issue of the
HPET/PM_TIMER overflowing, although with 24 bits that should be on the
order of several seconds, I dunno)
Anyway, 250ms really is too long.
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-04 2:56 ` Linus Torvalds
@ 2008-09-04 3:16 ` Arjan van de Ven
2008-09-04 3:59 ` Linus Torvalds
0 siblings, 1 reply; 56+ messages in thread
From: Arjan van de Ven @ 2008-09-04 3:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alok Kataria, Thomas Gleixner, Larry Finger, LKML,
Rafael J. Wysocki, Michael Buesch, Dan Hecht
On Wed, 3 Sep 2008 19:56:10 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Wed, 3 Sep 2008, Alok Kataria wrote:
> >
> > As Linus suggested, we should be moving the tsc_read_refs outside
> > of the loop, this gives us more accurate TSC calibration when
> > calibrating against hpet/pmtimer, since we are now calibrating over
> > a period of 250ms.
>
> Side note: I'd like to change that.
>
> 250ms is a _loong_ time. It's a really really long time. It's
> human-noticeable. A quarter of a second at boot is just too long.
yeah it's insane ;-(
(since we boot a kernel in 1 second and a full OS including the kernel
in 5)
> - do other things while calibrating.
one option for that is the fastboot patchkit in -next ;-)
but at some point, even doing things in parallel/asynchronous isn't
helping, "parallel shit is still shit" :)
--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-04 3:16 ` Arjan van de Ven
@ 2008-09-04 3:59 ` Linus Torvalds
2008-09-04 4:10 ` Arjan van de Ven
2008-09-04 4:25 ` Willy Tarreau
0 siblings, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2008-09-04 3:59 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Alok Kataria, Thomas Gleixner, Larry Finger, LKML,
Rafael J. Wysocki, Michael Buesch, Dan Hecht
On Wed, 3 Sep 2008, Arjan van de Ven wrote:
>
> but at some point, even doing things in parallel/asynchronous isn't
> helping, "parallel shit is still shit" :)
Well, the thing is, you can't call ti "shit" when the fact is that we
don't have any other options than to wait.
The only frequency we can trust on 99% of all machines is the PIT, and
it's a very uncomfortable programming model due to all the history (it is
one of the few truly 8-bit things left in a modern PC). The other options
are just not reliably there, or are known to not have a stable frequency.
So how would you suggest we do it? Lowering the wait to 5ms (times 5, so
it's really 25ms, although we can probably stop early if the first
iterations are very consistent) will work, but it _will_ reduce precision.
And it's still real time.
But we simply don't have alternatives. That 'shit' is originally from the
company you work for, btw, and while it was good for its time, the
replacement (HPET) was horribly misdesigned by the same company, and is
deficient in many ways (not the least of which is the idiotic enumeration:
another ACPI braindamage), and it often isn't even exposed.
As a result, the PIT remains to this day the most reliable source of a
reference timer. That includes even on really modern machines (ie the one
I have from Intel that contains hardware not even released yet!).
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-04 3:59 ` Linus Torvalds
@ 2008-09-04 4:10 ` Arjan van de Ven
2008-09-04 4:20 ` Linus Torvalds
2008-09-04 4:25 ` Willy Tarreau
1 sibling, 1 reply; 56+ messages in thread
From: Arjan van de Ven @ 2008-09-04 4:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alok Kataria, Thomas Gleixner, Larry Finger, LKML,
Rafael J. Wysocki, Michael Buesch, Dan Hecht
On Wed, 3 Sep 2008 20:59:05 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> The only frequency we can trust on 99% of all machines is the PIT,
pmtimer is also quite ok, with the exception of some K6 based boxes.
(I'm surprised the K6 boxes even have enough modern stuff to have
pmtimer; I'd think they would fall under the date cutoff)
> and it's a very uncomfortable programming model due to all the
> history (it is one of the few truly 8-bit things left in a modern
> PC). The other options are just not reliably there, or are known to
> not have a stable frequency.
>
> So how would you suggest we do it? Lowering the wait to 5ms (times 5,
> so it's really 25ms, although we can probably stop early if the first
> iterations are very consistent) will work, but it _will_ reduce
> precision. And it's still real time.
one of the options we have is to start with an initial
rough-but-conservative estimate, and refine it over time as the system
is running.... sort of like ntp but for the calibration.
another option for calibrating the tsc rate is to read it from the
msr's/cpuid/aperf of what the hardware says it should be, and then all
we need is to verify it is that; that we could do over timer or quickly.
(of course that only works for systems with constant tsc)
--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-04 4:10 ` Arjan van de Ven
@ 2008-09-04 4:20 ` Linus Torvalds
2008-09-04 4:27 ` Arjan van de Ven
0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2008-09-04 4:20 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Alok Kataria, Thomas Gleixner, Larry Finger, LKML,
Rafael J. Wysocki, Michael Buesch, Dan Hecht
On Wed, 3 Sep 2008, Arjan van de Ven wrote:
>
> On Wed, 3 Sep 2008 20:59:05 -0700 (PDT)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > The only frequency we can trust on 99% of all machines is the PIT,
>
> pmtimer is also quite ok, with the exception of some K6 based boxes.
> (I'm surprised the K6 boxes even have enough modern stuff to have
> pmtimer; I'd think they would fall under the date cutoff)
Quite frankly, pmtimer isn't all that much better than PIT. It has a
slightly bigger range, but it has a much more limited format, and it
doesn't have a reliable frequency. It was designed for something else.
At least HPET is clearly better than PIT as a _timer_.
All the big HPET problems are with its idiotic interface.
Of course, in any _sane_ situation, the timer really would have been in
the local APIC instead, with a fixed and architected frequency, and it
should run in all power states. But noo, that obviously won't ever work,
because that would have been _sensible_.
> one of the options we have is to start with an initial
> rough-but-conservative estimate, and refine it over time as the system
> is running.... sort of like ntp but for the calibration.
I do agree that we could aim for something like that. But even to get the
rough estimate, we'd probably have to do the 5ms thing.
> another option for calibrating the tsc rate is to read it from the
> msr's/cpuid/aperf of what the hardware says it should be, and then all
> we need is to verify it is that; that we could do over timer or quickly.
> (of course that only works for systems with constant tsc)
I don't think it's reliable even for systems with a constant TSC. Because
the msr/cpuid thing isn't going to actualyl give the right frequency. It
might be the frequency the thing is _rated_ at, but it will be off when
people over- or under-clock the front-side bus etc.
This is why it's so important that the clock input be a _known_ frequency.
The thing that makes the PIT still so useful is not that it's a good
timer, but that we *know* the frequency it runs at.
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-04 3:59 ` Linus Torvalds
2008-09-04 4:10 ` Arjan van de Ven
@ 2008-09-04 4:25 ` Willy Tarreau
2008-09-04 4:53 ` Linus Torvalds
1 sibling, 1 reply; 56+ messages in thread
From: Willy Tarreau @ 2008-09-04 4:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Arjan van de Ven, Alok Kataria, Thomas Gleixner, Larry Finger,
LKML, Rafael J. Wysocki, Michael Buesch, Dan Hecht
On Wed, Sep 03, 2008 at 08:59:05PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 3 Sep 2008, Arjan van de Ven wrote:
> >
> > but at some point, even doing things in parallel/asynchronous isn't
> > helping, "parallel shit is still shit" :)
>
> Well, the thing is, you can't call ti "shit" when the fact is that we
> don't have any other options than to wait.
>
> The only frequency we can trust on 99% of all machines is the PIT, and
> it's a very uncomfortable programming model due to all the history (it is
> one of the few truly 8-bit things left in a modern PC). The other options
> are just not reliably there, or are known to not have a stable frequency.
>
> So how would you suggest we do it? Lowering the wait to 5ms (times 5, so
> it's really 25ms, although we can probably stop early if the first
> iterations are very consistent) will work, but it _will_ reduce precision.
> And it's still real time.
>
> But we simply don't have alternatives. That 'shit' is originally from the
> company you work for, btw, and while it was good for its time, the
> replacement (HPET) was horribly misdesigned by the same company, and is
> deficient in many ways (not the least of which is the idiotic enumeration:
> another ACPI braindamage), and it often isn't even exposed.
>
> As a result, the PIT remains to this day the most reliable source of a
> reference timer. That includes even on really modern machines (ie the one
> I have from Intel that contains hardware not even released yet!).
15 years ago when I only knew DOS, I used the PIT a lot for precise
delay calculations. I can attest that it can be a very precise timer
for delays when you run busy loops. You even need very few ticks because
you detect the falling edge with a high accuracy. Basically, I would
do this :
pit1 = readpit();
while (readpit() == pit1);
t1 = rdtsc(); // precise beginning of tick 0
while (readpit() != pit1 - 5000);
t2 = rdtsc(); // precise beginning of tick 5000
(t2 - t1) will be exactly 5000 PIT ticks long, or 4.1904767 ms.
Additional sanity checks are needed of course, such as rollover
detection, and a max loop counter in case we boot on a machine
with a broken PIT.
If someone wants to test this, I'd be interested in the number of
ticks required to get a good accuracy, I bet that even with a few
hundred ones it's already precise by a few ppm (about the precision
of the input clock in fact).
Willy
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-04 4:20 ` Linus Torvalds
@ 2008-09-04 4:27 ` Arjan van de Ven
0 siblings, 0 replies; 56+ messages in thread
From: Arjan van de Ven @ 2008-09-04 4:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alok Kataria, Thomas Gleixner, Larry Finger, LKML,
Rafael J. Wysocki, Michael Buesch, Dan Hecht
On Wed, 3 Sep 2008 21:20:23 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> I do agree that we could aim for something like that. But even to get
> the rough estimate, we'd probably have to do the 5ms thing.
5msec isn't too big a deal; 250ms otoh ;-)
>
> > another option for calibrating the tsc rate is to read it from the
> > msr's/cpuid/aperf of what the hardware says it should be, and then
> > all we need is to verify it is that; that we could do over timer or
> > quickly. (of course that only works for systems with constant tsc)
>
> I don't think it's reliable even for systems with a constant TSC.
> Because the msr/cpuid thing isn't going to actualyl give the right
> frequency. It might be the frequency the thing is _rated_ at, but it
> will be off when people over- or under-clock the front-side bus etc.
for 99%+ of the systems it'll be extremely close. We do need to
validate it to detect the overclocking etc (and if the calibration says
"it smells funny" we can do the slow 250msec thing)
--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-04 4:25 ` Willy Tarreau
@ 2008-09-04 4:53 ` Linus Torvalds
2008-09-04 5:09 ` Willy Tarreau
0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2008-09-04 4:53 UTC (permalink / raw)
To: Willy Tarreau
Cc: Arjan van de Ven, Alok Kataria, Thomas Gleixner, Larry Finger,
LKML, Rafael J. Wysocki, Michael Buesch, Dan Hecht
On Thu, 4 Sep 2008, Willy Tarreau wrote:
>
> Basically, I would do this :
>
> pit1 = readpit();
> while (readpit() == pit1);
> t1 = rdtsc(); // precise beginning of tick 0
> while (readpit() != pit1 - 5000);
> t2 = rdtsc(); // precise beginning of tick 5000
There's a few caveats here:
- the "readpit()" has to read without actually latching the value
latching the PIT value will stop counting.
- and all the docs say that you have to be careful about reading the PIT
without latching it because the two 8-bit accesses aren't atomic.
so the above will work in practice, but there are dangers.
The best way to fix most of the dangers is probably to only care about the
*high* byte, so that it doesn't matter if the low byte doesn't match the
high byte.
So you could probably change your version to wait for 4096 cycles (a
change of 16 in the high byte):
static unsigned char read_pit_msb(void)
{
/* Read but throw away the LSB */
inb(0x42);
return inb(0x42);
}
..
/* PIT ch2: square wave, full 16-bit count */
outb(0xb6, 0x43);
outb(0, 0x42);
outb(0, 0x42);
..
unsigned char pit = read_pit_msb();
/* Wait until the MSB changes */
while (read_pit_msb() == pit1);
t1 = rdtsc();
while ((unsigned char) (pit - read_pit_msb()) < 9);
t2 = rdtsc();
and it might work out ok without explicit latching, and without having to
worry about low/high bytes being out of sync.
> If someone wants to test this, I'd be interested in the number of
> ticks required to get a good accuracy, I bet that even with a few
> hundred ones it's already precise by a few ppm (about the precision
> of the input clock in fact).
I actually tested a patch with a counter value of just 1024, and I got the
right answer.
But if the busy loops aren't busy (due to MSI or virtualization), then all
those things fly out the window.
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH] Fix TSC calibration issues
2008-09-04 4:53 ` Linus Torvalds
@ 2008-09-04 5:09 ` Willy Tarreau
0 siblings, 0 replies; 56+ messages in thread
From: Willy Tarreau @ 2008-09-04 5:09 UTC (permalink / raw)
To: Linus Torvalds
Cc: Arjan van de Ven, Alok Kataria, Thomas Gleixner, Larry Finger,
LKML, Rafael J. Wysocki, Michael Buesch, Dan Hecht
On Wed, Sep 03, 2008 at 09:53:35PM -0700, Linus Torvalds wrote:
>
>
> On Thu, 4 Sep 2008, Willy Tarreau wrote:
> >
> > Basically, I would do this :
> >
> > pit1 = readpit();
> > while (readpit() == pit1);
> > t1 = rdtsc(); // precise beginning of tick 0
> > while (readpit() != pit1 - 5000);
> > t2 = rdtsc(); // precise beginning of tick 5000
>
> There's a few caveats here:
>
> - the "readpit()" has to read without actually latching the value
>
> latching the PIT value will stop counting.
>
> - and all the docs say that you have to be careful about reading the PIT
> without latching it because the two 8-bit accesses aren't atomic.
Ah yes you're right, I remember having been doing crappy stuff like re-reading
and checking for difference bigger than 1.
> so the above will work in practice, but there are dangers.
>
> The best way to fix most of the dangers is probably to only care about the
> *high* byte, so that it doesn't matter if the low byte doesn't match the
> high byte.
>
> So you could probably change your version to wait for 4096 cycles (a
> change of 16 in the high byte):
>
> static unsigned char read_pit_msb(void)
> {
> /* Read but throw away the LSB */
> inb(0x42);
> return inb(0x42);
> }
>
> ..
> /* PIT ch2: square wave, full 16-bit count */
> outb(0xb6, 0x43);
> outb(0, 0x42);
> outb(0, 0x42);
> ..
>
> unsigned char pit = read_pit_msb();
> /* Wait until the MSB changes */
> while (read_pit_msb() == pit1);
> t1 = rdtsc();
> while ((unsigned char) (pit - read_pit_msb()) < 9);
> t2 = rdtsc();
>
> and it might work out ok without explicit latching, and without having to
> worry about low/high bytes being out of sync.
I like this variation.
> > If someone wants to test this, I'd be interested in the number of
> > ticks required to get a good accuracy, I bet that even with a few
> > hundred ones it's already precise by a few ppm (about the precision
> > of the input clock in fact).
>
> I actually tested a patch with a counter value of just 1024, and I got the
> right answer.
>
> But if the busy loops aren't busy (due to MSI or virtualization), then all
> those things fly out the window.
100% agreed, though the problem is already the same with any calibration code,
with more or less sensitivity.
Willy
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Regression in 2.6.27 caused by commit bfc0f59
2008-09-02 15:09 ` Linus Torvalds
2008-09-02 18:14 ` Thomas Gleixner
@ 2008-09-05 13:45 ` Mark Lord
1 sibling, 0 replies; 56+ messages in thread
From: Mark Lord @ 2008-09-05 13:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: Thomas Gleixner, Larry Finger, LKML, Rafael J. Wysocki,
Alok Kataria, Michael Buesch
Linus Torvalds wrote:
>
> Additionally doing a min/max comparison to see that the loop is very
> _stable_ is of course also a way to validate things, but expecting _too_
> much stability may be wrong too. As mentioned, SMM events can happen for
> other reasons than emulation.
..
Not to mention that we still have the unsolved regression from when HPET
support first went in (2.6.2[01] timeframe), where machines randomly lock up
at "Switching to HPET" time during boot. Depending upon kernel .config,
phase of the moon, and apparently SMM events.
Hopefully this will finally get fixed with this work?
Cheers
^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2008-09-05 13:45 UTC | newest]
Thread overview: 56+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-08-31 22:54 Regression in 2.6.27 caused by commit bfc0f59 Larry Finger
2008-09-01 11:14 ` Thomas Gleixner
2008-09-01 15:37 ` Larry Finger
2008-09-01 17:49 ` Thomas Gleixner
2008-09-01 17:44 ` Larry Finger
2008-09-01 18:31 ` Thomas Gleixner
2008-09-01 19:10 ` Linus Torvalds
2008-09-01 20:07 ` Thomas Gleixner
2008-09-01 21:30 ` Thomas Gleixner
2008-09-01 22:02 ` Linus Torvalds
2008-09-01 22:33 ` Thomas Gleixner
2008-09-01 22:56 ` Linus Torvalds
2008-09-01 23:24 ` Thomas Gleixner
2008-09-02 6:37 ` Andi Kleen
2008-09-02 12:21 ` Thomas Gleixner
2008-09-01 22:16 ` Linus Torvalds
2008-09-01 23:16 ` Thomas Gleixner
2008-09-02 3:18 ` Linus Torvalds
2008-09-02 3:35 ` Linus Torvalds
2008-09-02 4:54 ` Larry Finger
2008-09-02 9:17 ` Alan Cox
2008-09-02 12:15 ` Thomas Gleixner
2008-09-02 15:09 ` Linus Torvalds
2008-09-02 18:14 ` Thomas Gleixner
2008-09-02 18:41 ` Alok Kataria
2008-09-02 21:16 ` Thomas Gleixner
2008-09-02 18:42 ` Linus Torvalds
2008-09-02 21:13 ` Thomas Gleixner
2008-09-02 22:21 ` Linus Torvalds
2008-09-02 23:10 ` Thomas Gleixner
2008-09-03 1:49 ` Linus Torvalds
2008-09-02 22:54 ` [PATCH] Fix TSC calibration issues Thomas Gleixner
2008-09-03 2:14 ` Linus Torvalds
2008-09-03 9:11 ` Thomas Gleixner
2008-09-04 1:14 ` Alok Kataria
2008-09-04 2:56 ` Linus Torvalds
2008-09-04 3:16 ` Arjan van de Ven
2008-09-04 3:59 ` Linus Torvalds
2008-09-04 4:10 ` Arjan van de Ven
2008-09-04 4:20 ` Linus Torvalds
2008-09-04 4:27 ` Arjan van de Ven
2008-09-04 4:25 ` Willy Tarreau
2008-09-04 4:53 ` Linus Torvalds
2008-09-04 5:09 ` Willy Tarreau
2008-09-04 1:18 ` [PATCH] Change warning message in TSC calibration Alok Kataria
2008-09-03 2:51 ` [PATCH] Fix TSC calibration issues Larry Finger
2008-09-03 4:00 ` Linus Torvalds
2008-09-03 4:34 ` Larry Finger
2008-09-05 13:45 ` Regression in 2.6.27 caused by commit bfc0f59 Mark Lord
2008-09-02 17:17 ` Bill Davidsen
2008-09-01 19:36 ` Larry Finger
2008-09-01 20:09 ` Thomas Gleixner
2008-09-01 20:23 ` Larry Finger
2008-09-01 20:45 ` Thomas Gleixner
2008-09-01 18:42 ` Linus Torvalds
2008-09-01 19:08 ` Thomas Gleixner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox