* Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
@ 2007-03-30 0:43 Darrick J. Wong
2007-03-30 1:06 ` Pallipadi, Venkatesh
0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2007-03-30 0:43 UTC (permalink / raw)
To: Pallipadi, Venkatesh; +Cc: linux-kernel
Hi Venki,
I have a dual-Woodcrest machine here with _PSD tables that specify that
cpufreq coordination between cores is done in hardware with
DOMAIN_COORD_TYPE_HW_ALL. On this particular machine, CPU 0 and CPU 2
are on the same package, and it looks like they have to be at the same
frequency.
However, it seems that acpi_cpufreq_cpu_init() only sets policy->cpus to
the shared cpu mask if software coordination is required. While this
does have the effect of letting the hardware do its coordination job as
advertised, it also means that a frequency change to CPU0 doesn't get
echoed to CPU2 as it should be, and affected_cpus is inaccurate.
This seems like a bug to me. I can whip up a patch to set the policy
cpu mask in all cases and neuter all but one of the MSR/PCT writes if HW
coordination is desired so that HW coordination is preserved and sysfs
is accurate, but I'm curious to know if I've gotten it right.
--D
* RE: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
2007-03-30 0:43 Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW? Darrick J. Wong
@ 2007-03-30 1:06 ` Pallipadi, Venkatesh
2007-06-01 18:43 ` Darrick J. Wong
0 siblings, 1 reply; 9+ messages in thread
From: Pallipadi, Venkatesh @ 2007-03-30 1:06 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-kernel
>-----Original Message-----
>From: Darrick J. Wong [mailto:djwong@us.ibm.com]
>Sent: Thursday, March 29, 2007 5:43 PM
>To: Pallipadi, Venkatesh
>Cc: linux-kernel@vger.kernel.org
>Subject: Dependent CPU core speed reporting not updated with
>CPUFREQ_SHARED_TYPE_HW?
>
>Hi Venki,
>
>I have a dual-Woodcrest machine here with _PSD tables that specify that
>cpufreq coordination between cores is done in hardware with
>DOMAIN_COORD_TYPE_HW_ALL. On this particular machine, CPU 0 and CPU 2
>are on the same package, and it looks like they have to be at the same
>frequency.
>
>However, it seems that acpi_cpufreq_cpu_init() only sets
>policy->cpus to
>the shared cpu mask if software coordination is required. While this
>does have the effect of letting the hardware do its coordination job as
>advertised, it also means that a frequency change to CPU0 doesn't get
>echoed to CPU2 as it should be, and affected_cpus is inaccurate.
>
>This seems like a bug to me. I can whip up a patch to set the policy
>cpu mask in all cases and neuter all but one of the MSR/PCT
>writes if HW
>coordination is desired so that HW coordination is preserved and sysfs
>is accurate, but I'm curious to know if I've gotten it right.
>
Darrick,
Your observation above is correct. affected_cpus and policy->cpus have
multiple entries only when software coordination is desired. If hardware
is doing the coordination, then setting policy->cpus adds another level
of sw coordination on top of the hardware coordination. The main reason
I chose the current way is that, from the kernel's perspective, making
decisions at each logical CPU is much better than doing software
coordination and making one CPU look at the utilization of different
CPUs and take decisions on their behalf, especially in tickless kinds
of situations.
Example: say CPU 0 and CPU 2 are sharing frequency with hw coordination,
and we keep the CPU 2 MSR at the lowest setting at all times and run the
ondemand policy on CPU 0 to control both CPUs. Now CPU 0 is idle and
CPU 2 gets some load. Ideally it would be better for CPU 2 to recognise
and handle it, rather than wait for CPU 0 to come out of idle and handle
this at a later point in time.
Having the policy per CPU and dealing with hw coordination locally also
simplifies CPU hotplug handling.
I agree that affected_cpus is lying in the case of hardware coordination.
I thought of making affected_cpus show the dependency in case of hw
coord while retaining the percpu control. But it seemed a complicated
change for something that is cosmetic.
Note that this issue is a problem with what we display as the current
freq. However, ondemand knows the current freq of each CPU correctly
even in this case, as it uses the measured_freq interface, which is
supported on all recent processors, to make frequency decisions.
Thanks,
Venki
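
The policy->cpus behaviour discussed in this exchange can be modelled in a few lines. This is a toy Python sketch, not the acpi-cpufreq source: the function name `policy_cpus` and the string constants are made up for illustration, and it assumes only what the thread states, namely that the policy receives the shared CPU mask under software coordination and stays per-CPU under hardware coordination.

```python
# Toy model of the policy->cpus choice described in this thread.
# All names here are illustrative; this is not kernel code.

SHARED_TYPE_NONE = "none"  # no coordination domain
SHARED_TYPE_HW = "hw"      # hardware coordinates (e.g. _PSD reports HW_ALL)
SHARED_TYPE_ALL = "all"    # software coordination, all CPUs are written
SHARED_TYPE_ANY = "any"    # software coordination, any one CPU is written


def policy_cpus(cpu, shared_type, shared_cpu_map):
    """Return the set of CPUs that one cpufreq policy would cover."""
    if shared_type in (SHARED_TYPE_ALL, SHARED_TYPE_ANY):
        # Software coordination: the policy spans the whole domain.
        return set(shared_cpu_map)
    # Hardware coordination or no domain: the policy stays per-CPU,
    # so affected_cpus would list only this CPU.
    return {cpu}


domain = {0, 2}
print(sorted(policy_cpus(0, SHARED_TYPE_ALL, domain)))
print(sorted(policy_cpus(0, SHARED_TYPE_HW, domain)))
```

In this model the hardware-coordinated case reports only the CPU itself, which is the under-reporting in affected_cpus that the thread is about.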
* Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
2007-03-30 1:06 ` Pallipadi, Venkatesh
@ 2007-06-01 18:43 ` Darrick J. Wong
2007-06-01 21:37 ` Andi Kleen
2007-06-02 1:59 ` Pallipadi, Venkatesh
0 siblings, 2 replies; 9+ messages in thread
From: Darrick J. Wong @ 2007-06-01 18:43 UTC (permalink / raw)
To: Pallipadi, Venkatesh; +Cc: linux-kernel
On Thu, Mar 29, 2007 at 06:06:22PM -0700, Pallipadi, Venkatesh wrote:
> thought of
> making affected CPUs show the dependency in case of hw coord, but
> retaining the percpu
> control. But, it seemed complicated change for something that is
> cosmetic.
Actually, it's not so cosmetic any more. Our newest servers have a
power meter that measures power consumption, and I'm writing a program
to measure the power cost of various cpufreq transitions in order to
enforce a power cap. Due to the under-reporting in affected_cpus, the
app thinks that (taking your example above) CPUs 0 and 2 can be
controlled independently. Thus, a p-state transition of (x, x) ->
(x, x-1) yields no energy saving at all, while (x, x-1) -> (x-1, x-1)
does. My program considers the effects of a single CPU's transition
independently of which CPU it is and without considering what
frequencies the other CPUs are operating at, which means that it will
conclude that the cost of increasing speed (or the reward for decreasing
it) is half of what it is ... sort of. It's mildly broken as a result,
though amusingly enough it still seems to work ok. I suspect that it
might flail around trying to hit a cap a bit more than it would if
affected_cpus were more accurate.
--D
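
The transition argument above can be checked with a toy model. The sketch below assumes that a hardware-coordinated domain runs every core at the highest frequency requested by any core in it; the function name and the kHz values are illustrative, and the model deliberately ignores idle-state effects.

```python
# Toy model: every core in a hardware-coordinated domain runs at the
# highest frequency requested by any core in that domain.

def effective_freqs(requests):
    """Map per-CPU requested frequencies to what the hardware grants."""
    granted = max(requests.values())
    return {cpu: granted for cpu in requests}


HIGH, LOW = 2_667_000, 2_000_000  # kHz, illustrative values only

both_high = effective_freqs({0: HIGH, 2: HIGH})    # (x, x)
one_lowered = effective_freqs({0: HIGH, 2: LOW})   # (x, x-1)
both_lowered = effective_freqs({0: LOW, 2: LOW})   # (x-1, x-1)

# Lowering only one request leaves the granted speeds unchanged.
print(both_high == one_lowered)
# Lowering the second request is what actually slows both cores.
print(both_lowered[0] == LOW and both_lowered[2] == LOW)
```

A program that scores each single-CPU transition independently would see no benefit from the first step and the whole benefit from the second, which matches the skew described in the message.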
* Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
2007-06-01 18:43 ` Darrick J. Wong
@ 2007-06-01 21:37 ` Andi Kleen
2007-06-01 22:39 ` Darrick J. Wong
2007-06-02 1:59 ` Pallipadi, Venkatesh
1 sibling, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2007-06-01 21:37 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Pallipadi, Venkatesh, linux-kernel
"Darrick J. Wong" <djwong@us.ibm.com> writes:
> On Thu, Mar 29, 2007 at 06:06:22PM -0700, Pallipadi, Venkatesh wrote:
> > thought of
> > making affected CPUs show the dependency in case of hw coord, but
> > retaining the percpu
> > control. But, it seemed complicated change for something that is
> > cosmetic.
>
> Actually, it's not so cosmetic any more. Our newest servers have a
> power meter that measures power consumption, and I'm writing a program
> to measure the power cost of various cpufreq transitions in order to
> enforce a power cap.
How would that work? You would adjust the power cap dynamically during
runtime based on the power meter feedback? How long would
the adjustment interval be?
> Due to the under-reporting in affected_cpus, the
> app thinks that (taking your example above) CPUs 0 and 2 can be
> controlled independently. Thus, a p-state transition of (x, x) ->
> (x, x-1) yields no energy saving at all, while (x, x-1) -> (x-1, x-1)
> does. My program considers the effects of a single CPU's transition
> independently of which CPU it is and without considering what
> frequencies the other CPUs are operating at, which means that it will
> conclude that the cost of increasing speed (or the reward for decreasing
> it) is half of what it is ... sort of. It's mildly broken as a result,
> though amusingly enough it still seems to work ok. I suspect that it
> might flail around trying to hit a cap a bit more than it would if
> affected_cpus were more accurate.
Not sure affected CPUs is accurate enough for your purposes anyways.
It cannot express "other core can be independent if I'm idle, otherwise not"
which is common on Intel systems.
-Andi
* Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
2007-06-01 21:37 ` Andi Kleen
@ 2007-06-01 22:39 ` Darrick J. Wong
0 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2007-06-01 22:39 UTC (permalink / raw)
To: Andi Kleen; +Cc: Pallipadi, Venkatesh, linux-kernel
On Fri, Jun 01, 2007 at 11:37:07PM +0200, Andi Kleen wrote:
> > On Thu, Mar 29, 2007 at 06:06:22PM -0700, Pallipadi, Venkatesh wrote:
> How would that work? You would adjust the power cap dynamically during
> runtime based on the power meter feedback? How long would
> the adjustment interval be?
Yep, I adjust scaling_max_freq as needed. The adjustment is
currently done once per minute, though I've noticed that the BMC power meter
itself can react in about 10-15 seconds. Incidentally, the ACPI battery
meter seems to react in about 2-5 seconds on my T40. I suspect that I
could lower that adjustment interval even further, though on the AMD box
(x3755) the power meter is slow to read under high loads.
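
One adjustment step of such a feedback loop might look like the sketch below. It is not the program described in this message: the function `next_max_freq`, the one-level-per-interval stepping rule, and the wattage figures are all assumptions made for illustration.

```python
def next_max_freq(available, current_max, measured_watts, cap_watts):
    """Pick the scaling_max_freq value for the next adjustment interval.

    available is the list of supported frequencies in kHz (any order).
    The cap moves down one level when the meter reads over the power
    cap, up one level when it reads under, and is clamped at both ends.
    """
    levels = sorted(available)
    i = levels.index(current_max)
    if measured_watts > cap_watts:
        i = max(i - 1, 0)
    elif measured_watts < cap_watts:
        i = min(i + 1, len(levels) - 1)
    return levels[i]


freqs = [2_000_000, 2_333_000, 2_667_000]  # kHz, illustrative
print(next_max_freq(freqs, 2_667_000, measured_watts=410.0, cap_watts=400.0))
```

A real loop would also have to allow for the meter's reaction time mentioned above before trusting a reading taken right after a change.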
> Not sure affected CPUs is accurate enough for your purposes anyways.
> It cannot express "other core can be independent if I'm idle, otherwise not"
> which is common on Intel systems.
Yep, this is true too. Right now I'm using CPU offlining as a clumsy
mechanism to force a CPU into idle state; even with the incorrect
assumption that affected_cpus applies to forced idleness, it seems to
work ok. We can end up losing more cores than we need to, but so far it
has always been the case that we don't offline cores until we've run out
of lower p-states on all cores. But I imagine with 80-core behemoths
on the way, I ought to fix this particular bug somehow. It will
probably involve adding a transition rule for each of the
non-lowest-numbered CPUs in a cpufreq domain, between 0 and whatever
speed the lowest-numbered CPU in that domain is running at.
--D
* RE: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
2007-06-01 18:43 ` Darrick J. Wong
2007-06-01 21:37 ` Andi Kleen
@ 2007-06-02 1:59 ` Pallipadi, Venkatesh
2007-06-02 6:43 ` Dave Jones
1 sibling, 1 reply; 9+ messages in thread
From: Pallipadi, Venkatesh @ 2007-06-02 1:59 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-kernel, Dave Jones
>-----Original Message-----
>From: Darrick J. Wong [mailto:djwong@us.ibm.com]
>Sent: Friday, June 01, 2007 11:44 AM
>To: Pallipadi, Venkatesh
>Cc: linux-kernel@vger.kernel.org
>Subject: Re: Dependent CPU core speed reporting not updated
>with CPUFREQ_SHARED_TYPE_HW?
>
>On Thu, Mar 29, 2007 at 06:06:22PM -0700, Pallipadi, Venkatesh wrote:
>> thought of
>> making affected CPUs show the dependency in case of hw coord, but
>> retaining the percpu
>> control. But, it seemed complicated change for something that is
>> cosmetic.
>
>Actually, it's not so cosmetic any more. Our newest servers have a
>power meter that measures power consumption, and I'm writing a program
>to measure the power cost of various cpufreq transitions in order to
>enforce a power cap. Due to the under-reporting in affected_cpus, the
>app thinks that (taking your example above) CPUs 0 and 2 can be
>controlled independently. Thus, a p-state transition of (x, x) ->
>(x, x-1) yields no energy saving at all, while (x, x-1) -> (x-1, x-1)
>does. My program considers the effects of a single CPU's transition
>independently of which CPU it is and without considering what
>frequencies the other CPUs are operating at, which means that it will
>conclude that the cost of increasing speed (or the reward for
>decreasing
>it) is half of what it is ... sort of. It's mildly broken as a result,
>though amusingly enough it still seems to work ok. I suspect that it
>might flail around trying to hit a cap a bit more than it would if
>affected_cpus were more accurate.
Hmmm. How about having a new cpufreq_sysfs entry to say that
these CPUs are frequency dependent in hardware?
affected_cpus today has a single cpufreq directory for all affected_cpus
and we coordinate all CPUs in software. To change freq, we have to
move among all affected_cpus and write an MSR.
Hardware coordination basically tells us that the kernel can control
frequency percpu, but underneath, hardware will pick the highest
requested freq among a group of CPUs. Instead of handling this case like
the existing software coordination case above, we can add a new entry in
cpufreq /sysfs denoting the hardware-coordinated CPU group.
Though it will be confusing with too many interfaces, I feel this is the
right way to go here.
Comments? Thoughts?
Thanks,
Venki
* Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
2007-06-02 1:59 ` Pallipadi, Venkatesh
@ 2007-06-02 6:43 ` Dave Jones
2007-06-02 14:19 ` Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW? Pallipadi, Venkatesh
0 siblings, 1 reply; 9+ messages in thread
From: Dave Jones @ 2007-06-02 6:43 UTC (permalink / raw)
To: Pallipadi, Venkatesh; +Cc: Darrick J. Wong, linux-kernel
On Fri, Jun 01, 2007 at 06:59:25PM -0700, Venki Pallipadi wrote:
> Hmmm. How about having a new cpufreq_sysfs entry to say
> these CPUs are frequency dependent in hardware.
Wait, wasn't this the entire purpose of affected_cpus in the first
place? So we could see which CPUs would be affected by a frequency
change? What went wrong here?
> affected_cpus today has a single cpufreq directory for all affected_cpus
> and we coordinate all CPUs in software. To change freq, we will have to
> move among all affected_cpus and write an MSR.
This I think is where the problem started: these remained
independent. Changing one should also affect the others that it
'affects'. Is that not the case?
> Hardware coordination basically tells us that kernel can control
> frequency
> percpu, but underneath hardware will pick highest requested freq among a
> group of CPUs. Instaed of handling this case as the existing software
> coordination case above, we can add a new entry in cpufreq /sysfs
> denoting
> hardware coordinated CPU group.
>
> Though it will be confusing with too many interfaces, I feel this is the
> right way to go about here.
If 'affected_cpus' doesn't do the right thing, I'd vote for making it
do so over adding more interfaces.
Dave
--
http://www.codemonkey.org.uk
* RE: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
2007-06-02 6:43 ` Dave Jones
@ 2007-06-02 14:19 ` Pallipadi, Venkatesh
2007-06-04 17:07 ` Darrick J. Wong
0 siblings, 1 reply; 9+ messages in thread
From: Pallipadi, Venkatesh @ 2007-06-02 14:19 UTC (permalink / raw)
To: Dave Jones; +Cc: Darrick J. Wong, linux-kernel
>-----Original Message-----
>From: Dave Jones [mailto:davej@redhat.com]
>Sent: Friday, June 01, 2007 11:43 PM
>To: Pallipadi, Venkatesh
>Cc: Darrick J. Wong; linux-kernel@vger.kernel.org
>Subject: Re: Dependent CPU core speed reporting not updated
>withCPUFREQ_SHARED_TYPE_HW?
>
>On Fri, Jun 01, 2007 at 06:59:25PM -0700, Venki Pallipadi wrote:
>
> > Hmmm. How about having a new cpufreq_sysfs entry to say
> > these CPUs are frequency dependent in hardware.
>
>Wait, wasn't this the entire purpose of affected_cpus in the first
>place? So we could see which CPUs would be affected by a frequency
>change? What went wrong here?
>
> > affected_cpus today has a single cpufreq directory for all
>affected_cpus
> > and we coordinate all CPUs in software. To change freq, we
>will have to
> > move among all affected_cpus and write an MSR.
>
>This I think is where the problem started. That these remained
>independant. Changing one should also affect the others that it
>'affects'. Is that not the case?
>
Yes. With the current affected_cpus, they are dependent from the user's
perspective: a single set of /sysfs files is linked to multiple CPUs.
But the kernel knows that these are multiple dependent CPUs, and while
changing freq, the driver typically goes from one CPU to the other using
set_cpus_allowed to write the MSR on all CPUs.
> > Hardware coordination basically tells us that kernel can control
> > frequency
> > percpu, but underneath hardware will pick highest requested
>freq among a
> > group of CPUs. Instaed of handling this case as the
>existing software
> > coordination case above, we can add a new entry in cpufreq /sysfs
> > denoting
> > hardware coordinated CPU group.
> >
> > Though it will be confusing with too many interfaces, I
>feel this is the
> > right way to go about here.
>
>If 'affected_cpus' doesn't do the right thing, I'd vote for making it
>do so over adding more interfaces.
>
The problem here is that with hardware coordination, the kernel need not
do what we do for affected_cpus today. The kernel can manage each CPU
independently in terms of setting freq, as the underlying hardware
guarantees to do the coordination (picking the highest freq
among a group of dependent CPUs). So ideally we can just manage CPU
frequencies as we do today without affected_cpus. But in this case
there is an FYI from hardware which says that even though the OS thinks
the CPUs are independent, hardware is doing the coordination across
these CPUs.
We cannot directly use affected_cpus for this. We could probably change
things to use affected_cpus in a way that enforces software coordination
on top of hardware coordination. But maintaining freq as in the current
affected_cpus may not be as optimal as doing a percpu policy and
decision.
Thanks,
Venki
* Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
2007-06-02 14:19 ` Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW? Pallipadi, Venkatesh
@ 2007-06-04 17:07 ` Darrick J. Wong
0 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2007-06-04 17:07 UTC (permalink / raw)
To: Pallipadi, Venkatesh; +Cc: Dave Jones, linux-kernel
On Sat, Jun 02, 2007 at 07:19:03AM -0700, Pallipadi, Venkatesh wrote:
> The problem here is that with hardware coordination, kernel need not
> do what we do for affected_cpus today. Kernel can manage each CPU
> independently in terms of setting freq as underlying hardware
> guarantees to do the coordination (picking up the highest freq
> among a group of dependent cpus). So ideally we can just manage cpu
> frequencies as we do today without affected_cpus. But, in this case
> there is a fyi from hardware which says even though OS is thinking that
> CPUs are independent, hardware is doing the coordination across these
> CPUs.
Yes, and ... it appears that (at least on Intel CPUs), writing
IA32_PERF_CTL on any core in the package causes the speed of all CPUs to
be set to the max of all CPU cores. Here's what I see when
reading/writing the performance control MSRs:
This is with all CPUs set to 2.6GHz. 0x198 = IA32_PERF_STATUS, 0x199 =
IA32_PERF_CTL.
root@elm3a188:/mnt/src/msr/msr-20060208# ./msr -n 0x198 0x199
CPU 0:
0x00000198 = 0x0828082806000828
0x00000199 = 0x0000000000000828
CPU 1:
0x00000198 = 0x0828082806000828
0x00000199 = 0x0000000000000828
Now, we set scaling_max_freq of CPU 0 to 2GHz:
root@elm3a188:/mnt/src/msr/msr-20060208# ./msr -n 0x198 0x199
CPU 0:
0x00000198 = 0x0828082806000828
0x00000199 = 0x000000000000061a
CPU 1:
0x00000198 = 0x0828082806000828
0x00000199 = 0x0000000000000828
And now likewise for CPU 1:
root@elm3a188:/mnt/src/msr/msr-20060208# ./msr -n 0x198 0x199
CPU 0:
0x00000198 = 0x082808280600061a
0x00000199 = 0x000000000000061a
CPU 1:
0x00000198 = 0x082808280600061a
0x00000199 = 0x000000000000061a
Notice that in the middle step we've written the slower speed to CPU 0's
control register, but the status register says that we're still running
at the higher speed. This seems to corroborate my finding that the power
use does not drop until _both_ cores are lowered. Unfortunately, in that
middle step we are reporting an incorrect frequency for CPU 0: sysfs says
2GHz but the hardware itself says 2.6. This is clearly a bad thing,
because I just set scaling_max_freq on CPU 0 to 2GHz, yet it is running
667MHz faster than it ought to be because CPU 1 wants to go faster.
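
The mismatch in the middle step can be spotted mechanically from the dumped values. The sketch below assumes, as the dumps suggest, that the low 16 bits of IA32_PERF_STATUS (0x198) hold the current P-state word in the same encoding that was written to IA32_PERF_CTL (0x199); the function names are illustrative.

```python
# Compare the requested P-state word (IA32_PERF_CTL) with the one the
# hardware reports (IA32_PERF_STATUS), using values from the dumps.
# Assumption: the low 16 bits of both registers use the same encoding.

def pstate_word(msr_value):
    """Extract the low 16 bits of a 64-bit MSR value."""
    return msr_value & 0xFFFF


def request_honoured(perf_status, perf_ctl):
    """True when the hardware reports the P-state that was requested."""
    return pstate_word(perf_status) == pstate_word(perf_ctl)


# Middle step: CPU 0 asked for 0x061a, CPU 1 still asks for 0x0828.
middle_cpu0 = request_honoured(0x0828082806000828, 0x000000000000061A)
middle_cpu1 = request_honoured(0x0828082806000828, 0x0000000000000828)
# Final step: both CPUs asked for 0x061a.
final_cpu0 = request_honoured(0x082808280600061A, 0x000000000000061A)

print(middle_cpu0, middle_cpu1, final_cpu0)
```

Only CPU 0 in the middle step shows a request that the hardware has not honoured, which is the case where sysfs and the hardware disagree.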
How about this: When hardware coordination is specified in _PSD,
scaling_{min,max}_freq between CPUs in a cpufreq domain are tied
together so that we can be sure that our caps are being followed.
Requests to change speed can be made as they always have been, but
afterwards the value of scaling_cur_freq for all CPUs in the cpufreq
domain will be determined by reading the speed value from hardware
since we can't really be sure how the hardware decided to coordinate
things anyway. When it becomes the case that individual cores on a
package can run at different speeds, we can drop the _PSD entries. Does
this scheme sound reasonable?
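
A toy model of this scheme is sketched below. The class `CpufreqDomain` and its methods are hypothetical, and the hardware read-back is simulated with the assumption that the highest request in the domain wins; real code would read the speed from the processor instead.

```python
class CpufreqDomain:
    """Hypothetical model of a hardware-coordinated cpufreq domain."""

    def __init__(self, cpus, max_freq):
        self.scaling_max = {cpu: max_freq for cpu in cpus}
        self.requested = {cpu: max_freq for cpu in cpus}

    def set_scaling_max(self, cpu, freq):
        """Set the cap via one CPU; it is applied to the whole domain."""
        if cpu not in self.scaling_max:
            raise KeyError(cpu)
        for member in self.scaling_max:
            self.scaling_max[member] = freq
            # No CPU may keep requesting more than the shared cap.
            self.requested[member] = min(self.requested[member], freq)

    def cur_freq(self, cpu):
        """Stand-in for reading the speed back from hardware."""
        if cpu not in self.requested:
            raise KeyError(cpu)
        return max(self.requested.values())


domain = CpufreqDomain(cpus=[0, 1], max_freq=2_667_000)  # kHz, illustrative
domain.set_scaling_max(0, 2_000_000)
print(domain.cur_freq(0), domain.cur_freq(1))
```

With the caps tied together, lowering the cap through CPU 0 alone is enough to bring the reported speed of both CPUs down, so the cap is actually enforced.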
We might, however, want another sysfs file to tell userspace what kind
of coordination is taking place.
--D