* Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
@ 2007-03-30 0:43 Darrick J. Wong
  2007-03-30 1:06 ` Pallipadi, Venkatesh
  0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2007-03-30 0:43 UTC
To: Pallipadi, Venkatesh; +Cc: linux-kernel

Hi Venki,

I have a dual-Woodcrest machine here with _PSD tables that specify that cpufreq coordination between cores is done in hardware with DOMAIN_COORD_TYPE_HW_ALL. On this particular machine, CPU 0 and CPU 2 are on the same package, and it looks like they have to be at the same frequency.

However, it seems that acpi_cpufreq_cpu_init() only sets policy->cpus to the shared cpu mask if software coordination is required. While this does have the effect of letting the hardware do its coordination job as advertised, it also means that a frequency change to CPU0 doesn't get echoed to CPU2 as it should be, and affected_cpus is inaccurate.

This seems like a bug to me. I can whip up a patch to set the policy cpu mask in all cases and neuter all but one of the MSR/PCT writes if HW coordination is desired so that HW coordination is preserved and sysfs is accurate, but I'm curious to know if I've gotten it right.

--D

* RE: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
  2007-03-30 0:43 Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW? Darrick J. Wong
@ 2007-03-30 1:06 ` Pallipadi, Venkatesh
  2007-06-01 18:43 ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Pallipadi, Venkatesh @ 2007-03-30 1:06 UTC
To: Darrick J. Wong; +Cc: linux-kernel

>-----Original Message-----
>From: Darrick J. Wong [mailto:djwong@us.ibm.com]
>Sent: Thursday, March 29, 2007 5:43 PM
>To: Pallipadi, Venkatesh
>Cc: linux-kernel@vger.kernel.org
>Subject: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
>
>Hi Venki,
>
>I have a dual-Woodcrest machine here with _PSD tables that specify that cpufreq coordination between cores is done in hardware with DOMAIN_COORD_TYPE_HW_ALL. On this particular machine, CPU 0 and CPU 2 are on the same package, and it looks like they have to be at the same frequency.
>
>However, it seems that acpi_cpufreq_cpu_init() only sets policy->cpus to the shared cpu mask if software coordination is required. While this does have the effect of letting the hardware do its coordination job as advertised, it also means that a frequency change to CPU0 doesn't get echoed to CPU2 as it should be, and affected_cpus is inaccurate.
>
>This seems like a bug to me. I can whip up a patch to set the policy cpu mask in all cases and neuter all but one of the MSR/PCT writes if HW coordination is desired so that HW coordination is preserved and sysfs is accurate, but I'm curious to know if I've gotten it right.
>

Darrick,

The above observation is correct. affected_cpus and policy->cpus have multiple entries only when software coordination is desired. If hardware is doing the coordination, then setting policy->cpus adds another level of sw coordination on top of hardware coordination.

The main reason I chose the current way is that, from the kernel's perspective, doing it at each logical CPU is much better than doing software coordination and making one CPU look at the utilization of different CPUs and take decisions on their behalf, especially in tickless kinds of situations.

Example: say CPU 0 and CPU 2 are sharing frequency with hw coordination, and we keep the CPU 2 MSR at the lowest setting at all times and run the ondemand policy on CPU 0 to control both CPUs. Now CPU 0 is idle and CPU 2 gets some load. Ideally it will be better for CPU 2 to recognise and handle it, rather than wait for CPU 0 to come out of idle and handle this at a later point in time.

Having the policy per CPU and dealing with hw coordination locally also simplifies CPU hotplug handling.

I agree that affected_cpus is lying in the case of hardware coordination. I thought of making affected CPUs show the dependency in case of hw coord, but retaining the percpu control. But, it seemed a complicated change for something that is cosmetic.

Note that this issue is a problem with what we display as the current freq. However, ondemand knows the current freq of each CPU correctly even in this case, as it uses the measured_freq interface, which is supported on all recent processors, to make frequency decisions.

Thanks,
Venki

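Venki's CPU 0 / CPU 2 example can be sketched as a toy model. This is only an illustration under stated assumptions, not kernel code: the two frequencies, the 80% load threshold, and the names `governor_request`, `per_cpu_policies`, and `single_policy_on_leader` are all hypothetical, and the governor is a crude stand-in for ondemand. The only behavior taken from the thread is that hardware coordination runs the group at the highest requested frequency.

```python
# Toy model of the CPU 0 / CPU 2 example; all values and names are hypothetical.
LOWEST_KHZ = 2000000
HIGHEST_KHZ = 2667000
BUSY_THRESHOLD = 0.8  # assumed cut-off, only loosely in the spirit of ondemand


def governor_request(load):
    """Crude stand-in for a governor: ask for top speed when busy."""
    return HIGHEST_KHZ if load >= BUSY_THRESHOLD else LOWEST_KHZ


def hw_coordinated(requests):
    """Hardware coordination: the group runs at the highest request."""
    return max(requests)


def per_cpu_policies(loads):
    """Each CPU runs its own governor and writes its own request."""
    return hw_coordinated([governor_request(load) for load in loads])


def single_policy_on_leader(loads, leader_awake, previous_request):
    """One governor on the first CPU decides for the whole group; the
    other CPUs keep their request pinned at the lowest speed."""
    if leader_awake:
        leader_request = governor_request(max(loads))
    else:
        # An idle, tickless leader does not re-evaluate; its last request stands.
        leader_request = previous_request
    others = [LOWEST_KHZ] * (len(loads) - 1)
    return hw_coordinated([leader_request] + others)


loads = [0.0, 0.95]  # CPU 0 idle, CPU 2 busy
print(per_cpu_policies(loads))
print(single_policy_on_leader(loads, leader_awake=False,
                              previous_request=LOWEST_KHZ))
```

With per-CPU policies the busy CPU raises the group speed by itself (the first line printed is 2667000). With a single policy on an idle leader the group stays at the last request (2000000) until the leader wakes up, which is the delay the message argues against.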
* Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
  2007-03-30 1:06 ` Pallipadi, Venkatesh
@ 2007-06-01 18:43 ` Darrick J. Wong
  2007-06-01 21:37 ` Andi Kleen
  2007-06-02 1:59 ` Pallipadi, Venkatesh
  0 siblings, 2 replies; 9+ messages in thread
From: Darrick J. Wong @ 2007-06-01 18:43 UTC
To: Pallipadi, Venkatesh; +Cc: linux-kernel

On Thu, Mar 29, 2007 at 06:06:22PM -0700, Pallipadi, Venkatesh wrote:

> thought of making affected CPUs show the dependency in case of hw coord, but retaining the percpu control. But, it seemed complicated change for something that is cosmetic.

Actually, it's not so cosmetic any more. Our newest servers have a power meter that measures power consumption, and I'm writing a program to measure the power cost of various cpufreq transitions in order to enforce a power cap. Due to the under-reporting in affected_cpus, the app thinks that (taking your example above) CPUs 0 and 2 can be controlled independently. Thus, a p-state transition of (x, x) -> (x, x-1) yields no energy saving at all, while (x, x-1) -> (x-1, x-1) does. My program considers the effects of a single CPU's transition independently of which CPU it is and without considering what frequencies the other CPUs are operating at, which means that it will conclude that the cost of increasing speed (or the reward for decreasing it) is half of what it is ... sort of. It's mildly broken as a result, though amusingly enough it still seems to work ok. I suspect that it might flail around trying to hit a cap a bit more than it would if affected_cpus were more accurate.

--D

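The two transitions mentioned above can be checked with a small model. This is a hedged sketch, not the program described in the message: it assumes, as stated elsewhere in the thread, that the package runs at the highest speed any of its cores requests, and it treats `x` as a speed level where `x-1` is one step slower. The function names are hypothetical.

```python
def package_pstate(requests):
    """Effective speed level of a hardware-coordinated package,
    assuming the hardware honors the highest request."""
    return max(requests)


def steps_that_change_speed(path):
    """For each consecutive pair of request tuples in `path`, report
    whether the effective package speed actually changes."""
    return [package_pstate(after) != package_pstate(before)
            for before, after in zip(path, path[1:])]


x = 3  # arbitrary speed level; x - 1 is one step slower
path = [(x, x), (x, x - 1), (x - 1, x - 1)]
print(steps_that_change_speed(path))  # [False, True]
```

Only the second step lowers the package speed, which matches the observation that power drops only once both cores have been lowered. A program that credits each single-CPU step with its own saving would misattribute the effect.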
* Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
  2007-06-01 18:43 ` Darrick J. Wong
@ 2007-06-01 21:37 ` Andi Kleen
  2007-06-01 22:39 ` Darrick J. Wong
  2007-06-02 1:59 ` Pallipadi, Venkatesh
  1 sibling, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2007-06-01 21:37 UTC
To: Darrick J. Wong; +Cc: Pallipadi, Venkatesh, linux-kernel

"Darrick J. Wong" <djwong@us.ibm.com> writes:

> On Thu, Mar 29, 2007 at 06:06:22PM -0700, Pallipadi, Venkatesh wrote:
> > thought of making affected CPUs show the dependency in case of hw coord, but retaining the percpu control. But, it seemed complicated change for something that is cosmetic.
>
> Actually, it's not so cosmetic any more. Our newest servers have a power meter that measures power consumption, and I'm writing a program to measure the power cost of various cpufreq transitions in order to enforce a power cap.

How would that work? You would adjust the power cap dynamically during runtime based on the power meter feedback? How long would the adjustment interval be?

> Due to the under-reporting in affected_cpus, the app thinks that (taking your example above) CPUs 0 and 2 can be controlled independently. Thus, a p-state transition of (x, x) -> (x, x-1) yields no energy saving at all, while (x, x-1) -> (x-1, x-1) does. My program considers the effects of a single CPU's transition independently of which CPU it is and without considering what frequencies the other CPUs are operating at, which means that it will conclude that the cost of increasing speed (or the reward for decreasing it) is half of what it is ... sort of. It's mildly broken as a result, though amusingly enough it still seems to work ok. I suspect that it might flail around trying to hit a cap a bit more than it would if affected_cpus were more accurate.

Not sure affected CPUs is accurate enough for your purposes anyways. It cannot express "other core can be independent if I'm idle, otherwise not" which is common on Intel systems.

-Andi

* Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
  2007-06-01 21:37 ` Andi Kleen
@ 2007-06-01 22:39 ` Darrick J. Wong
  0 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2007-06-01 22:39 UTC
To: Andi Kleen; +Cc: Pallipadi, Venkatesh, linux-kernel

On Fri, Jun 01, 2007 at 11:37:07PM +0200, Andi Kleen wrote:

> > On Thu, Mar 29, 2007 at 06:06:22PM -0700, Pallipadi, Venkatesh wrote:
> How would that work? You would adjust the power cap dynamically during runtime based on the power meter feedback? How long would the adjustment interval be?

Yep, I adjust scaling_max_freq as needed. The adjustment is currently done once per minute, though I've noticed that the BMC power meter itself can react in about 10-15 seconds. Incidentally, the ACPI battery meter seems to react in about 2-5 seconds on my T40. I suspect that I could lower that adjustment interval even further, though on the AMD box (x3755) the power meter is slow to read under high loads.

> Not sure affected CPUs is accurate enough for your purposes anyways. It cannot express "other core can be independent if I'm idle, otherwise not" which is common on Intel systems.

Yep, this is true too. Right now I'm using CPU offlining as a clumsy mechanism to force a CPU into idle state; even with the incorrect assumption that affected_cpus applies to forced idleness, it seems to work ok. We can end up losing more cores than we need to, but so far it has always been the case that we don't offline cores until we've run out of lower p-states on all cores.

But I imagine with 80-core behemoths on the way, I ought to fix this particular bug somehow. It will probably involve adding a transition rule for each of the non-lowest-numbered CPUs in a cpufreq domain, between 0 and whatever speed the lowest-numbered CPU in that domain is running at.

--D

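For readers who want to see how a userspace tool might build its view of cpufreq domains from affected_cpus, here is a minimal sketch. It is not the program discussed in this thread. It assumes the usual sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/affected_cpus` holding a space-separated list of CPU numbers); the helper names and the four-CPU example layout are hypothetical.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse the space-separated CPU numbers found in affected_cpus."""
    return frozenset(int(token) for token in text.split())


def read_affected_cpus(cpu, root="/sys/devices/system/cpu"):
    """Read affected_cpus for one CPU. `root` is a parameter so the
    function can be pointed at a test directory."""
    path = Path(root) / f"cpu{cpu}" / "cpufreq" / "affected_cpus"
    return parse_cpu_list(path.read_text())


def group_domains(affected_by_cpu):
    """Collapse {cpu: affected set} into a sorted list of distinct domains."""
    domains = {frozenset(cpus) | {cpu} for cpu, cpus in affected_by_cpu.items()}
    return sorted(sorted(domain) for domain in domains)


# What the tool sees when hardware coordination is under-reported:
# every CPU looks like its own domain.
reported = {0: {0}, 1: {1}, 2: {2}, 3: {3}}
print(group_domains(reported))

# What it would see if the shared mask were exported (hypothetical layout
# with CPUs 0 and 2 in one package and CPUs 1 and 3 in the other).
shared = {0: {0, 2}, 2: {0, 2}, 1: {1, 3}, 3: {1, 3}}
print(group_domains(shared))
```

A power-capping loop would then apply one scaling_max_freq decision per domain rather than per CPU. As Andi points out above, even an accurate affected_cpus cannot express idle-dependent independence, so this grouping is only a first approximation.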
* RE: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
  2007-06-01 18:43 ` Darrick J. Wong
  2007-06-01 21:37 ` Andi Kleen
@ 2007-06-02 1:59 ` Pallipadi, Venkatesh
  2007-06-02 6:43 ` Dave Jones
  1 sibling, 1 reply; 9+ messages in thread
From: Pallipadi, Venkatesh @ 2007-06-02 1:59 UTC
To: Darrick J. Wong; +Cc: linux-kernel, Dave Jones

>-----Original Message-----
>From: Darrick J. Wong [mailto:djwong@us.ibm.com]
>Sent: Friday, June 01, 2007 11:44 AM
>To: Pallipadi, Venkatesh
>Cc: linux-kernel@vger.kernel.org
>Subject: Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
>
>On Thu, Mar 29, 2007 at 06:06:22PM -0700, Pallipadi, Venkatesh wrote:
>> thought of making affected CPUs show the dependency in case of hw coord, but retaining the percpu control. But, it seemed complicated change for something that is cosmetic.
>
>Actually, it's not so cosmetic any more. Our newest servers have a power meter that measures power consumption, and I'm writing a program to measure the power cost of various cpufreq transitions in order to enforce a power cap. Due to the under-reporting in affected_cpus, the app thinks that (taking your example above) CPUs 0 and 2 can be controlled independently. Thus, a p-state transition of (x, x) -> (x, x-1) yields no energy saving at all, while (x, x-1) -> (x-1, x-1) does. My program considers the effects of a single CPU's transition independently of which CPU it is and without considering what frequencies the other CPUs are operating at, which means that it will conclude that the cost of increasing speed (or the reward for decreasing it) is half of what it is ... sort of. It's mildly broken as a result, though amusingly enough it still seems to work ok. I suspect that it might flail around trying to hit a cap a bit more than it would if affected_cpus were more accurate.

Hmmm. How about having a new cpufreq sysfs entry to say these CPUs are frequency dependent in hardware?

With affected_cpus today, there is a single cpufreq directory for all the affected CPUs and we coordinate all of them in software. To change freq, we have to move among all affected_cpus and write an MSR on each.

Hardware coordination basically tells us that the kernel can control frequency percpu, but underneath, hardware will pick the highest requested freq among a group of CPUs. Instead of handling this case as the existing software coordination case above, we can add a new entry in cpufreq sysfs denoting the hardware coordinated CPU group.

Though it will be confusing with too many interfaces, I feel this is the right way to go about it here.

Comments? Thoughts?

Thanks,
Venki

* Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
  2007-06-02 1:59 ` Pallipadi, Venkatesh
@ 2007-06-02 6:43 ` Dave Jones
  2007-06-02 14:19 ` Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW? Pallipadi, Venkatesh
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Jones @ 2007-06-02 6:43 UTC
To: Pallipadi, Venkatesh; +Cc: Darrick J. Wong, linux-kernel

On Fri, Jun 01, 2007 at 06:59:25PM -0700, Venki Pallipadi wrote:

> Hmmm. How about having a new cpufreq_sysfs entry to say these CPUs are frequency dependent in hardware.

Wait, wasn't this the entire purpose of affected_cpus in the first place? So we could see which CPUs would be affected by a frequency change? What went wrong here?

> affected_cpus today has a single cpufreq directory for all affected_cpus and we coordinate all CPUs in software. To change freq, we will have to move among all affected_cpus and write an MSR.

This, I think, is where the problem started: that these remained independent. Changing one should also affect the others that it 'affects'. Is that not the case?

> Hardware coordination basically tells us that kernel can control frequency percpu, but underneath hardware will pick highest requested freq among a group of CPUs. Instead of handling this case as the existing software coordination case above, we can add a new entry in cpufreq /sysfs denoting hardware coordinated CPU group.
>
> Though it will be confusing with too many interfaces, I feel this is the right way to go about here.

If 'affected_cpus' doesn't do the right thing, I'd vote for making it do so over adding more interfaces.

Dave

--
http://www.codemonkey.org.uk

* RE: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
  2007-06-02 6:43 ` Dave Jones
@ 2007-06-02 14:19 ` Pallipadi, Venkatesh
  2007-06-04 17:07 ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Pallipadi, Venkatesh @ 2007-06-02 14:19 UTC
To: Dave Jones; +Cc: Darrick J. Wong, linux-kernel

>-----Original Message-----
>From: Dave Jones [mailto:davej@redhat.com]
>Sent: Friday, June 01, 2007 11:43 PM
>To: Pallipadi, Venkatesh
>Cc: Darrick J. Wong; linux-kernel@vger.kernel.org
>Subject: Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
>
>On Fri, Jun 01, 2007 at 06:59:25PM -0700, Venki Pallipadi wrote:
>
> > Hmmm. How about having a new cpufreq_sysfs entry to say these CPUs are frequency dependent in hardware.
>
>Wait, wasn't this the entire purpose of affected_cpus in the first place? So we could see which CPUs would be affected by a frequency change? What went wrong here?
>
> > affected_cpus today has a single cpufreq directory for all affected_cpus and we coordinate all CPUs in software. To change freq, we will have to move among all affected_cpus and write an MSR.
>
>This I think is where the problem started. That these remained independent. Changing one should also affect the others that it 'affects'. Is that not the case?
>

Yes. With the current affected_cpus, they are dependent from the user's perspective: a single set of sysfs files is linked in to multiple cpus. But the kernel knows that these are multiple dependent cpus, and while changing freq, the driver typically goes from one CPU to the other using set_cpus_allowed to write the MSR on all CPUs.

> > Hardware coordination basically tells us that kernel can control frequency percpu, but underneath hardware will pick highest requested freq among a group of CPUs. Instead of handling this case as the existing software coordination case above, we can add a new entry in cpufreq /sysfs denoting hardware coordinated CPU group.
> >
> > Though it will be confusing with too many interfaces, I feel this is the right way to go about here.
>
>If 'affected_cpus' doesn't do the right thing, I'd vote for making it do so over adding more interfaces.
>

The problem here is that with hardware coordination, the kernel need not do what we do for affected_cpus today. The kernel can manage each CPU independently in terms of setting freq, as the underlying hardware guarantees to do the coordination (picking the highest freq among a group of dependent cpus). So ideally we can just manage cpu frequencies as we do today without affected_cpus. But in this case there is an FYI from hardware which says that even though the OS thinks the CPUs are independent, hardware is doing the coordination across these CPUs.

We cannot directly use affected_cpus for this. We can probably change to use affected_cpus in a way that enforces software coordination on top of hardware coordination. But maintaining freq as in the current affected_cpus may not be as optimal as doing a percpu policy and decision.

Thanks,
Venki

* Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?
  2007-06-02 14:19 ` Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW? Pallipadi, Venkatesh
@ 2007-06-04 17:07 ` Darrick J. Wong
  0 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2007-06-04 17:07 UTC
To: Pallipadi, Venkatesh; +Cc: Dave Jones, linux-kernel

On Sat, Jun 02, 2007 at 07:19:03AM -0700, Pallipadi, Venkatesh wrote:

> The problem here is that with hardware coordination, kernel need not do what we do for affected_cpus today. Kernel can manage each CPU independently in terms of setting freq as underlying hardware guarantees to do the coordination (picking up the highest freq among a group of dependent cpus). So ideally we can just manage cpu frequencies as we do today without affected_cpus. But, in this case there is a fyi from hardware which says even though OS is thinking that CPUs are independent, hardware is doing the coordination across these CPUs.

Yes, and ... it appears that (at least on Intel CPUs) writing IA32_PERF_CTL on any core in the package causes the speed of all cores in the package to be set to the highest speed requested by any of them. Here's what I see when reading/writing the performance control MSRs:

This is with all CPUs set to 2.6GHz. 0x198 = IA32_PERF_STATUS, 0x199 = IA32_PERF_CTL.

root@elm3a188:/mnt/src/msr/msr-20060208# ./msr -n 0x198 0x199
CPU 0:
0x00000198 = 0x0828082806000828
0x00000199 = 0x0000000000000828
CPU 1:
0x00000198 = 0x0828082806000828
0x00000199 = 0x0000000000000828

Now, we set scaling_max_freq of CPU 0 to 2GHz:

root@elm3a188:/mnt/src/msr/msr-20060208# ./msr -n 0x198 0x199
CPU 0:
0x00000198 = 0x0828082806000828
0x00000199 = 0x000000000000061a
CPU 1:
0x00000198 = 0x0828082806000828
0x00000199 = 0x0000000000000828

And now likewise for CPU 1:

root@elm3a188:/mnt/src/msr/msr-20060208# ./msr -n 0x198 0x199
CPU 0:
0x00000198 = 0x082808280600061a
0x00000199 = 0x000000000000061a
CPU 1:
0x00000198 = 0x082808280600061a
0x00000199 = 0x000000000000061a

Notice that in the middle step we've written the slower speed to CPU 0's control register, but the status register says that we're still running at the higher speed. This seems to corroborate my finding that the power use does not drop until _both_ cores are lowered. Unfortunately, in that middle step we are reporting an incorrect frequency for CPU 0--sysfs says 2GHz but the hardware itself says 2.6GHz. This is clearly a bad thing, because I just set scaling_max_freq on CPU0 to 2GHz, yet it is running 667MHz faster than it ought to be because CPU1 wants to go faster.

How about this: When hardware coordination is specified in _PSD, scaling_{min,max}_freq between CPUs in a cpufreq domain are tied together so that we can be sure that our caps are being followed. Requests to change speed can be done as they always have, but afterwards the value of scaling_cur_freq for all CPUs in the cpufreq domain will be determined by reading the speed value from hardware, since we can't really be sure how the hardware decided to coordinate things anyway. When it becomes the case that individual cores on a package can run at different speeds, we can drop the _PSD entries.

Does this scheme sound reasonable? We might, however, want another sysfs file to tell userspace what kind of coordination is taking place.

--D

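The register values in the message above can be decoded with a few lines of Python. This is a hedged sketch under two assumptions that are not stated in the thread: that the low 16 bits of both MSRs hold the P-state with the bus ratio in bits 15:8 and the voltage ID in bits 7:0, and that the bus clock is about 333.33 MHz. Both were chosen because they reproduce the 2.6GHz and 2GHz figures quoted in the message. The function names are hypothetical.

```python
BUS_MHZ = 1000 / 3  # assumed bus clock (about 333.33 MHz), not from the thread


def decode_pstate(msr_value):
    """Split the low 16 bits into (bus ratio, voltage ID), assuming the
    ratio sits in bits 15:8 and the voltage ID in bits 7:0."""
    low = msr_value & 0xFFFF
    return (low >> 8) & 0xFF, low & 0xFF


def approx_mhz(msr_value, bus_mhz=BUS_MHZ):
    """Approximate core clock implied by the ratio field."""
    ratio, _vid = decode_pstate(msr_value)
    return ratio * bus_mhz


# Middle step from the message: CPU 0 requested the slower state
# (IA32_PERF_CTL) while the hardware still reported the faster one
# (IA32_PERF_STATUS).
requested = 0x000000000000061A
reported = 0x0828082806000828

print(decode_pstate(requested), round(approx_mhz(requested)))
print(decode_pstate(reported), round(approx_mhz(reported)))
```

Under these assumptions the request decodes to ratio 6 (about 2000 MHz) and the status to ratio 8 (about 2667 MHz), consistent with the 667MHz gap described in the message. Reading the status register this way is roughly what the proposal to derive scaling_cur_freq from hardware would amount to.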
end of thread

Thread overview: 9+ messages
2007-03-30 0:43 Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW? Darrick J. Wong
2007-03-30 1:06 ` Pallipadi, Venkatesh
2007-06-01 18:43   ` Darrick J. Wong
2007-06-01 21:37     ` Andi Kleen
2007-06-01 22:39       ` Darrick J. Wong
2007-06-02 1:59     ` Pallipadi, Venkatesh
2007-06-02 6:43       ` Dave Jones
2007-06-02 14:19         ` Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW? Pallipadi, Venkatesh
2007-06-04 17:07           ` Darrick J. Wong