linux-pm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* cpupower reports uninitialized values for offline cpus
@ 2015-09-29 18:58 Jacob Tanenbaum
  2015-09-29 20:25 ` Prarit Bhargava
  0 siblings, 1 reply; 6+ messages in thread
From: Jacob Tanenbaum @ 2015-09-29 18:58 UTC (permalink / raw)
  To: linux-pm; +Cc: trenn, Prarit Bhargava

Hi guys,

I have found a bug in the cpupower tool. In the most recent pull of 
linus's tree cpupower reports bogus information for offlined cpus.

[root@hp-dl980g7-02 linux]# uname -a
Linux hp-dl980g7-02.rhts.eng.bos.redhat.com 4.3.0-rc3+ #1 SMP Tue Sep 29 
12:03:15 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@hp-dl980g7-02 linux]#  echo 0 > /sys/devices/system/cpu/cpu150/online
[root@hp-dl980g7-02 linux]#  echo 0 > /sys/devices/system/cpu/cpu1/online
[root@hp-dl980g7-02 linux]#  echo 0 > /sys/devices/system/cpu/cpu140/online
[root@hp-dl980g7-02 linux]# cpupower monitor
               |Nehalem                    || Mperf              || 
Idle_Stats
PKG |CORE|CPU | C3   | C6   | PC3  | PC6  || C0   | Cx   | Freq || POLL 
| C1-N | C1E- | C3-N | C6-N
    0|   0|   0|  0.00| 99.24|  0.00|  0.00||  0.38| 99.62|  1596|| 
0.00|  0.00|  0.00|  0.00| 99.92
    0|   0|  80|  0.00| 99.25|  0.00|  0.00||  0.28| 99.72|  1681|| 
0.00|  0.00|  0.00|  0.00| 99.97
    0|   1|  81|  0.00| 99.47|  0.00|  0.00||  0.27| 99.73|  1711|| 
0.00|  0.00|  0.00|  0.00| 99.94
...
    7|   7| 157|  0.00| 99.47|  0.00|  0.00||  0.25| 99.75|  1735|| 
0.00|  0.00|  0.00|  0.00| 99.98
    7|   8|  78|  0.00| 99.48|  0.00|  0.00||  0.26| 99.74|  1741|| 
0.00|  0.00|  0.00|  0.00| 99.98
    7|   8| 158|  0.00| 99.47|  0.00|  0.00||  0.25| 99.75|  1726|| 
0.00|  0.00|  0.00|  0.00| 99.98
    7|   9|  79|  0.00| 99.57|  0.00|  0.00||  0.23| 99.77|  1745|| 
0.00|  0.00|  0.00|  0.00| 99.96
5472|   0|   1|******|******|******|******||******|******|******|| 
0.00|  0.00|  0.00|  0.00|  0.00 *is offline
10567|   0| 159|******|******|******|******||******|******|******|| 
0.00|  0.00|  0.00|  0.00|  0.00 *is offline
1661206560|859272560| 
150|******|******|******|******||******|******|******||  0.00| 0.00|  
0.00|  0.00|  0.00 *is offline
1661206560|943093104| 
140|******|******|******|******||******|******|******||  0.00| 0.00|  
0.00|  0.00|  0.00 *is offline


Also the number of sockets in cpu->pkgs from get_cpu_topology will be wrong.

This is because when a cpu is downed the topology directory 
/sys/devices/system/cpu/cpu[num]/ is removed and get_cpu_topology
relies on the files there to get the correct information. Patch 
404c2db6... has the loop skip acquiring the data for the offline
cpus but that causes the values to remain uninitialized. Because of the 
way cpu_top->pkgs is calculated each uninitialized and
unique cpu_top->core_info[cpu].core will erroneously increase the 
cpu_top->pkgs value.

I want to fix this bug I am just not sure how to proceed, I see three 
options...

   1. Change the code to not report any information on offline cpus
   2. When printing the results check if the cpu is offline and if it is 
obscure the incorrect information and change the cpu->pkg
      value to reflect all sockets with at least one online cpu.
   3. Change the sysfs to retain topology data for offlined cpus, may 
require adding an additional state to reflect the difference
      between a processor that has been brought down via software and a 
processor that has been removed from the machine.

I would like to submit a patch to fix this bug and not have one 
presented to me to test.

Thanks for the direction,
Jake T.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: cpupower reports uninitialized values for offline cpus
  2015-09-29 18:58 cpupower reports uninitialized values for offline cpus Jacob Tanenbaum
@ 2015-09-29 20:25 ` Prarit Bhargava
  2015-09-30 16:23   ` Thomas Renninger
  0 siblings, 1 reply; 6+ messages in thread
From: Prarit Bhargava @ 2015-09-29 20:25 UTC (permalink / raw)
  To: Jacob Tanenbaum, linux-pm; +Cc: trenn



On 09/29/2015 02:58 PM, Jacob Tanenbaum wrote:
> Hi guys,
> 
> I have found a bug in the cpupower tool. In the most recent pull of linus's tree
> cpupower reports bogus information for offlined cpus.
> 
> [root@hp-dl980g7-02 linux]# uname -a
> Linux hp-dl980g7-02.rhts.eng.bos.redhat.com 4.3.0-rc3+ #1 SMP Tue Sep 29
> 12:03:15 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
> [root@hp-dl980g7-02 linux]#  echo 0 > /sys/devices/system/cpu/cpu150/online
> [root@hp-dl980g7-02 linux]#  echo 0 > /sys/devices/system/cpu/cpu1/online
> [root@hp-dl980g7-02 linux]#  echo 0 > /sys/devices/system/cpu/cpu140/online
> [root@hp-dl980g7-02 linux]# cpupower monitor
>               |Nehalem                    || Mperf              || Idle_Stats
> PKG |CORE|CPU | C3   | C6   | PC3  | PC6  || C0   | Cx   | Freq || POLL | C1-N |
> C1E- | C3-N | C6-N
>    0|   0|   0|  0.00| 99.24|  0.00|  0.00||  0.38| 99.62|  1596|| 0.00|  0.00| 
> 0.00|  0.00| 99.92
>    0|   0|  80|  0.00| 99.25|  0.00|  0.00||  0.28| 99.72|  1681|| 0.00|  0.00| 
> 0.00|  0.00| 99.97
>    0|   1|  81|  0.00| 99.47|  0.00|  0.00||  0.27| 99.73|  1711|| 0.00|  0.00| 
> 0.00|  0.00| 99.94
> ...
>    7|   7| 157|  0.00| 99.47|  0.00|  0.00||  0.25| 99.75|  1735|| 0.00|  0.00| 
> 0.00|  0.00| 99.98
>    7|   8|  78|  0.00| 99.48|  0.00|  0.00||  0.26| 99.74|  1741|| 0.00|  0.00| 
> 0.00|  0.00| 99.98
>    7|   8| 158|  0.00| 99.47|  0.00|  0.00||  0.25| 99.75|  1726|| 0.00|  0.00| 
> 0.00|  0.00| 99.98
>    7|   9|  79|  0.00| 99.57|  0.00|  0.00||  0.23| 99.77|  1745|| 0.00|  0.00| 
> 0.00|  0.00| 99.96
> 5472|   0|   1|******|******|******|******||******|******|******|| 0.00|  0.00| 
> 0.00|  0.00|  0.00 *is offline
> 10567|   0| 159|******|******|******|******||******|******|******|| 0.00| 
> 0.00|  0.00|  0.00|  0.00 *is offline
> 1661206560|859272560| 150|******|******|******|******||******|******|******|| 
> 0.00| 0.00|  0.00|  0.00|  0.00 *is offline
> 1661206560|943093104| 140|******|******|******|******||******|******|******|| 
> 0.00| 0.00|  0.00|  0.00|  0.00 *is offline
> 
> 
> Also the number of sockets in cpu->pkgs from get_cpu_topology will be wrong.
> 
> This is because when a cpu is downed the topology directory
> /sys/devices/system/cpu/cpu[num]/ is removed and get_cpu_topology
> relies on the files there to get the correct information. Patch 404c2db6... has
> the loop skip acquiring the data for the offline
> cpus but that causes the values to remain uninitialized. Because of the way
> cpu_top->pkgs is calculated each uninitialized and
> unique cpu_top->core_info[cpu].core will erroneously increase the cpu_top->pkgs
> value.
> 
> I want to fix this bug I am just not sure how to proceed, I see three options...
> 
>   1. Change the code to not report any information on offline cpus

Thomas,

I'm not sure I see any benefit in reporting the offline cpu status here.  We
could I suppose output a message at the beginning of the output to indicate that
some cpus have been hotplugged, but I'm not sure that is even necessary.

I would prefer changing cpupower to reporting only online CPUs, but there's
probably some situation I'm unaware of that we need to report offline cpu status.

>   2. When printing the results check if the cpu is offline and if it is obscure
> the incorrect information and change the cpu->pkg
>      value to reflect all sockets with at least one online cpu.

This is papering over the issue IMO, but still an option.

>   3. Change the sysfs to retain topology data for offlined cpus, may require
> adding an additional state to reflect the difference
>      between a processor that has been brought down via software and a processor
> that has been removed from the machine.
> 

Yeah, there's a problem here in which the topology files come and go when a CPU
is online'd.  The kernel's CPU hotplug mechanism makes no distinction if the CPU
is being physically hot added or soft onlined so that makes the "lifetime" of
the files in the topology directory difficult to determine.

> I would like to submit a patch to fix this bug and not have one presented to me
> to test.

Yep -- I think that's a good idea :)  More the merrier and all that ...

P.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: cpupower reports uninitialized values for offline cpus
  2015-09-29 20:25 ` Prarit Bhargava
@ 2015-09-30 16:23   ` Thomas Renninger
  2015-09-30 18:25     ` Prarit Bhargava
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Renninger @ 2015-09-30 16:23 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: Jacob Tanenbaum, linux-pm

On Tuesday, September 29, 2015 04:25:47 PM Prarit Bhargava wrote:
> On 09/29/2015 02:58 PM, Jacob Tanenbaum wrote:
> > Hi guys,
> > 
> > I have found a bug in the cpupower tool. In the most recent pull of
> > linus's tree cpupower reports bogus information for offlined cpus.
...
> Thomas,
> 
> I'm not sure I see any benefit in reporting the offline cpu status here.
There have been discussions to use CPU offlining as a power saving mechanism.
This is afaik no real option on x86, at least not too frequent, but it could
be on other archs?
So the info would be nice to have...

> We
> could I suppose output a message at the beginning of the output to indicate
> that some cpus have been hotplugged, but I'm not sure that is even
> necessary.
Yep, just that people do not complain that cores are missing...
May make sense.
 
> I would prefer changing cpupower to reporting only online CPUs, but there's
> probably some situation I'm unaware of that we need to report offline cpu
> status.
> >   2. When printing the results check if the cpu is offline and if it is
> >   obscure> 
> > the incorrect information and change the cpu->pkg
> >      value to reflect all sockets with at least one online cpu.

Hm, not sure how difficult to implement that is right now, but IMO info should
be set to -1 if the files do not exist or initialized with -1.
In this case there is no sane info to show and the cores need not to show up.

> 
> This is papering over the issue IMO, but still an option.
> 
> >   3. Change the sysfs to retain topology data for offlined cpus, may
> >   require

I had a quick look. I think this could be done.
/sys/devices/system/cpu/cpu*/cpuidle
also is kept, not sure why, because I cannot see any advantage to access
this info on an offlined CPU.

But reading the topology of offlined CPUs makes sense.
Argh, but yes. For real on-/offlining this is hard...

    Thomas

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: cpupower reports uninitialized values for offline cpus
  2015-09-30 16:23   ` Thomas Renninger
@ 2015-09-30 18:25     ` Prarit Bhargava
  2015-10-01 12:29       ` Thomas Renninger
  0 siblings, 1 reply; 6+ messages in thread
From: Prarit Bhargava @ 2015-09-30 18:25 UTC (permalink / raw)
  To: Thomas Renninger; +Cc: Jacob Tanenbaum, linux-pm



On 09/30/2015 12:23 PM, Thomas Renninger wrote:
> On Tuesday, September 29, 2015 04:25:47 PM Prarit Bhargava wrote:
>> On 09/29/2015 02:58 PM, Jacob Tanenbaum wrote:
>>> Hi guys,
>>>
>>> I have found a bug in the cpupower tool. In the most recent pull of
>>> linus's tree cpupower reports bogus information for offlined cpus.
> ...
>> Thomas,
>>
>> I'm not sure I see any benefit in reporting the offline cpu status here.
> There have been discussions to use CPU offlining as a power saving mechanism.

Yeah, I think in the past that this suggestion had been made but ISTR some
question as to exactly how much power would be saved in a "soft-remove" scenario
for a core & socket, and if there is any real benefit.

> This is afaik no real option on x86, at least not too frequent, but it could
> be on other archs?
> So the info would be nice to have...
> 
>> We
>> could I suppose output a message at the beginning of the output to indicate
>> that some cpus have been hotplugged, but I'm not sure that is even
>> necessary.
> Yep, just that people do not complain that cores are missing...
> May make sense.

Ok -- let's go with that for now, post, and see if anyone complains loudly about
that "loss" of info.  Jake, are you comfortable with that?

>>
>>>   3. Change the sysfs to retain topology data for offlined cpus, may
>>>   require
> 
> I had a quick look. I think this could be done.
> /sys/devices/system/cpu/cpu*/cpuidle
> also is kept, not sure why, because I cannot see any advantage to access
> this info on an offlined CPU.
> 

I think I see the problem.  When the intel_idle driver is loaded it registers a
cpu hotplug notifier, cpu_hotplug_notify(), which registers (and creates the
sysfs cpuidle dir) as each cpu is created.  This is done via a call to
intel_idle_cpu_init(), which calls cpuidle_register_device().

The reverse isn't true, however, and AFAICT, _all_ the cpuidle directories are
only destroyed when the driver is unloaded in intel_idle_exit().

I think that's a bug too FWIW, and I agree -- there isn't any point in having
that data lying around.

> But reading the topology of offlined CPUs makes sense.
> Argh, but yes. For real on-/offlining this is hard...

Yeah, we'd have to add an additional state to the CPU hotplug callback state
machine (CPU_ADD_PHYSICAL?) to identify when a cpu was hot-added.  I suppose it
could be done but it seems like an awful lot of churn for little gain.

P.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: cpupower reports uninitialized values for offline cpus
  2015-09-30 18:25     ` Prarit Bhargava
@ 2015-10-01 12:29       ` Thomas Renninger
  2015-10-01 13:11         ` Prarit Bhargava
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Renninger @ 2015-10-01 12:29 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: Jacob Tanenbaum, linux-pm

On Wednesday, September 30, 2015 02:25:47 PM Prarit Bhargava wrote:
> On 09/30/2015 12:23 PM, Thomas Renninger wrote:
...
> Yeah, we'd have to add an additional state to the CPU hotplug callback state
> machine (CPU_ADD_PHYSICAL?) to identify when a cpu was hot-added.  I
> suppose it could be done but it seems like an awful lot of churn for little
> gain.

It shouldn't be that hard.
If I find some time I give it a try.
This is a rather central piece of kernel code and affects
more or less all architectures, so getting this finally mainline
may take a while...

So it would be nice if somewhen in the future we have 2 independent
fixes:

The cpupower one as you wrote before, ignoring cores without
topology sysfs info (but not simply checking for "is core offline"),
better check whether the directory or a specific file exists.

And a kernel one which still shows topology info of soft offlined
cores which with above cpupower solution will result in previous
output (offlined cores listed, but with valid topology info).

Makes sense?

      Thomas

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: cpupower reports uninitialized values for offline cpus
  2015-10-01 12:29       ` Thomas Renninger
@ 2015-10-01 13:11         ` Prarit Bhargava
  0 siblings, 0 replies; 6+ messages in thread
From: Prarit Bhargava @ 2015-10-01 13:11 UTC (permalink / raw)
  To: Thomas Renninger; +Cc: Jacob Tanenbaum, linux-pm



On 10/01/2015 08:29 AM, Thomas Renninger wrote:
> On Wednesday, September 30, 2015 02:25:47 PM Prarit Bhargava wrote:
>> On 09/30/2015 12:23 PM, Thomas Renninger wrote:
> ...
>> Yeah, we'd have to add an additional state to the CPU hotplug callback state
>> machine (CPU_ADD_PHYSICAL?) to identify when a cpu was hot-added.  I
>> suppose it could be done but it seems like an awful lot of churn for little
>> gain.
> 
> It shouldn't be that hard.
> If I find some time I give it a try.
> This is a rather central piece of kernel code and affects
> more or less all architectures, so getting this finally mainline
> may take a while...
> 
> So it would be nice if somewhen in the future we have 2 independent
> fixes:
> 
> The cpupower one as you wrote before, ignoring cores without
> topology sysfs info (but not simply checking for "is core offline"),
> better check whether the directory or a specific file exists.
> 
> And a kernel one which still shows topology info of soft offlined
> cores which with above cpupower solution will result in previous
> output (offlined cores listed, but with valid topology info).
> 
> Makes sense?

Yup, makes sense.  Jacob (Jake) is going to handle the cpupower stuff right
away.  I'll dive into the topology stuff in a little bit and/or assist Jake with it.

Thanks Thomas :)

P.

> 
>       Thomas
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2015-10-01 13:11 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-29 18:58 cpupower reports uninitialized values for offline cpus Jacob Tanenbaum
2015-09-29 20:25 ` Prarit Bhargava
2015-09-30 16:23   ` Thomas Renninger
2015-09-30 18:25     ` Prarit Bhargava
2015-10-01 12:29       ` Thomas Renninger
2015-10-01 13:11         ` Prarit Bhargava

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).