public inbox for linux-kernel@vger.kernel.org
* Re: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
       [not found] <20120523.144057.899060240318474097.anders@netinsight.net>
@ 2012-05-23 21:53 ` Jonathan Nieder
  2012-05-24 21:45   ` Jonathan Nieder
  0 siblings, 1 reply; 8+ messages in thread
From: Jonathan Nieder @ 2012-05-23 21:53 UTC (permalink / raw)
  To: Anders Boström
  Cc: linux-kernel, Lesław Kopeć, Aman Gupta, Doug Smythies

Hi Anders,

Anders Boström wrote[1]:

> Starting with 3.2.17-1, the CPU load accounting is broken when the
> computer is idle. The CPU load is reported as >0.50 when
> idle. 3.2.16-1 doesn't suffer from this problem.
>
> Suspected patch is the upstream patch
> "sched: Fix nohz load accounting -- again!"
> commit 5e2d50da11f0e6ec3ce8fe658d7c83b0b4346c68 to 3.2 and
> originating from c308b56b5398779cd3da0f62ab26b0453494c3d4 .
>
> See also:
>
> https://bugs.launchpad.net/unity/+bug/991370
> https://lkml.org/lkml/2012/5/22/310
> https://bugzilla.redhat.com/show_bug.cgi?id=822877
> https://bbs.archlinux.org/viewtopic.php?id=141289

Thanks for writing.

If I understand correctly, the load average calculation both before
and after that commit is broken, in different ways.

I'm cc-ing Lesław Kopeć, Aman Gupta, and Doug Smythies who worked
on the above change[2].  I recommend pulling Thomas Gleixner
<tglx@linutronix.de> into the conversation once you have a better
idea of what's going on or a new change to recommend.  If you'd like
to also track this on a bugtracker, http://bugzilla.kernel.org/,
product Process Management, component Scheduler might be a good place.

Aside from that, I can't really offer much to help you, but others
on linux-kernel might.

Hope that helps,
Jonathan

[1] http://bugs.debian.org/674153
[2] http://thread.gmane.org/gmane.linux.kernel/1249223/focus=1262319
[3] http://thread.gmane.org/gmane.linux.kernel/1291870/focus=1292058


* Re: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
  2012-05-23 21:53 ` [3.2.16 -> 3.2.17 regression] High reported CPU load when idle Jonathan Nieder
@ 2012-05-24 21:45   ` Jonathan Nieder
  2012-05-30 14:30     ` Doug Smythies
  0 siblings, 1 reply; 8+ messages in thread
From: Jonathan Nieder @ 2012-05-24 21:45 UTC (permalink / raw)
  To: Anders Boström
  Cc: linux-kernel, Lesław Kopeć, Aman Gupta, Doug Smythies,
	Peter Zijlstra, Thomas Gleixner

(cc-ing Peter and Thomas because there is a nice graph)
> Anders Boström wrote[1]:

>> Starting with 3.2.17-1, the CPU load accounting is broken when the
>> computer is idle. The CPU load is reported as >0.50 when
>> idle. 3.2.16-1 doesn't suffer from this problem.
>>
>> Suspected patch is the upstream patch
>> "sched: Fix nohz load accounting -- again!"
>> commit 5e2d50da11f0e6ec3ce8fe658d7c83b0b4346c68 to 3.2 and
>> originating from c308b56b5398779cd3da0f62ab26b0453494c3d4 .
>>
>> See also:
>>
>> https://bugs.launchpad.net/unity/+bug/991370
>> https://lkml.org/lkml/2012/5/22/310
>> https://bugzilla.redhat.com/show_bug.cgi?id=822877
>> https://bbs.archlinux.org/viewtopic.php?id=141289

I just found [1] from [2] which seems to describe the symptoms pretty
well.  Peter, Thomas, advice?

Anders et al: does 556061b00c9f ("sched/nohz: Fix rq->cpu_load[]
calculations", 2012-05-11) change anything?

Thanks,
Jonathan

[1] https://launchpadlibrarian.net/105809696/commit_low_load_rev2.png
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/838811


* RE: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
  2012-05-24 21:45   ` Jonathan Nieder
@ 2012-05-30 14:30     ` Doug Smythies
  2012-05-30 14:54       ` Anders Boström
  2012-06-10 17:49       ` Jonathan Nieder
  0 siblings, 2 replies; 8+ messages in thread
From: Doug Smythies @ 2012-05-30 14:30 UTC (permalink / raw)
  To: 'Jonathan Nieder', 'Anders Boström'
  Cc: linux-kernel, 'Lesław Kopeć',
	'Aman Gupta', 'Peter Zijlstra',
	'Thomas Gleixner', Doug Smythies

Hi,

The referenced PNG file was sent to everyone on the address list on 2012.05.22 and the previous version was sent 2012.05.09.
The only reason the PNG file was made was for the e-mail and because I was instructed not to refer to external sources.
The web page version of the PNG file, which is kept up to date, is at [3].

"does 556061b00c9f ("sched/nohz: Fix rq->cpu_load[] calculations", 2012-05-11) change anything?"
I back edited those changes into my test environment yesterday. It made no difference with respect to this issue. (minimally tested.)

This statement: "Starting with 3.2.17-1, the CPU load accounting is broken when the computer is idle. The CPU load is reported as >0.50 when idle. 3.2.16-1 doesn't suffer from this problem."
has, in my opinion, the following mistakes:
. The computer is not actually idle. If it were actually idle, the reported load average would be 0.
. Yes, the load average reported by the new kernel is high, as detailed in the PNG file or the web notes.
. The older kernel suffers from a different problem: all other conditions being the same, its reported load average would have been too low.

[3] http://www.smythies.com/~doug/network/load_average/new.html

Doug Smythies


* Re: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
  2012-05-30 14:30     ` Doug Smythies
@ 2012-05-30 14:54       ` Anders Boström
  2012-06-05 15:35         ` Lesław Kopeć
  2012-06-10 17:49       ` Jonathan Nieder
  1 sibling, 1 reply; 8+ messages in thread
From: Anders Boström @ 2012-05-30 14:54 UTC (permalink / raw)
  To: dsmythies; +Cc: jrnieder, linux-kernel, leslaw.kopec, aman, a.p.zijlstra, tglx

>>>>> "DS" == Doug Smythies <dsmythies@telus.net> writes:

 DS> This statement: "Starting with 3.2.17-1, the CPU load accounting is broken when the computer is idle. The CPU load is reported as >0.50 when idle. 3.2.16-1 doesn't suffer from this problem."
 DS> has, in my opinion, the following mistakes:
 DS> . The computer is not actually idle. If it were actually idle, the reported load average would be 0.

Well, I tested in single user mode, with very few processes running,
mostly init, getty, bash and top (+ a lot of kernel threads). And
3.2.17 reported a load of >0.5 . Under the same conditions 3.2.16
typically reports 0.01 or 0.00 .

 DS> . Yes, the load average reported by the new kernel is high, as detailed in the PNG file or the web notes.
 DS> . The older kernel suffers from a different problem: all other conditions being the same, its reported load average would have been too low.

I don't know if 0.01 is *too* low, but it should be much closer to the
truth than >0.5.

/ Anders


* Re: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
  2012-05-30 14:54       ` Anders Boström
@ 2012-06-05 15:35         ` Lesław Kopeć
  2012-06-08 17:01           ` Doug Smythies
  0 siblings, 1 reply; 8+ messages in thread
From: Lesław Kopeć @ 2012-06-05 15:35 UTC (permalink / raw)
  To: Anders Boström
  Cc: dsmythies, jrnieder, linux-kernel, aman, a.p.zijlstra, tglx

On 05/30/2012 04:54 PM, Anders Boström wrote:

>  DS> This statement: "Starting with 3.2.17-1, the CPU load accounting is broken when the computer is idle. The CPU load is reported as >0.50 when idle. 3.2.16-1 doesn't suffer from this problem."
>  DS> has, in my opinion, the following mistakes:
>  DS> . The computer is not actually idle. If it were actually idle, the reported load average would be 0.
> 
> Well, I tested in single user mode, with very few processes running,
> mostly init, getty, bash and top (+ a lot of kernel threads). And
> 3.2.17 reported a load of >0.5 . Under the same conditions 3.2.16
> typically reports 0.01 or 0.00 .

I've tried to reproduce the problem, but haven't had much luck. I've
tested vanilla and Debian kernel versions 3.2.16 and 3.2.17. The load on
an idle or slightly busy system is essentially the same across all versions.

vanilla 3.2.16		0.15 0.07 0.06
vanilla 3.2.17		0.17 0.11 0.13
Debian 3.2.16-1		0.13 0.07 0.05
Debian 3.2.17-1		0.10 0.09 0.11

When the system is completely idle, the load drops to 0. I've also tried
3.2.17 with 556061b00c9f, but it makes no difference: compared with
plain 3.2.17, the load is the same even on a busy system.
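
For anyone wanting to collect comparable figures, here is a minimal sketch of reading the kernel's reported load averages. It assumes the three numbers in the table above are the 1, 5 and 15 minute averages as printed in /proc/loadavg; the helper names and the trailing fields of the sample line (1/123 4567) are made up for illustration.

```python
def parse_loadavg(text):
    """Parse one line in /proc/loadavg format into a dict.

    Format: three load averages, runnable/total scheduling entities,
    and the most recently allocated PID.
    """
    fields = text.split()
    one, five, fifteen = (float(x) for x in fields[:3])
    runnable, total = (int(x) for x in fields[3].split("/"))
    return {
        "1min": one,
        "5min": five,
        "15min": fifteen,
        "runnable": runnable,
        "total": total,
        "last_pid": int(fields[4]),
    }


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live values (Linux only)."""
    with open(path) as f:
        return parse_loadavg(f.read())


# Illustrative line; only the first three values come from the table above.
sample = parse_loadavg("0.15 0.07 0.06 1/123 4567\n")
print(sample["1min"], sample["5min"], sample["15min"])
```

Calling read_loadavg() every few seconds while the machine sits idle gives a log that can be compared across kernels.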

I can't explain why we're getting different results on the same kernels.
If you'd like more details just ask.

-- 
Lesław Kopeć




* RE: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
  2012-06-05 15:35         ` Lesław Kopeć
@ 2012-06-08 17:01           ` Doug Smythies
  0 siblings, 0 replies; 8+ messages in thread
From: Doug Smythies @ 2012-06-08 17:01 UTC (permalink / raw)
  To: 'Lesław Kopeć', 'Anders Boström'
  Cc: jrnieder, linux-kernel, aman, a.p.zijlstra, tglx, Doug Smythies

>>  On 2012.05.30 07:54, Anders Boström wrote:
>   On 2012.06.05 08:35, Lesław Kopeć wrote:

>> Well, I tested in single user mode, with very few processes running, 
>> mostly init, getty, bash and top (+ a lot of kernel threads). And
>> 3.2.17 reported a load of >0.5 . Under the same conditions 3.2.16 
>> typically reports 0.01 or 0.00 .

>> I don't know if 0.01 is *too* low, but it should be much closer to the
>> truth than >0.5.

I agree. However, the non-idle case also needs to be considered. For
example, for a real load of 5.70, a reported load average of 0 is much
further from the truth than the 5.6 being reported now.

> When the system is completely idle, the load drops to 0. I've also tried
> 3.2.17 with 556061b00c9f, but it makes no difference: compared with
> plain 3.2.17, the load is the same even on a busy system.

> I can't explain why we're getting different results on the same kernels.

The different results are due to differences in the processes that are
running on those same kernels, and in particular the frequency at which
those processes do stuff and sleep. Where enough detail has been
available on various problem reports, I have always found much more CPU
activity than on my server system with no GUI. These have typically been
GUI-based "desktop" Linux systems. Where I have been able to figure
it out, the real "idle" load has been between 0.1 and 0.2 and reported
as about 0.8 to 1.2.
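
To illustrate why the wake/sleep frequency of background processes matters for a sampled load average, here is a toy model. It is only a sketch under stated assumptions: HZ = 1000, a single hypothetical task that is runnable for 5 of every 50 ticks, and a sampler that looks at the run queue once per interval. It does not model the nohz idle accounting or the 10 tick window discussed in this thread. The one kernel fact it relies on is that LOAD_FREQ is defined as 5*HZ+1 ticks.

```python
HZ = 1000  # assumed tick rate; the real value is a kernel config option


def sampled_load(sample_interval, period=50, busy=5, samples=1000):
    """Fraction of samples that find a periodic task runnable.

    The hypothetical task is runnable for the first `busy` ticks of
    every `period` ticks, so its true average load is busy / period.
    """
    hits = sum(
        1 for n in range(samples)
        if (n * sample_interval) % period < busy
    )
    return hits / samples


true_load = 5 / 50
# Interval is an exact multiple of the task period: every sample lands
# on the same phase of the task, here always inside its busy window.
locked = sampled_load(5 * HZ)
# LOAD_FREQ-style interval (5*HZ+1): the sample phase drifts by one tick
# each time, so the samples sweep across the whole task period.
drifting = sampled_load(5 * HZ + 1)

print(true_load, locked, drifting)
```

In this toy, the phase-locked sampler sees the task as runnable on every sample even though it is busy only a tenth of the time, while the drifting sampler recovers the true fraction. Real systems sit somewhere in between, which is one way two machines on the same kernel can report different idle loads.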

All of my analysis of these reported load averages has been
based on the assumption that the background load is close enough to 0 to
ignore. Obviously that assumption needed to be checked; see [1]. Also see
the attached PNG file (also posted at [2]). (Summary: the same as Lesław.)

By the way, I found and tested 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146.
It is similar (minimally tested).

I am certainly not an expert, and I find the load average area of the
code extremely difficult to follow and understand. That being said, I
think the root issue here is the 10 tick grace period. I think that
CPU idle enter/exit transitions cannot be ignored during this period
and somehow need to be accumulated towards the next sample time. So far,
I have been unsuccessful in trying to help with a suggested solution. I will
continue to try.

Disclaimers:

My web pages and notes often refer to reported load averages to two
decimal places. I agree that is ridiculous. One should expect accuracy
of only +/- 0.1 to 0.15 at best, and that only for the 15 minute average
after settling time. It is worse for the shorter time constants.

It is hoped that readers understand that the 15 minute reported load
average never goes below 0.05 (after it has gone above that value once).
That is simply a consequence of integer math with a finite number of bits.
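
The 0.05 floor can be reproduced with a small fixed-point simulation. This is a sketch, not the kernel source: the constants (FSHIFT = 11, EXP_15 = 2037), the round-to-nearest step in calc_load, and the FIXED_1/200 display offset are written from memory of the 3.2-era code and should be treated as assumptions; decay_to_floor and displayed are illustrative helper names.

```python
FSHIFT = 11             # bits of fixed-point precision (assumed)
FIXED_1 = 1 << FSHIFT   # 1.0 in fixed point
EXP_15 = 2037           # 15 minute decay factor, as fixed point (assumed)


def calc_load(load, exp, active):
    """One sample step of the exponentially decaying average."""
    load = load * exp + active * (FIXED_1 - exp)
    load += 1 << (FSHIFT - 1)   # round to nearest instead of truncating
    return load >> FSHIFT


def displayed(load):
    """Format a fixed-point load the way /proc/loadavg prints it."""
    load += FIXED_1 // 200      # rounding offset applied before printing
    whole = load >> FSHIFT
    frac = ((load & (FIXED_1 - 1)) * 100) >> FSHIFT
    return "%d.%02d" % (whole, frac)


def decay_to_floor(load, exp):
    """Apply zero-load samples until the stored value stops changing."""
    while True:
        nxt = calc_load(load, exp, 0)
        if nxt == load:
            return load
        load = nxt


# Start from a load of 1.00 and let the system sit with nothing runnable.
floor_15 = decay_to_floor(FIXED_1, EXP_15)
print(floor_15, displayed(floor_15))
```

Under these assumptions the stored value stops at 93 (out of 2048), because the rounding step gives back exactly what one decay step removes, and 93 is printed as 0.05.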

[1] http://www.smythies.com/~doug/network/load_average/background.html
[2] http://www.smythies.com/~doug/network/load_average/background_histograms.png

See also general related web notes at: http://www.smythies.com/~doug/network/load_average/index.html

Doug Smythies


[-- Attachment #2: background_histograms.png --]
[-- Type: image/png, Size: 43479 bytes --]


* Re: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
  2012-05-30 14:30     ` Doug Smythies
  2012-05-30 14:54       ` Anders Boström
@ 2012-06-10 17:49       ` Jonathan Nieder
  2012-06-12  6:12         ` Doug Smythies
  1 sibling, 1 reply; 8+ messages in thread
From: Jonathan Nieder @ 2012-06-10 17:49 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Anders Boström', linux-kernel,
	'Lesław Kopeć', 'Aman Gupta',
	'Peter Zijlstra', 'Thomas Gleixner', Charles Wang

Hi Doug et al,

Doug Smythies wrote:

> "does 556061b00c9f ("sched/nohz: Fix rq->cpu_load[] calculations",
> 2012-05-11) change anything?"
>
> I back edited those changes into my test environment yesterday. It
> made no difference with respect to this issue. (minimally tested.)
[...]
> By the way, I found and tested 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146
> It is similar (minimally tested).
>
> I am certainly not an expert, and I find the load average area of the
> code extremely difficult to follow and understand. That being said, I
> think the root issue here is the 10 tick grace period. I think that
> CPU idle enter/exit transitions cannot be ignored during this period
> and somehow need to be accumulated towards the next sample time. So far,
> I have been unsuccessful in trying to help with a suggested solution. I will
> continue to try.

Another load average related patch is being discussed (not meant
particularly to address the too-low load case, just mentioning it
FYI):

	sched: Folding nohz load accounting more accurate

	After patch 453494c3d4 (sched: Fix nohz load accounting -- again!), we can fold
	the idle into calc_load_tasks_idle between the last cpu load calculating and
	calc_global_load calling. However, a problem still exists between the first cpu
	load calculating and the last cpu load calculating. Every time we do load
	calculating, calc_load_tasks_idle will be added into calc_load_tasks, even if
	the idle load is caused by calculated cpus. This problem is also described in
	the following link:

	https://lkml.org/lkml/2012/5/24/419

	This bug can be found in our work load. The average running processes number 
	is about 15, but the load only shows about 4.

From [*].

Hope that helps,
Jonathan

[*] http://thread.gmane.org/gmane.linux.kernel/1310462


* RE: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
  2012-06-10 17:49       ` Jonathan Nieder
@ 2012-06-12  6:12         ` Doug Smythies
  0 siblings, 0 replies; 8+ messages in thread
From: Doug Smythies @ 2012-06-12  6:12 UTC (permalink / raw)
  To: 'Jonathan Nieder'
  Cc: 'Anders Boström', linux-kernel,
	'Lesław Kopeć', 'Aman Gupta',
	'Peter Zijlstra', 'Thomas Gleixner',
	'Charles Wang', Doug Smythies

>> On 2012.06.08 10:01 Doug Smythies wrote:
>  On 2012.06.10 10:50 Jonathan Nieder wrote:

>> By the way, I found and tested 
>> 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146
>> It is similar (minimally tested).

A day later it was included in kernel 3.5 RC2, which I also tested, for
low load conditions only (in case I had made some mistake with my manual
back edit).
Herein, the abbreviation "5aaa" means Kernel 3.5 RC2 with
5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 and its predecessors.

> Another load average related patch is being discussed (not meant
> particularly to address the too-low load case, just mentioning it FYI):

>	sched: Folding nohz load accounting more accurate
> [...]
> From [*].
> [*] http://thread.gmane.org/gmane.linux.kernel/1310462

Jonathan: Thanks for the reference.

I also back edited that patch (by Charles Wang) into my working Kernel.
Herein, the abbreviation "Wang" means my working Kernel (3.2.0-24.39
(Ubuntu reference)) with these back edits: The above referenced patch by
Charles Wang; 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146;
556061b00c9f2fd6a5524b6bde823ef12f299ecf;
and c308b56b5398779cd3da0f62ab26b0453494c3d4.

The abbreviation "c308" means my working kernel with only
c308b56b5398779cd3da0f62ab26b0453494c3d4.

The abbreviation "Control" means a tick based kernel compiled with
CONFIG_NO_HZ=no.

See the attached PNG file (and/or [1]) for relatively low load test
results.

Summary:

"c308" and "5aaa" are the same, with reported load averages higher than
actual.
"Wang" is worse: its reported load averages are even higher, and so
further in error.
"Control" tends to track, but sometimes its reported load averages are
somewhat low.

[1] http://www.smythies.com/~doug/network/load_average/wang_compare.png

Doug Smythies


[-- Attachment #2: wang_compare.png --]
[-- Type: image/png, Size: 78127 bytes --]
