* Re: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
From: Jonathan Nieder @ 2012-05-23 21:53 UTC
To: Anders Boström
Cc: linux-kernel, Lesław Kopeć, Aman Gupta, Doug Smythies

Hi Anders,

Anders Boström wrote[1]:

> Starting with 3.2.17-1, the CPU load accounting is broken when the
> computer is idle. The CPU load is reported as >0.50 when
> idle. 3.2.16-1 don't suffer from this problem.
>
> Suspected patch is the upstream patch
> "sched: Fix nohz load accounting -- again!"
> commit 5e2d50da11f0e6ec3ce8fe658d7c83b0b4346c68 to 3.2 and
> originating from c308b56b5398779cd3da0f62ab26b0453494c3d4 .
>
> See also:
>
> https://bugs.launchpad.net/unity/+bug/991370
> https://lkml.org/lkml/2012/5/22/310
> https://bugzilla.redhat.com/show_bug.cgi?id=822877
> https://bbs.archlinux.org/viewtopic.php?id=141289

Thanks for writing. If I understand correctly, the load average
calculation both before and after that commit is broken, in different
ways.

I'm cc-ing Lesław Kopeć, Aman Gupta, and Doug Smythies who worked on
the above change[2]. I recommend pulling Thomas Gleixner
<tglx@linutronix.de> into the conversation once you have a better idea
of what's going on or a new change to recommend.

If you'd like to also track this on a bugtracker,
http://bugzilla.kernel.org/, product Process Management, component
Scheduler might be a good place.

Aside from that, I can't really offer much to help you, but others on
linux-kernel might.

Hope that helps,
Jonathan

[1] http://bugs.debian.org/674153
[2] http://thread.gmane.org/gmane.linux.kernel/1249223/focus=1262319
[3] http://thread.gmane.org/gmane.linux.kernel/1291870/focus=1292058

* Re: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
From: Jonathan Nieder @ 2012-05-24 21:45 UTC
To: Anders Boström
Cc: linux-kernel, Lesław Kopeć, Aman Gupta, Doug Smythies,
    Peter Zijlstra, Thomas Gleixner

(cc-ing Peter and Thomas because there is a nice graph)

> Anders Boström wrote[1]:

>> Starting with 3.2.17-1, the CPU load accounting is broken when the
>> computer is idle. The CPU load is reported as >0.50 when
>> idle. 3.2.16-1 don't suffer from this problem.
>>
>> Suspected patch is the upstream patch
>> "sched: Fix nohz load accounting -- again!"
>> commit 5e2d50da11f0e6ec3ce8fe658d7c83b0b4346c68 to 3.2 and
>> originating from c308b56b5398779cd3da0f62ab26b0453494c3d4 .
>>
>> See also:
>>
>> https://bugs.launchpad.net/unity/+bug/991370
>> https://lkml.org/lkml/2012/5/22/310
>> https://bugzilla.redhat.com/show_bug.cgi?id=822877
>> https://bbs.archlinux.org/viewtopic.php?id=141289

I just found [1] from [2] which seems to describe the symptoms pretty
well. Peter, Thomas, advice?

Anders et al: does 556061b00c9f ("sched/nohz: Fix rq->cpu_load[]
calculations", 2012-05-11) change anything?

Thanks,
Jonathan

[1] https://launchpadlibrarian.net/105809696/commit_low_load_rev2.png
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/838811

* RE: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
From: Doug Smythies @ 2012-05-30 14:30 UTC
To: 'Jonathan Nieder', 'Anders Boström'
Cc: linux-kernel, 'Lesław Kopeć', 'Aman Gupta', 'Peter Zijlstra',
    'Thomas Gleixner', Doug Smythies

Hi,

The referenced PNG file was sent to everyone on the address list on
2012.05.22 and the previous version was sent 2012.05.09. The only
reason the PNG file was made was for the e-mail and because I was
instructed not to refer to external sources. The web page version of
the PNG file, which is kept up to date, is at [3].

"does 556061b00c9f ("sched/nohz: Fix rq->cpu_load[] calculations",
2012-05-11) change anything?"

I back edited those changes into my test environment yesterday. It
made no difference with respect to this issue. (minimally tested.)

This statement: "Starting with 3.2.17-1, the CPU load accounting is
broken when the computer is idle. The CPU load is reported as >0.50
when idle. 3.2.16-1 don't suffer from this problem."

In my opinion has the following mistakes:

. The computer is not actually idle. If it was actually idle the
  reported load average would be 0.
. Yes, the new kernel reported load average is high, as detailed in
  the PNG file or the web notes.
. The older kernel suffers from a different problem, under all other
  conditions being the same, the reported load average would have been
  too low.

[3] http://www.smythies.com/~doug/network/load_average/new.html

Doug Smythies

* Re: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
From: Anders Boström @ 2012-05-30 14:54 UTC
To: dsmythies
Cc: jrnieder, linux-kernel, leslaw.kopec, aman, a.p.zijlstra, tglx

>>>>> "DS" == Doug Smythies <dsmythies@telus.net> writes:

 DS> This statement: "Starting with 3.2.17-1, the CPU load accounting
 DS> is broken when the computer is idle. The CPU load is reported as
 DS> >0.50 when idle. 3.2.16-1 don't suffer from this problem."

 DS> In my opinion has the following mistakes:

 DS> . The computer is not actually idle. If it was actually idle the
 DS> reported load average would be 0.

Well, I tested in single user mode, with very few processes running,
mostly init, getty, bash and top (+ a lot of kernel threads). And
3.2.17 reported a load of >0.5 . Under the same conditions 3.2.16
typically reports 0.01 or 0.00 .

 DS> . Yes, the new kernel reported load average is high, as detailed
 DS> in the PNG file or the web notes.

 DS> . The older kernel suffers from a different problem, under all
 DS> other conditions being the same, the reported load average would
 DS> have been too low.

I don't know if 0.01 is *too* low, but it should be much closer to the
truth than >0.5.

/ Anders

* Re: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
From: Lesław Kopeć @ 2012-06-05 15:35 UTC
To: Anders Boström
Cc: dsmythies, jrnieder, linux-kernel, aman, a.p.zijlstra, tglx

On 05/30/2012 04:54 PM, Anders Boström wrote:
> DS> This statement: "Starting with 3.2.17-1, the CPU load accounting
> DS> is broken when the computer is idle. The CPU load is reported as
> DS> >0.50 when idle. 3.2.16-1 don't suffer from this problem."
> DS> In my opinion has the following mistakes:
> DS> . The computer is not actually idle. If it was actually idle the
> DS> reported load average would be 0.
>
> Well, I tested in single user mode, with very few processes running,
> mostly init, getty, bash and top (+ a lot of kernel threads). And
> 3.2.17 reported a load of >0.5 . Under the same conditions 3.2.16
> typically reports 0.01 or 0.00 .

I've tried to reproduce the problem, but haven't had much luck. I've
tested vanilla and Debian kernels versions 3.2.16 and 3.2.17. Load on
an idle or slightly busy system is the same across all versions.

vanilla 3.2.16    0.15 0.07 0.06
vanilla 3.2.17    0.17 0.11 0.13
Debian 3.2.16-1   0.13 0.07 0.05
Debian 3.2.17-1   0.10 0.09 0.11

When the system is completely idle load drops to 0. I've also tried
3.2.17 with 556061b00c9f, but it makes no difference and in comparison
to plain 3.2.17 load is the same even on a busy system.

I can't explain why we're getting different results on the same
kernels. If you'd like more details just ask.

-- 
Lesław Kopeć
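
Readings like the ones in the table above can be collected in a repeatable way by parsing /proc/loadavg on each kernel under test. The following is only a minimal sketch: the names LoadAvg, parse_loadavg and read_loadavg are hypothetical helpers, not something anyone in this thread used, and it assumes the usual /proc/loadavg layout (three averages, then running/total tasks, then the last PID).

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    """The 1, 5 and 15 minute load averages as reported by the kernel."""
    one: float
    five: float
    fifteen: float


def parse_loadavg(text: str) -> LoadAvg:
    """Parse the first three whitespace-separated fields of a loadavg line."""
    fields = text.split()
    return LoadAvg(float(fields[0]), float(fields[1]), float(fields[2]))


def read_loadavg(path: str = "/proc/loadavg") -> LoadAvg:
    """Read and parse the current load averages (Linux only)."""
    with open(path) as f:
        return parse_loadavg(f.read())


# Example with a literal line, so the sketch also runs off-Linux.
sample = parse_loadavg("0.15 0.07 0.06 1/123 4567")
print(sample.one, sample.five, sample.fifteen)
```

Since the averages take several minutes to settle, a comparison between kernels would presumably call read_loadavg() only after the machine has been left in the same state for at least 15 minutes.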

* RE: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
From: Doug Smythies @ 2012-06-08 17:01 UTC
To: 'Lesław Kopeć', 'Anders Boström'
Cc: jrnieder, linux-kernel, aman, a.p.zijlstra, tglx, Doug Smythies

>> On 2012.05.30 07:54, Anders Boström wrote:
> On 2012.06.05 08:35, Lesław Kopeć wrote:

>> Well, I tested in single user mode, with very few processes running,
>> mostly init, getty, bash and top (+ a lot of kernel threads). And
>> 3.2.17 reported a load of >0.5 . Under the same conditions 3.2.16
>> typically reports 0.01 or 0.00 .

>> I don't know if 0.01 is *too* low, but it should be much closer to the
>> truth than >0.5.

I agree. However the not "idle" case needs to also be considered.
For a real load of 5.70 a reported load average of 0 is much further
from the truth than the 5.6 being reported now, for example.

> When the system is completely idle load drops to 0. I've also tried
> 3.2.17 with 556061b00c9f, but it makes no difference and in comparison
> to plain 3.2.17 load is the same even on a busy system.

> I can't explain why we're getting different results on the same kernels.

The different results are due to differences in the processes that are
running on those same kernels, and in particular the frequency at which
those processes do stuff and sleep.

Where enough detail has been available on various problem reports, I
have always found much more CPU activity than on my server system with
no GUI. These have typically been GUI based "desktop" linux systems.
Where I have been able to figure it out, the real "idle" load has been
between 0.1 and 0.2 and reported as about 0.8 to 1.2.

All of my analysis work for this reported load averages work has been
based on the assumption that the background load is close enough to 0
to ignore. Obviously that assumption needed to be checked, [1]. Also
see the attached PNG file (also posted at [2]). (Summary: The same as
Lesław)

By the way, I found and tested 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146
It is similar (minimally tested).

I am certainly not an expert, and I find the load average area of the
code extremely difficult to follow and understand. That being said, I
think the root issue here is the 10 tick grace period. I think that
cpu idle enter exit transitions can not be ignored during this period,
and somehow needs to be accumulated towards the next sample time. So
far, I have been unsuccessful trying to help with a suggested solution.
I will continue to try.

Disclaimers:

My web pages and notes often refer to reported load averages to two
decimal places. I agree that is ridiculous. One should only expect
+- 0.1 to 0.15 at best, and for the 15 minute average, after settle
time. Worse for the shorter time constants.

It is hoped that readers understand that the 15 minute reported load
average never goes below 0.05 (after it has gone above that value
once). That is a simple finite number of bits integer math issue.

[1] http://www.smythies.com/~doug/network/load_average/background.html
[2] http://www.smythies.com/~doug/network/load_average/background_histograms.png

See also general related web notes at:
http://www.smythies.com/~doug/network/load_average/index.html

Doug Smythies

[-- Attachment #2: background_histograms.png --]
[-- Type: image/png, Size: 43479 bytes --]
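
The 0.05 floor on the 15 minute average mentioned in the disclaimers above can be reproduced with a few lines of integer arithmetic. This is a sketch under stated assumptions, not the kernel code itself: it assumes the 11-bit fixed-point constants (FSHIFT = 11, EXP_1 = 1884, EXP_5 = 2014, EXP_15 = 2037) and the round-to-nearest calc_load() used by kernels of this era, plus the FIXED_1/200 offset applied when /proc/loadavg is formatted. The helper names format_load and decay_floor are hypothetical.

```python
FSHIFT = 11                 # bits of fractional precision (assumed)
FIXED_1 = 1 << FSHIFT       # 1.0 in fixed point
EXP_1, EXP_5, EXP_15 = 1884, 2014, 2037   # decay factors per 5 s sample


def calc_load(load: int, exp: int, active: int) -> int:
    """One sample of the exponentially decaying average, rounded to nearest."""
    load = load * exp + active * (FIXED_1 - exp)
    load += 1 << (FSHIFT - 1)
    return load >> FSHIFT


def format_load(avenrun: int) -> str:
    """Format a fixed-point value the way /proc/loadavg is assumed to."""
    v = avenrun + FIXED_1 // 200
    return "%d.%02d" % (v >> FSHIFT, ((v & (FIXED_1 - 1)) * 100) >> FSHIFT)


def decay_floor(exp: int, start: int = FIXED_1) -> int:
    """Decay from `start` with zero active tasks until the value stops moving."""
    load = start
    while True:
        nxt = calc_load(load, exp, 0)
        if nxt == load:
            return load
        load = nxt


floor_15 = decay_floor(EXP_15)
print(floor_15, format_load(floor_15))   # 93 0.05
```

With these assumptions, rounding makes the value stop decaying once load * (FIXED_1 - exp) no longer exceeds half a unit, which for the 15 minute constant happens at a raw value of 93 and prints as 0.05. The same loop gives 0.01 for the 5 minute constant and 0.00 for the 1 minute constant. A value that never rose above the floor stays where it is, which matches the "after it has gone above that value once" qualifier.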

* Re: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
From: Jonathan Nieder @ 2012-06-10 17:49 UTC
To: Doug Smythies
Cc: 'Anders Boström', linux-kernel, 'Lesław Kopeć', 'Aman Gupta',
    'Peter Zijlstra', 'Thomas Gleixner', Charles Wang

Hi Doug et al,

Doug Smythies wrote:

> "does 556061b00c9f ("sched/nohz: Fix rq->cpu_load[] calculations",
> 2012-05-11) change anything?"
>
> I back edited those changes into my test environment yesterday. It
> made no difference with respect to this issue. (minimally tested.)
[...]
> By the way, I found and tested 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146
> It is similar (minimally tested).
>
> I am certainly not an expert, and I find the load average area of the
> code extremely difficult to follow and understand. That being said, I
> think the root issue here is the 10 tick grace period. I think that
> cpu idle enter exit transitions can not be ignored during this period,
> and somehow needs to be accumulated towards the next sample time. So far,
> I have been unsuccessful trying to help with a suggested solution. I will
> continue to try.

Another load average related patch is being discussed (not meant
particularly to address the too-low load case, just mentioning it FYI):

    sched: Folding nohz load accounting more accurate

    After patch 453494c3d4 (sched: Fix nohz load accounting -- again!),
    we can fold the idle into calc_load_tasks_idle between the last cpu
    load calculating and calc_global_load calling. However problem still
    exits between the first cpu load calculating and the last cpu load
    calculating. Every time when we do load calculating,
    calc_load_tasks_idle will be added into calc_load_tasks, even if the
    idle load is caused by calculated cpus. This problem is also
    described in the following link:

    https://lkml.org/lkml/2012/5/24/419

    This bug can be found in our work load. The average running
    processes number is about 15, but the load only shows about 4.

From [*].

Hope that helps,
Jonathan

[*] http://thread.gmane.org/gmane.linux.kernel/1310462

* RE: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle
From: Doug Smythies @ 2012-06-12 6:12 UTC
To: 'Jonathan Nieder'
Cc: 'Anders Boström', linux-kernel, 'Lesław Kopeć', 'Aman Gupta',
    'Peter Zijlstra', 'Thomas Gleixner', 'Charles Wang', Doug Smythies

>> On 2012.06.08 10:01 Doug Smythies wrote:
> On 2012.06.10 10:50 Jonathan Nieder wrote:

>> By the way, I found and tested
>> 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146
>> It is similar (minimally tested).

Which a day later was included in kernel 3.5 RC2, which I also tested
for low load conditions only (i.e. in case I made some mistake with my
manual back edit.) Herein, the abbreviation "5aaa" means Kernel 3.5 RC2
with 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 and its predecessors.

> Another load average related patch is being discussed (not meant
> particularly to address the too-low load case, just mentioning it FYI):

> sched: Folding nohz load accounting more accurate

> [...]

> From [*].
> [*] http://thread.gmane.org/gmane.linux.kernel/1310462

Jonathan: Thanks for the reference.

I also back edited that patch (by Charles Wang) into my working Kernel.
Herein, the abbreviation "Wang" means my working Kernel (3.2.0-24.39
(Ubuntu reference)) with these back edits: The above referenced patch
by Charles Wang; 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146;
556061b00c9f2fd6a5524b6bde823ef12f299ecf; and
c308b56b5398779cd3da0f62ab26b0453494c3d4.

The abbreviation "c308" means my working kernel with only
c308b56b5398779cd3da0f62ab26b0453494c3d4.

The abbreviation "Control" means a tick based kernel compiled with
CONFIG_NO_HZ=no.

See the attached PNG file (and or [1]) for relatively low load test
results.

Summary:
"c308" and "5aaa" are the same, with reported load averages higher
than actual.
"Wang" is worse, with reported load averages in error even higher.
"Control" tends to track, but sometimes reported load averages are
somewhat low.

[1] http://www.smythies.com/~doug/network/load_average/wang_compare.png

Doug Smythies

[-- Attachment #2: wang_compare.png --]
[-- Type: image/png, Size: 78127 bytes --]

Thread overview: 8+ messages
[not found] <20120523.144057.899060240318474097.anders@netinsight.net>
2012-05-23 21:53 ` [3.2.16 -> 3.2.17 regression] High reported CPU load when idle Jonathan Nieder
2012-05-24 21:45 ` Jonathan Nieder
2012-05-30 14:30 ` Doug Smythies
2012-05-30 14:54 ` Anders Boström
2012-06-05 15:35 ` Lesław Kopeć
2012-06-08 17:01 ` Doug Smythies
2012-06-10 17:49 ` Jonathan Nieder
2012-06-12 6:12 ` Doug Smythies