* [GIT PULL] cputime patch for 2.6.30-rc6
From: Martin Schwidefsky @ 2009-05-18 14:09 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, Michael Abbott, Jan Engelhardt
Hi Linus,
please pull from 'cputime' branch of
git://git390.marist.edu/pub/scm/linux-2.6.git cputime
to receive the following updates:
Michael Abbott (1):
Fix idle time field in /proc/uptime
fs/proc/uptime.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index 0c10a0b..c0ac0d7 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -4,13 +4,19 @@
 #include <linux/sched.h>
 #include <linux/seq_file.h>
 #include <linux/time.h>
+#include <linux/kernel_stat.h>
 #include <asm/cputime.h>
 
 static int uptime_proc_show(struct seq_file *m, void *v)
 {
 	struct timespec uptime;
 	struct timespec idle;
-	cputime_t idletime = cputime_add(init_task.utime, init_task.stime);
+	int len, i;
+	cputime_t idletime = 0;
+
+	for_each_possible_cpu(i)
+		idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
+	idletime = cputime64_to_clock_t(idletime);
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	monotonic_to_bootbased(&uptime);
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
From: Peter Zijlstra @ 2009-05-18 15:24 UTC (permalink / raw)
To: Martin Schwidefsky
Cc: Linus Torvalds, linux-kernel, Michael Abbott, Jan Engelhardt
On Mon, 2009-05-18 at 16:09 +0200, Martin Schwidefsky wrote:
> Hi Linus,
>
> please pull from 'cputime' branch of
>
> git://git390.marist.edu/pub/scm/linux-2.6.git cputime
>
> to receive the following updates:
>
> Michael Abbott (1):
> Fix idle time field in /proc/uptime
>
> fs/proc/uptime.c | 8 +++++++-
> 1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
> index 0c10a0b..c0ac0d7 100644
> --- a/fs/proc/uptime.c
> +++ b/fs/proc/uptime.c
> @@ -4,13 +4,19 @@
> #include <linux/sched.h>
> #include <linux/seq_file.h>
> #include <linux/time.h>
> +#include <linux/kernel_stat.h>
> #include <asm/cputime.h>
>
> static int uptime_proc_show(struct seq_file *m, void *v)
> {
> struct timespec uptime;
> struct timespec idle;
> - cputime_t idletime = cputime_add(init_task.utime, init_task.stime);
> + int len, i;
> + cputime_t idletime = 0;
cputime_zero, I guess..
> + for_each_possible_cpu(i)
> + idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
> + idletime = cputime64_to_clock_t(idletime);
>
> do_posix_clock_monotonic_gettime(&uptime);
> monotonic_to_bootbased(&uptime);
This is a world readable proc file, adding a for_each_possible_cpu() in
there scares me a little (this wouldn't be the first and only such case
though).
Suppose you have lots of cpus, and all those cpus are dirtying those
cachelines (who's updating idle time when they're idle?), then this loop
can cause a massive cacheline bounce fest.
Then think about userspace doing:
while :; do cat /proc/uptime > /dev/null; done
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
From: Michael Abbott @ 2009-05-18 16:28 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Martin Schwidefsky, Linus Torvalds, linux-kernel, Jan Engelhardt
On Mon, 18 May 2009, Peter Zijlstra wrote:
> On Mon, 2009-05-18 at 16:09 +0200, Martin Schwidefsky wrote:
> > Hi Linus,
> >
> > please pull from 'cputime' branch of
> >
> > git://git390.marist.edu/pub/scm/linux-2.6.git cputime
> >
> > to receive the following updates:
> >
> > Michael Abbott (1):
> > Fix idle time field in /proc/uptime
> >
> > fs/proc/uptime.c | 8 +++++++-
> > 1 files changed, 7 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
> > index 0c10a0b..c0ac0d7 100644
> > --- a/fs/proc/uptime.c
> > +++ b/fs/proc/uptime.c
> > @@ -4,13 +4,19 @@
> > #include <linux/sched.h>
> > #include <linux/seq_file.h>
> > #include <linux/time.h>
> > +#include <linux/kernel_stat.h>
> > #include <asm/cputime.h>
> >
> > static int uptime_proc_show(struct seq_file *m, void *v)
> > {
> > struct timespec uptime;
> > struct timespec idle;
> > - cputime_t idletime = cputime_add(init_task.utime, init_task.stime);
> > + int len, i;
> > + cputime_t idletime = 0;
>
> cputime_zero, I guess..
>
> > + for_each_possible_cpu(i)
> > + idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
> > + idletime = cputime64_to_clock_t(idletime);
> >
> > do_posix_clock_monotonic_gettime(&uptime);
> > monotonic_to_bootbased(&uptime);
>
> This is a world readable proc file, adding a for_each_possible_cpu() in
> there scares me a little (this wouldn't be the first and only such case
> though).
>
> Suppose you have lots of cpus, and all those cpus are dirtying those
> cachelines (who's updating idle time when they're idle?), then this loop
> can cause a massive cacheline bounce fest.
>
> Then think about userspace doing:
> while :; do cat /proc/uptime > /dev/null; done
Well, the offending code derives pretty well directly from /proc/stat,
which is used, for example, by top. So if there is an issue then I guess
it already exists.
There is a pending problem in this code: for a multiple cpu system we'll
end up with more idle time than elapsed time, which is not really very
nice. Unfortunately *something* has to be done here, as it looks as if
.utime and .stime (at least for init_task) have lost any meaning. I sort
of thought of dividing by the number of cpus, but that's not going to
work very well...
I came to this problem from a uni-processor instrument which uses
/proc/uptime to determine whether the system is overloaded (and discovers
on the current kernel that it is, permanently!). This fix is definitely
imperfect, but I think a better fix will require rather deeper knowledge
of kernel time accounting than I can offer.
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
From: Martin Schwidefsky @ 2009-05-19 8:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, Michael Abbott, Jan Engelhardt
On Mon, 18 May 2009 17:24:03 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2009-05-18 at 16:09 +0200, Martin Schwidefsky wrote:
> > Hi Linus,
> >
> > please pull from 'cputime' branch of
> >
> > git://git390.marist.edu/pub/scm/linux-2.6.git cputime
> >
> > to receive the following updates:
> >
> > Michael Abbott (1):
> > Fix idle time field in /proc/uptime
> >
> > fs/proc/uptime.c | 8 +++++++-
> > 1 files changed, 7 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
> > index 0c10a0b..c0ac0d7 100644
> > --- a/fs/proc/uptime.c
> > +++ b/fs/proc/uptime.c
> > @@ -4,13 +4,19 @@
> > #include <linux/sched.h>
> > #include <linux/seq_file.h>
> > #include <linux/time.h>
> > +#include <linux/kernel_stat.h>
> > #include <asm/cputime.h>
> >
> > static int uptime_proc_show(struct seq_file *m, void *v)
> > {
> > struct timespec uptime;
> > struct timespec idle;
> > - cputime_t idletime = cputime_add(init_task.utime, init_task.stime);
> > + int len, i;
> > + cputime_t idletime = 0;
>
> cputime_zero, I guess..
Yes, this needs to be cputime_zero.
> > + for_each_possible_cpu(i)
> > + idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
> > + idletime = cputime64_to_clock_t(idletime);
> >
> > do_posix_clock_monotonic_gettime(&uptime);
> > monotonic_to_bootbased(&uptime);
>
> This is a world readable proc file, adding a for_each_possible_cpu() in
> there scares me a little (this wouldn't be the first and only such case
> though).
>
> Suppose you have lots of cpus, and all those cpus are dirtying those
> cachelines (who's updating idle time when they're idle?), then this loop
> can cause a massive cacheline bounce fest.
>
> Then think about userspace doing:
> while :; do cat /proc/uptime > /dev/null; done
Well, don't do stupid things like that. That falls into the same
category as programs calling gettimeofday all the time.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
From: Peter Zijlstra @ 2009-05-19 9:00 UTC (permalink / raw)
To: Martin Schwidefsky
Cc: Linus Torvalds, linux-kernel, Michael Abbott, Jan Engelhardt
On Tue, 2009-05-19 at 10:49 +0200, Martin Schwidefsky wrote:
> Well, don't do stupid things like that. That falls into the same
> category as programs calling gettimeofday all the time.
Is gtod a global DoS? The worry I have about these cputime
for_each_cpu() loops in the proc code is that any odd unpriv user can
take the whole machine down (provided the machine is large enough).
Granted, this would only be a real problem for SGI and their insane
boxes atm, but it might very well be a problem for us in the near
future, given the rate Intel seems to increase cores.
So, I'm really not objecting too much to the patch at hand, but I'd love
to find a solution to this problem.
My personal favourite is to lower the resolution/accuracy of the answer
the bigger the box gets.
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
From: Martin Schwidefsky @ 2009-05-19 9:00 UTC (permalink / raw)
To: Michael Abbott
Cc: Peter Zijlstra, Linus Torvalds, linux-kernel, Jan Engelhardt
On Mon, 18 May 2009 17:28:53 +0100 (BST)
Michael Abbott <michael@araneidae.co.uk> wrote:
> > > + for_each_possible_cpu(i)
> > > + idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
> > > + idletime = cputime64_to_clock_t(idletime);
> > >
> > > do_posix_clock_monotonic_gettime(&uptime);
> > > monotonic_to_bootbased(&uptime);
> >
> > This is a world readable proc file, adding a for_each_possible_cpu() in
> > there scares me a little (this wouldn't be the first and only such case
> > though).
> >
> > Suppose you have lots of cpus, and all those cpus are dirtying those
> > cachelines (who's updating idle time when they're idle?), then this loop
> > can cause a massive cacheline bounce fest.
> >
> > Then think about userspace doing:
> > while :; do cat /proc/uptime > /dev/null; done
>
> Well, the offending code derives pretty well directly from /proc/stat,
> which is used, for example, by top. So if there is an issue then I guess
> it already exists.
>
> There is a pending problem in this code: for a multiple cpu system we'll
> end up with more idle time than elapsed time, which is not really very
> nice. Unfortunately *something* has to be done here, as it looks as if
> .utime and .stime (at least for init_task) have lost any meaning. I sort
> of thought of dividing by the number of cpus, but that's not going to
> work very well...
I don't see a problem here. In an idle multiple cpu system there IS
more idle time than elapsed time. What would make sense is to compare
elapsed time * #cpus with the idle time. But then there is cpu hotplug
which forces you to look at the delta of two measuring points where the
number of cpus did not change.
> I came to this problem from a uni-processor instrument which uses
> /proc/uptime to determine whether the system is overloaded (and discovers
> on the current kernel that it is, permanently!). This fix is definitely
> imperfect, but I think a better fix will require rather deeper knowledge
> of kernel time accounting than I can offer.
Hmm, I would use the idle time field from /proc/stat for that.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
From: Peter Zijlstra @ 2009-05-19 9:31 UTC (permalink / raw)
To: Martin Schwidefsky
Cc: Michael Abbott, Linus Torvalds, linux-kernel, Jan Engelhardt
On Tue, 2009-05-19 at 11:00 +0200, Martin Schwidefsky wrote:
> On Mon, 18 May 2009 17:28:53 +0100 (BST)
> Michael Abbott <michael@araneidae.co.uk> wrote:
>
> > > > + for_each_possible_cpu(i)
> > > > + idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
> > > > + idletime = cputime64_to_clock_t(idletime);
> > > >
> > > > do_posix_clock_monotonic_gettime(&uptime);
> > > > monotonic_to_bootbased(&uptime);
> > >
> > > This is a world readable proc file, adding a for_each_possible_cpu() in
> > > there scares me a little (this wouldn't be the first and only such case
> > > though).
> > >
> > > Suppose you have lots of cpus, and all those cpus are dirtying those
> > > cachelines (who's updating idle time when they're idle?), then this loop
> > > can cause a massive cacheline bounce fest.
> > >
> > > Then think about userspace doing:
> > > while :; do cat /proc/uptime > /dev/null; done
> >
> > Well, the offending code derives pretty well directly from /proc/stat,
> > which is used, for example, by top. So if there is an issue then I guess
> > it already exists.
> >
> > There is a pending problem in this code: for a multiple cpu system we'll
> > end up with more idle time than elapsed time, which is not really very
> > nice. Unfortunately *something* has to be done here, as it looks as if
> > .utime and .stime (at least for init_task) have lost any meaning. I sort
> > of thought of dividing by the number of cpus, but that's not going to
> > work very well...
>
> I don't see a problem here. In an idle multiple cpu system there IS
> more idle time than elapsed time. What would make sense is to compare
> elapsed time * #cpus with the idle time. But then there is cpu hotplug
> which forces you to look at the delta of two measuring points where the
> number of cpus did not change.
Sure, this one case isn't that bad, esp. as you note its about idle
time. However, see for example /proc/stat and fs/proc/stat.c:
	for_each_possible_cpu(i) {
		user = cputime64_add(user, kstat_cpu(i).cpustat.user);
		nice = cputime64_add(nice, kstat_cpu(i).cpustat.nice);
		system = cputime64_add(system, kstat_cpu(i).cpustat.system);
		idle = cputime64_add(idle, kstat_cpu(i).cpustat.idle);
		idle = cputime64_add(idle, arch_idle_time(i));
		iowait = cputime64_add(iowait, kstat_cpu(i).cpustat.iowait);
		irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
		softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
		steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
		guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest);
		for_each_irq_nr(j) {
			sum += kstat_irqs_cpu(j, i);
		}
		sum += arch_irq_stat_cpu(i);
	}
If that isn't a problem on a large machine, then I don't know what is.
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
From: Jan Engelhardt @ 2009-05-19 13:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Martin Schwidefsky, Linus Torvalds, linux-kernel, Michael Abbott
On Monday 2009-05-18 17:24, Peter Zijlstra wrote:
>> + for_each_possible_cpu(i)
>> + idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
>> + idletime = cputime64_to_clock_t(idletime);
>>
>> do_posix_clock_monotonic_gettime(&uptime);
>> monotonic_to_bootbased(&uptime);
>
>This is a world readable proc file, adding a for_each_possible_cpu() in
>there scares me a little (this wouldn't be the first and only such case
>though).
Well, maybe procfs could gain chmod support (with the same lifetime
guarantee that one would have with tmpfs), so that init scripts of
big boxes can chmod it 400 on boot if there is that potential user
threat:
>Then think about userspace doing:
> while :; do cat /proc/uptime > /dev/null; done
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
From: Martin Schwidefsky @ 2009-05-20 8:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Michael Abbott, Linus Torvalds, linux-kernel, Jan Engelhardt
On Tue, 19 May 2009 11:31:28 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2009-05-19 at 11:00 +0200, Martin Schwidefsky wrote:
> > I don't see a problem here. In an idle multiple cpu system there IS
> > more idle time than elapsed time. What would make sense is to compare
> > elapsed time * #cpus with the idle time. But then there is cpu hotplug
> > which forces you to look at the delta of two measuring points where the
> > number of cpus did not change.
>
> Sure, this one case isn't that bad, esp. as you note its about idle
> time. However, see for example /proc/stat and fs/proc/stat.c:
>
> for_each_possible_cpu(i) {
> user = cputime64_add(user, kstat_cpu(i).cpustat.user);
> nice = cputime64_add(nice, kstat_cpu(i).cpustat.nice);
> system = cputime64_add(system, kstat_cpu(i).cpustat.system);
> idle = cputime64_add(idle, kstat_cpu(i).cpustat.idle);
> idle = cputime64_add(idle, arch_idle_time(i));
> iowait = cputime64_add(iowait, kstat_cpu(i).cpustat.iowait);
> irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
> softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
> steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
> guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest);
> for_each_irq_nr(j) {
> sum += kstat_irqs_cpu(j, i);
> }
> sum += arch_irq_stat_cpu(i);
> }
>
> If that isn't a problem on a large machine, then I don't know what is.
Well, we better distinguish between the semantic problem and the
performance consideration, no? One thing is what the proc field is
supposed to contain, the other is how fast you can do it.
I have been referring to the semantic problem, but your point with the
performance is very valid as well. So
1) are we agreed that the second field of /proc/uptime should contain
the aggregate idle time of all cpus?
2) I agree that an endless loop of 'cat /proc/uptime' or
'cat /proc/stat' can have negative performance impact on a system with
many cpus. One way to deal with it would be to restrict access to the
interface. I do not like that idea too much. Another way would be to
limit the number of for_each_possible_cpu loops per second. Create
global variables that contain the aggregate values for the different
fields. If the last update has been too recent (e.g. less than 0.1
seconds ago), just print the old values again.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
From: Peter Zijlstra @ 2009-05-20 8:19 UTC (permalink / raw)
To: Martin Schwidefsky
Cc: Michael Abbott, Linus Torvalds, linux-kernel, Jan Engelhardt
On Wed, 2009-05-20 at 10:09 +0200, Martin Schwidefsky wrote:
> On Tue, 19 May 2009 11:31:28 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Tue, 2009-05-19 at 11:00 +0200, Martin Schwidefsky wrote:
> > > I don't see a problem here. In an idle multiple cpu system there IS
> > > more idle time than elapsed time. What would make sense is to compare
> > > elapsed time * #cpus with the idle time. But then there is cpu hotplug
> > > which forces you to look at the delta of two measuring points where the
> > > number of cpus did not change.
> >
> > Sure, this one case isn't that bad, esp. as you note its about idle
> > time. However, see for example /proc/stat and fs/proc/stat.c:
> >
> > for_each_possible_cpu(i) {
> > user = cputime64_add(user, kstat_cpu(i).cpustat.user);
> > nice = cputime64_add(nice, kstat_cpu(i).cpustat.nice);
> > system = cputime64_add(system, kstat_cpu(i).cpustat.system);
> > idle = cputime64_add(idle, kstat_cpu(i).cpustat.idle);
> > idle = cputime64_add(idle, arch_idle_time(i));
> > iowait = cputime64_add(iowait, kstat_cpu(i).cpustat.iowait);
> > irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
> > softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
> > steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
> > guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest);
> > for_each_irq_nr(j) {
> > sum += kstat_irqs_cpu(j, i);
> > }
> > sum += arch_irq_stat_cpu(i);
> > }
> >
> > If that isn't a problem on a large machine, then I don't know what is.
>
> Well, we better distinguish between the semantic problem and the
> performance consideration, no? One thing is what the proc field is
> supposed to contain, the other is how fast you can do it.
>
> I have been referring to the semantic problem, but your point with the
> performance is very valid as well. So
> 1) are we agreed that the second field of /proc/uptime should contain
> the aggregate idle time of all cpus?
More or less, yeah ;-)
> 2) I agree that an endless loop of 'cat /proc/uptime' or
> 'cat /proc/stat' can have negative performance impact on a system with
> many cpus. One way to deal with it would be to restrict access to the
> interface. I do not like that idea too much. Another way would be to
> limit the number of for_each_possible_cpu loops per second. Create
> global variables that contain the aggregate values for the different
> fields. If the last update has been too recent (e.g. less than 0.1
> seconds ago), just print the old values again.
Right, I was thinking about using things like percpu_counter and vmstat,
which get more inaccurate the larger your machine is. That is, inaccurate
in the temporal sense: they do converge to the actual result, in that
nothing is lost; it might just take a while to show up.
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
From: Michael Abbott @ 2009-05-20 8:44 UTC (permalink / raw)
To: Martin Schwidefsky
Cc: Peter Zijlstra, Linus Torvalds, linux-kernel, Jan Engelhardt
On Wed, 20 May 2009, Martin Schwidefsky wrote:
> On Tue, 19 May 2009 11:31:28 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2009-05-19 at 11:00 +0200, Martin Schwidefsky wrote:
> > > I don't see a problem here. In an idle multiple cpu system there IS
> > > more idle time than elapsed time. What would make sense is to
> > > compare elapsed time * #cpus with the idle time. But then there is
> > > cpu hotplug which forces you to look at the delta of two measuring
> > > points where the number of cpus did not change.
>
> > Well, we better distinguish between the semantic problem and the
> performance consideration, no? One thing is what the proc field is
> supposed to contain, the other is how fast you can do it.
>
> > I have been referring to the semantic problem, but your point with the
> performance is very valid as well. So
>
> 1) are we agreed that the second field of /proc/uptime should contain
> the aggregate idle time of all cpus?
I think that this very simple node (only two fields, let me call them
uptime and idle) should be amenable to simple interpretation. In
particular, I'd like to be able to compute
busy = uptime - idle
In other words, idle should be in the same units of time as uptime, which
is to say it should be in units of elapsed time, not aggregate CPU time.
As far as I can see, that is how it used to be (at least, that's how it
behaves on a dual core processor running a hugely patched 2.6.9 kernel
that I'm looking at here).
Thus, the patch to /proc/uptime that we're discussing has, in the view I'm
trying to argue, replaced a clearly broken idle field with a more subtly
broken version.
The strongest point in my case is this: the first field is in units of
elapsed wall clock time, and therefore the second should also be. Of
course, established practice here is also important, but there is some
ambiguity in the man pages I'm looking at -- this field is described in
proc(5) as:
the amount of time spent in idle process (seconds)
Not aggregate CPU time but elapsed time is, I think, meant here, but the
wording is ambiguous. Grepping for uptime in Documentation doesn't come
up with anything clear either, but I can quote old (2.6.9) precedent
(don't have a more recent multi-core system to hand).
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
From: Martin Schwidefsky @ 2009-05-25 10:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, Michael Abbott, Jan Engelhardt
On Tue, 19 May 2009 11:00:35 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> So, I'm really not objecting too much to the patch at hand, but I'd love
> to find a solution to this problem.
It is not hard to solve the problem for /proc/uptime, e.g. like this:
static u64 uptime_jiffies = INITIAL_JIFFIES;
static struct timespec ts_uptime;
static struct timespec ts_idle;

static int uptime_proc_show(struct seq_file *m, void *v)
{
	cputime_t idletime;
	u64 now;
	int i;

	now = get_jiffies_64();
	if (uptime_jiffies != now) {
		uptime_jiffies = now;
		idletime = cputime_zero;
		for_each_possible_cpu(i)
			idletime = cputime64_add(idletime,
						 kstat_cpu(i).cpustat.idle);
		do_posix_clock_monotonic_gettime(&ts_uptime);
		monotonic_to_bootbased(&ts_uptime);
		cputime_to_timespec(idletime, &ts_idle);
	}
	seq_printf(m, "%lu.%02lu %lu.%02lu\n",
		   (unsigned long) ts_uptime.tv_sec,
		   (ts_uptime.tv_nsec / (NSEC_PER_SEC / 100)),
		   (unsigned long) ts_idle.tv_sec,
		   (ts_idle.tv_nsec / (NSEC_PER_SEC / 100)));
	return 0;
}
For /proc/stat it is less clear. Just storing the values in static
variables is not such a good idea as there are lots of values.
10*NR_CPUS + NR_IRQS values to be exact. With NR_CPUS in the thousands
this will waste quite a bit of memory.
I fixed another problem with Michael's original patch and added a new
one to reduce the frequency of accesses to kstat_cpu for /proc/uptime.
You'll find both at:
git://git390.marist.edu/pub/scm/linux-2.6.git cputime
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
From: Martin Schwidefsky @ 2009-05-25 11:06 UTC (permalink / raw)
To: Michael Abbott
Cc: Peter Zijlstra, Linus Torvalds, linux-kernel, Jan Engelhardt
On Wed, 20 May 2009 09:44:33 +0100 (BST)
Michael Abbott <michael@araneidae.co.uk> wrote:
> On Wed, 20 May 2009, Martin Schwidefsky wrote:
> > On Tue, 19 May 2009 11:31:28 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, 2009-05-19 at 11:00 +0200, Martin Schwidefsky wrote:
> > > > I don't see a problem here. In an idle multiple cpu system there IS
> > > > more idle time than elapsed time. What would make sense is to
> > > > compare elapsed time * #cpus with the idle time. But then there is
> > > > cpu hotplug which forces you to look at the delta of two measuring
> > > > points where the number of cpus did not change.
> >
> > Well, we better distinguish between the semantic problem and the
> > performance consideration, no? One thing is what the proc field is
> > supposed to contain, the other is how fast you can do it.
> >
> > I have been referring to the semantic problem, but your point with the
> > performance is very valid as well. So
> >
> > 1) are we agreed that the second field of /proc/uptime should contain
> > the aggregate idle time of all cpus?
>
> I think that this very simple node (only two fields, let me call them
> uptime and idle) should be amenable to simple interpretation. In
> particular, I'd like to be able to compute
>
> busy = uptime - idle
On a single cpu system this works. On an SMP with cpu hotplug it breaks.
> In other words, idle should be in the same units of time as uptime, which
> is to say it should be in units of elapsed time, not aggregate CPU time.
They are, no? The idle time is in units of cputime but what is measured
is wall-clock time. In fact the idle time is the result of the
calculation
idle = wall_clock - user - nice - system -
iowait - irq - softirq - steal - guest
Trouble is that this is fundamentally per cpu. Just showing a single
cpu is kaputt, showing the sum of all cpus divided by the number of
cpus has problems with cpu hotplug. That leaves the simple approach to
just do the sum over all cpus.
> Thus, the patch to /proc/uptime that we're discussing has, in the view I'm
> trying to argue, replaced a clearly broken idle field with a more subtly
> broken version.
What is the least broken version?
> The strongest point in my case is this: the first field is in units of
> elapsed wall clock time, and therefore the second should also be. Of
> course, established practice here is also important, but there is some
> ambiguity in the man pages I'm looking at -- this field is described in
> proc(5) as:
>
> the amount of time spent in idle process (seconds)
>
> Not aggregate CPU time, but elapsed time, I think is meant here, but the
> wording is ambiguous. Grepping for uptime in Documentation doesn't come
> up with anything clear either, but I can quote old (2.6.9) precedent
> (don't have a more recent multi-core system to hand).
In my view of the world the idle field IS wall clock. That is how we
calculate it on s390.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
From: Peter Zijlstra @ 2009-05-25 11:09 UTC (permalink / raw)
To: Martin Schwidefsky
Cc: Linus Torvalds, linux-kernel, Michael Abbott, Jan Engelhardt
On Mon, 2009-05-25 at 12:50 +0200, Martin Schwidefsky wrote:
> On Tue, 19 May 2009 11:00:35 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > So, I'm really not objecting too much to the patch at hand, but I'd love
> > to find a solution to this problem.
>
> It is not hard to solve the problem for /proc/uptime, e.g. like this:
>
> static u64 uptime_jiffies = INITIAL_JIFFIES;
> static struct timespec ts_uptime;
> static struct timespec ts_idle;
>
> static int uptime_proc_show(struct seq_file *m, void *v)
> {
> 	cputime_t idletime;
> 	u64 now;
> 	int i;
>
> 	now = get_jiffies_64();
> 	if (uptime_jiffies != now) {
> 		uptime_jiffies = now;
> 		idletime = cputime_zero;
> 		for_each_possible_cpu(i)
> 			idletime = cputime64_add(idletime,
> 						 kstat_cpu(i).cpustat.idle);
> 		do_posix_clock_monotonic_gettime(&ts_uptime);
> 		monotonic_to_bootbased(&ts_uptime);
> 		cputime_to_timespec(idletime, &ts_idle);
> 	}
>
> 	seq_printf(m, "%lu.%02lu %lu.%02lu\n",
> 		   (unsigned long) ts_uptime.tv_sec,
> 		   (ts_uptime.tv_nsec / (NSEC_PER_SEC / 100)),
> 		   (unsigned long) ts_idle.tv_sec,
> 		   (ts_idle.tv_nsec / (NSEC_PER_SEC / 100)));
> 	return 0;
> }
>
> For /proc/stat it is less clear. Just storing the values in static
> variables is not such a good idea as there are lots of values.
> 10*NR_CPUS + NR_IRQS values to be exact. With NR_CPUS in the thousands
> this will waste quite a bit of memory.
Right, I know of for_each_possible_cpu() loops that took longer than a
jiffy and caused general melt-down -- not saying the loop for idle time
will be one such loop, but then it seems silly anyway: who's
incrementing the idle time when we're idle?
I really prefer using things like percpu_counter/vmstat that have error
bounds that scale with the number of cpus in the system.
We simply have to start educating people that numbers on the global
state of the machine are inaccurate (they were anyway, because by the
time the userspace bits that read the /proc file get scheduled again the
numbers will have changed again).
There's a variant of Heisenberg's uncertainty principle applicable to
(parallel) computers in that one either gets concurrency or accuracy on
global state, you cannot have both.
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
2009-05-25 11:09 ` Peter Zijlstra
@ 2009-05-25 11:24 ` Jan Engelhardt
2009-05-25 11:35 ` Martin Schwidefsky
1 sibling, 0 replies; 16+ messages in thread
From: Jan Engelhardt @ 2009-05-25 11:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Martin Schwidefsky, Linus Torvalds, linux-kernel, Michael Abbott
On Monday 2009-05-25 13:09, Peter Zijlstra wrote:
>
>I really prefer using things like percpu_counter/vmstat that have error
>bounds that scale with the number of cpus in the system.
>
>We simply have to start educating people that numbers on the global
>state of the machine are inaccurate (they were anyway, because by the
>time the userspace bits that read the /proc file get scheduled again the
>numbers will have changed again).
>
>There's a variant of Heisenberg's uncertainty principle applicable to
>(parallel) computers in that one either gets concurrency or accuracy on
>global state, you cannot have both.
>
Inaccuracies are fine, esp. for something like monotonically increasing
idle uptime.
* Re: [GIT PULL] cputime patch for 2.6.30-rc6
2009-05-25 11:09 ` Peter Zijlstra
2009-05-25 11:24 ` Jan Engelhardt
@ 2009-05-25 11:35 ` Martin Schwidefsky
1 sibling, 0 replies; 16+ messages in thread
From: Martin Schwidefsky @ 2009-05-25 11:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, Michael Abbott, Jan Engelhardt
On Mon, 25 May 2009 13:09:26 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2009-05-25 at 12:50 +0200, Martin Schwidefsky wrote:
> > On Tue, 19 May 2009 11:00:35 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > So, I'm really not objecting too much to the patch at hand, but I'd love
> > > to find a solution to this problem.
> >
> > It is not hard to solve the problem for /proc/uptime, e.g. like this:
> >
> > static u64 uptime_jiffies = INITIAL_JIFFIES;
> > static struct timespec ts_uptime;
> > static struct timespec ts_idle;
> >
> > static int uptime_proc_show(struct seq_file *m, void *v)
> > {
> > 	cputime_t idletime;
> > 	u64 now;
> > 	int i;
> >
> > 	now = get_jiffies_64();
> > 	if (uptime_jiffies != now) {
> > 		uptime_jiffies = now;
> > 		idletime = cputime_zero;
> > 		for_each_possible_cpu(i)
> > 			idletime = cputime64_add(idletime,
> > 						 kstat_cpu(i).cpustat.idle);
> > 		do_posix_clock_monotonic_gettime(&ts_uptime);
> > 		monotonic_to_bootbased(&ts_uptime);
> > 		cputime_to_timespec(idletime, &ts_idle);
> > 	}
> >
> > 	seq_printf(m, "%lu.%02lu %lu.%02lu\n",
> > 		   (unsigned long) ts_uptime.tv_sec,
> > 		   (ts_uptime.tv_nsec / (NSEC_PER_SEC / 100)),
> > 		   (unsigned long) ts_idle.tv_sec,
> > 		   (ts_idle.tv_nsec / (NSEC_PER_SEC / 100)));
> > 	return 0;
> > }
> >
> > For /proc/stat it is less clear. Just storing the values in static
> > variables is not such a good idea as there are lots of values.
> > 10*NR_CPUS + NR_IRQS values to be exact. With NR_CPUS in the thousands
> > this will waste quite a bit of memory.
>
> Right, I know of for_each_possible_cpu() loops that took longer than a
> jiffy and caused general melt-down -- not saying the loop for idle time
> will be one such loop, but then it seems silly anyway: who's
> incrementing the idle time when we're idle?
Psst, I do ;-) Look at the arch_idle_time macro in fs/proc/stat.c..
> I really prefer using things like percpu_counter/vmstat that have error
> bounds that scale with the number of cpus in the system.
>
> We simply have to start educating people that numbers on the global
> state of the machine are inaccurate (they were anyway, because by the
> time the userspace bits that read the /proc file get scheduled again the
> numbers will have changed again).
That is one problem; the other is that the values you'll get are not
atomic in any way. Not even the totals in /proc/stat match the sum over
the cpus.
> There's a variant of Heisenberg's uncertainty principle applicable to
> (parallel) computers in that one either gets concurrency or accuracy on
> global state, you cannot have both.
If the time you need to generate a value is longer than the maximum
error, you do have a problem.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
end of thread, other threads:[~2009-05-25 11:36 UTC | newest]
Thread overview: 16+ messages
2009-05-18 14:09 [GIT PULL] cputime patch for 2.6.30-rc6 Martin Schwidefsky
2009-05-18 15:24 ` Peter Zijlstra
2009-05-18 16:28 ` Michael Abbott
2009-05-19 9:00 ` Martin Schwidefsky
2009-05-19 9:31 ` Peter Zijlstra
2009-05-20 8:09 ` Martin Schwidefsky
2009-05-20 8:19 ` Peter Zijlstra
2009-05-20 8:44 ` Michael Abbott
2009-05-25 11:06 ` Martin Schwidefsky
2009-05-19 8:49 ` Martin Schwidefsky
2009-05-19 9:00 ` Peter Zijlstra
2009-05-25 10:50 ` Martin Schwidefsky
2009-05-25 11:09 ` Peter Zijlstra
2009-05-25 11:24 ` Jan Engelhardt
2009-05-25 11:35 ` Martin Schwidefsky
2009-05-19 13:32 ` Jan Engelhardt