public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [patch] x86_64: Avoid too many remote cpu references due to /proc/stat
@ 2007-07-13  0:06 Ravikiran G Thirumalai
  2007-07-13  0:09 ` [patch] " Ravikiran G Thirumalai
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Ravikiran G Thirumalai @ 2007-07-13  0:06 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linux-kernel, Shai Fultheim (Shai@scalex86.org)

Too many remote cpu references due to /proc/stat.

On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
On every call to kstat_irqs, the process brings in per-cpu data from all
online cpus.  Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS
results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
/proc/stat is parsed by common commands like top, who etc, causing
lots of cacheline transfers

This statistic seems useless. Other 'big iron' arches disable this.
Can we disable computing/reporting this statistic?  This piece of
statistic is not human readable on x86_64 anymore,

If not, can we optimize computing this statistic so as to avoid
too many remote references (patch to follow)

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>

Index: linux-2.6.22/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.22.orig/fs/proc/proc_misc.c	2007-07-12 16:31:02.000000000 -0700
+++ linux-2.6.22/fs/proc/proc_misc.c	2007-07-12 16:33:45.226221759 -0700
@@ -498,7 +498,8 @@ static int show_stat(struct seq_file *p,
 	}
 	seq_printf(p, "intr %llu", (unsigned long long)sum);
 
-#if !defined(CONFIG_PPC64) && !defined(CONFIG_ALPHA) && !defined(CONFIG_IA64)
+#if !defined(CONFIG_PPC64) && !defined(CONFIG_ALPHA) && !defined(CONFIG_IA64) \
+					&& !defined(CONFIG_X86_64)
 	for (i = 0; i < NR_IRQS; i++)
 		seq_printf(p, " %u", kstat_irqs(i));
 #endif

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [patch] Avoid too many remote cpu references due to /proc/stat
  2007-07-13  0:06 [patch] x86_64: Avoid too many remote cpu references due to /proc/stat Ravikiran G Thirumalai
@ 2007-07-13  0:09 ` Ravikiran G Thirumalai
  2007-07-13  2:13 ` [patch] x86_64: " Andrew Morton
  2007-07-13 10:03 ` Christoph Hellwig
  2 siblings, 0 replies; 5+ messages in thread
From: Ravikiran G Thirumalai @ 2007-07-13  0:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linux-kernel, Shai Fultheim (Shai@scalex86.org)

On Thu, Jul 12, 2007 at 05:06:15PM -0700, Ravikiran G Thirumalai wrote:
> Too many remote cpu references due to /proc/stat.
> 
> On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
> On every call to kstat_irqs, the process brings in per-cpu data from all
> online cpus.  Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS
> results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
> /proc/stat is parsed by common commands like top, who etc, causing
> lots of cacheline transfers
> 
> This statistic seems useless. Other 'big iron' arches disable this.
> Can we disable computing/reporting this statistic?  This piece of
> statistic is not human readable on x86_64 anymore,
> 
> If not, can we optimize computing this statistic so as to avoid
> too many remote references (patch to follow)

Optimize show_stat to collect per-irq information just once.

On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
On every call to kstat_irqs, the process brings in per-cpu data from all
online cpus.  Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS
results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
Considering the fact that we already compute this value per-cpu, we can
save on the remote references as below.

Signed-off-by: Alok N Kataria <alok.kataria@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>

Index: linux-2.6.22/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.22.orig/fs/proc/proc_misc.c	2007-07-11 14:32:33.013197741 -0700
+++ linux-2.6.22/fs/proc/proc_misc.c	2007-07-12 16:28:24.871389279 -0700
@@ -443,6 +443,11 @@ static int show_stat(struct seq_file *p,
 	unsigned long jif;
 	cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
 	u64 sum = 0;
+	unsigned int *per_irq_sum;
+
+	per_irq_sum = kzalloc(sizeof(unsigned int)*NR_IRQS, GFP_KERNEL);
+	if (!per_irq_sum)
+		return -ENOMEM;
 
 	user = nice = system = idle = iowait =
 		irq = softirq = steal = cputime64_zero;
@@ -461,8 +466,11 @@ static int show_stat(struct seq_file *p,
 		irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
 		softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
 		steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
-		for (j = 0 ; j < NR_IRQS ; j++)
-			sum += kstat_cpu(i).irqs[j];
+		for (j = 0 ; j < NR_IRQS ; j++) {
+			unsigned int temp = kstat_cpu(i).irqs[j];
+			sum += temp;
+			per_irq_sum[j] += temp;
+		}
 	}
 
 	seq_printf(p, "cpu  %llu %llu %llu %llu %llu %llu %llu %llu\n",
@@ -500,7 +508,7 @@ static int show_stat(struct seq_file *p,
 
 #if !defined(CONFIG_PPC64) && !defined(CONFIG_ALPHA) && !defined(CONFIG_IA64)
 	for (i = 0; i < NR_IRQS; i++)
-		seq_printf(p, " %u", kstat_irqs(i));
+		seq_printf(p, " %u", per_irq_sum[i]);
 #endif
 
 	seq_printf(p,
@@ -515,6 +523,7 @@ static int show_stat(struct seq_file *p,
 		nr_running(),
 		nr_iowait());
 
+	kfree(per_irq_sum);
 	return 0;
 }
 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [patch] x86_64: Avoid too many remote cpu references due to /proc/stat
  2007-07-13  0:06 [patch] x86_64: Avoid too many remote cpu references due to /proc/stat Ravikiran G Thirumalai
  2007-07-13  0:09 ` [patch] " Ravikiran G Thirumalai
@ 2007-07-13  2:13 ` Andrew Morton
  2007-07-13  6:50   ` Ravikiran G Thirumalai
  2007-07-13 10:03 ` Christoph Hellwig
  2 siblings, 1 reply; 5+ messages in thread
From: Andrew Morton @ 2007-07-13  2:13 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Andi Kleen, linux-kernel, Shai Fultheim (Shai@scalex86.org)

On Thu, 12 Jul 2007 17:06:16 -0700 Ravikiran G Thirumalai <kiran@scalex86.org> wrote:

> Too many remote cpu references due to /proc/stat.
> 
> On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
> On every call to kstat_irqs, the process brings in per-cpu data from all
> online cpus.  Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS
> results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
> /proc/stat is parsed by common commands like top, who etc, causing
> lots of cacheline transfers

(256+32*63) * 63 = 143136

Do you have any actual numbers for how much this hurts?

> This statistic seems useless. Other 'big iron' arches disable this.
> Can we disable computing/reporting this statistic?  This piece of
> statistic is not human readable on x86_64 anymore,

Did you consider using percpu_counters (or such) at interrupt-time? 
(warning: percpu_counters aren't presently interrupt safe).

> If not, can we optimize computing this statistic so as to avoid
> too many remote references (patch to follow)

You other patch is a straightforward optimisation and should just be merged.

But afaict it will only provide a 2x speedup which I doubt is sufficient?



Another thought is: how many of the NR_IRQS counters are actually non-zero?
Because a pretty obvious optimisation would be to have a global
bitmap[NR_IRQS] and do

	if (!bitmap[irq])
		bitmap[irq] = 1;

at interrupt-time, then just print a "0" for the interrupts which have
never occurred within show_stats().


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [patch] x86_64: Avoid too many remote cpu references due to /proc/stat
  2007-07-13  2:13 ` [patch] x86_64: " Andrew Morton
@ 2007-07-13  6:50   ` Ravikiran G Thirumalai
  0 siblings, 0 replies; 5+ messages in thread
From: Ravikiran G Thirumalai @ 2007-07-13  6:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel, Shai Fultheim (Shai@scalex86.org)

On Thu, Jul 12, 2007 at 07:13:17PM -0700, Andrew Morton wrote:
> On Thu, 12 Jul 2007 17:06:16 -0700 Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> > Too many remote cpu references due to /proc/stat.
> >
> > On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
> > On every call to kstat_irqs, the process brings in per-cpu data from all
> > online cpus.  Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS
> > results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
> > /proc/stat is parsed by common commands like top, who etc, causing
> > lots of cacheline transfers
>
> (256+32*63) * 63 = 143136
>
> Do you have any actual numbers for how much this hurts?

Under certain load conditions with 2.6.21/22, this hurts a *lot* infact,
I have noticed delays in seconds when /proc/stat is read (48P box).
However, the reason for this is not the transfer of per-cpu kstat, but
dereferencing of the percpu data offset from a remote cpu pda!  Since the
cpu pda is kmalloced, with 2.6.21, the zone pcp structures happen to
share the same page  (and in some cases the internode cacheline as well)
with the cpu_pda.  Excessive page allocation activities on the remote
node causes reads to the cpu pda of the remote cpus to result in cache misses.

>
> > This statistic seems useless. Other 'big iron' arches disable this.
> > Can we disable computing/reporting this statistic?  This piece of
> > statistic is not human readable on x86_64 anymore,
>
> Did you consider using percpu_counters (or such) at interrupt-time? 
> (warning: percpu_counters aren't presently interrupt safe).

No, but given that alloc_percpu of a counter wastes so much memory, an array
of NR_IRQS percpu_counters will only cause more bloat no?


>
> > If not, can we optimize computing this statistic so as to avoid
> > too many remote references (patch to follow)
>
> You other patch is a straightforward optimisation and should just be merged.
>

Sure!  But I still don't see this statistic _really_ being useful anywhere;
/proc/stat output with NR_CPUS of 64 or 96 is ugly, but I might be missing
something

> But afaict it will only provide a 2x speedup which I doubt is sufficient?

No, it provides much more than 2x speedup with the 'hurting' workload.
This is because the optimized patch goes like this:

for_each_possible_cpu(i) {
	...
	for (j = 0 ; j < NR_IRQS ; j++) {
		unsigned int temp = kstat_cpu(i).irqs[j];
	...
	}
}

But the existing code with kstat_irqs goes as:

for (j = 0 ; j < NR_IRQS ; j++) {
	for_each_possible_cpu(cpu) {
		sum += kstat_cpu(cpu).irqs[irq]
	...
}

Former case causes less cache misses to the pda per-cpu dataoffset than
the latter by a huge margin.

Cache miss itself is not so bad per miss, but the number of misses
in the latter case adds up!

Yes, the cpu pda needs to be on an internode boundary by itself,  I will be
submitting a patch for the same soon.  I am thinking of page aligning cpu
pda and using the 12 lower bits in %gs to store the cpu processor id maybe :)

>
>
>
> Another thought is: how many of the NR_IRQS counters are actually non-zero?
> Because a pretty obvious optimisation would be to have a global
> bitmap[NR_IRQS] and do
>
> 	if (!bitmap[irq])
> 		bitmap[irq] = 1;
>
> at interrupt-time, then just print a "0" for the interrupts which have
> never occurred within show_stats().

Yep.  That would work too.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [patch] x86_64: Avoid too many remote cpu references due to /proc/stat
  2007-07-13  0:06 [patch] x86_64: Avoid too many remote cpu references due to /proc/stat Ravikiran G Thirumalai
  2007-07-13  0:09 ` [patch] " Ravikiran G Thirumalai
  2007-07-13  2:13 ` [patch] x86_64: " Andrew Morton
@ 2007-07-13 10:03 ` Christoph Hellwig
  2 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2007-07-13 10:03 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Andi Kleen, Andrew Morton, linux-kernel,
	Shai Fultheim (Shai@scalex86.org)

On Thu, Jul 12, 2007 at 05:06:16PM -0700, Ravikiran G Thirumalai wrote:
> Too many remote cpu references due to /proc/stat.
> 
> On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
> On every call to kstat_irqs, the process brings in per-cpu data from all
> online cpus.  Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS
> results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
> /proc/stat is parsed by common commands like top, who etc, causing
> lots of cacheline transfers
> 
> This statistic seems useless. Other 'big iron' arches disable this.
> Can we disable computing/reporting this statistic?  This piece of
> statistic is not human readable on x86_64 anymore,

At which point we might just remove it completely..


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2007-07-13 10:03 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-07-13  0:06 [patch] x86_64: Avoid too many remote cpu references due to /proc/stat Ravikiran G Thirumalai
2007-07-13  0:09 ` [patch] " Ravikiran G Thirumalai
2007-07-13  2:13 ` [patch] x86_64: " Andrew Morton
2007-07-13  6:50   ` Ravikiran G Thirumalai
2007-07-13 10:03 ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox