Better local_t implementation needed

linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Better local_t implementation needed
@ 2007-04-20 10:56 Andi Kleen
  2007-04-20 16:10 ` Christoph Lameter
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Andi Kleen @ 2007-04-20 10:56 UTC (permalink / raw)
  To: linux-arch; +Cc: Christoph Lameter

Right now local_t falls back to atomic.h on a lot of architectures:

% grep generic include/asm*/local.h
include/asm-arm/local.h:#include <asm-generic/local.h>
include/asm-arm26/local.h:#include <asm-generic/local.h>
include/asm-avr32/local.h:#include <asm-generic/local.h>
include/asm-cris/local.h:#include <asm-generic/local.h>
include/asm-frv/local.h:#include <asm-generic/local.h>
include/asm-h8300/local.h:#include <asm-generic/local.h>
include/asm-m32r/local.h:#include <asm-generic/local.h>
include/asm-m68k/local.h:#include <asm-generic/local.h>
include/asm-m68knommu/local.h:#include <asm-generic/local.h>
include/asm-powerpc/local.h:#include <asm-generic/local.h>
include/asm-s390/local.h:#include <asm-generic/local.h>
include/asm-sh/local.h:#include <asm-generic/local.h>
include/asm-sh64/local.h:#include <asm-generic/local.h>
include/asm-sparc/local.h:#include <asm-generic/local.h>
include/asm-v850/local.h:#include <asm-generic/local.h>
include/asm-xtensa/local.h:#include <asm-generic/local.h>

and asm-generic.h/local.h falls back to atomic_t

This is unfortunate because if one wants to use local_t for 
per CPU counters it will be a full atomic operation which is probably
slow at least on all architectures that support MP.
Using local_t for per cpu counters is nice because then
one can use cpu_local_add() etc. and that generates very good 
code at least on x86 and a few other architectures. That would
then allow very cheap per CPU statistics, which are useful
in a number of subsystems (like networking or MM code)

e.g. on x86 with some of the pending per cpu patches we could
in the end implement cpu_local_add as a single non atomic instruction.
This would compare very favourably to the complicated
code sequences that right now are generated for some of the
statistics counters.

There used to be a portable implementation of local.h 
that instead defines local_t as a two value array 
indexed by in_interrupt(). I'm considering to add that back.

Drawback will be larger code. 

Architectures that have cheap atomic_t can just use the atomic_t 
fallback. That should be all architectures that are not MP capable?

If you have cheap save_flags/cli/restore_flags that could be also used.

Or some other architecture specific implementation. For example x86
which has atomic on a CPU read/modify/write instructions can just use those.

I would urge you that if it's  easy to do a better local_t to implement it.

Comments?

-Andi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Better local_t implementation needed
  2007-04-20 10:56 Better local_t implementation needed Andi Kleen
@ 2007-04-20 16:10 ` Christoph Lameter
  2007-04-20 17:01   ` Andi Kleen
  2007-04-20 18:31 ` Luck, Tony
  2007-04-20 21:25 ` Roman Zippel
  2 siblings, 1 reply; 12+ messages in thread
From: Christoph Lameter @ 2007-04-20 16:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-arch

On Fri, 20 Apr 2007, Andi Kleen wrote:

> This is unfortunate because if one wants to use local_t for 
> per CPU counters it will be a full atomic operation which is probably
> slow at least on all architectures that support MP.

Right.

> Using local_t for per cpu counters is nice because then
> one can use cpu_local_add() etc. and that generates very good 
> code at least on x86 and a few other architectures. That would
> then allow very cheap per CPU statistics, which are useful
> in a number of subsystems (like networking or MM code)

Is the per cpu access fixed on x86? Last I checked it was not atomic.

i.e. inc(per_cpu_var) would

1. Calculate the per_cpu_var address by determining the processor id
   and then the address.

2. Increment that address.

This would mean that preemption needs to be disabled otherwise we can
get into trouble between 1 and 2.

What should be happening is that per_cpu_var does not require the use
of processor id but simply increments an address. Thus no preemption 
problems and less code.

I vaguely recall that there were issues with a segment register and the 
linker?

> e.g. on x86 with some of the pending per cpu patches we could
> in the end implement cpu_local_add as a single non atomic instruction.
> This would compare very favourably to the complicated
> code sequences that right now are generated for some of the
> statistics counters.

Right. I'd like to see that.

> There used to be a portable implementation of local.h 
> that instead defines local_t as a two value array 
> indexed by in_interrupt(). I'm considering to add that back.

Uhhh.... Yuck.

> Drawback will be larger code. 

Fix the per cpu area access instead?

> Comments?

Right now local_t is in many areas to complicated to handle and generates 
bad code. Fixing the per cpu area acces on x86_64 etc would make some 
progress here. I also do not think that atomic operations are that big an 
issue. We are writing a cacheline in any scenario which requires exclusive 
owernership. That is the major cost.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Better local_t implementation needed
  2007-04-20 16:10 ` Christoph Lameter
@ 2007-04-20 17:01   ` Andi Kleen
  2007-04-20 17:05     ` Christoph Lameter
  0 siblings, 1 reply; 12+ messages in thread
From: Andi Kleen @ 2007-04-20 17:01 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-arch

On Friday 20 April 2007 18:10:32 Christoph Lameter wrote:

> 
> > Using local_t for per cpu counters is nice because then
> > one can use cpu_local_add() etc. and that generates very good 
> > code at least on x86 and a few other architectures. That would
> > then allow very cheap per CPU statistics, which are useful
> > in a number of subsystems (like networking or MM code)
> 
> Is the per cpu access fixed on x86? Last I checked it was not atomic.

With upcomming patches per cpu can be directly referenced using %fs/%gs
Then cpu_local_add() etc will be a single instruction that is atomic
regarding interrupts.

> Uhhh.... Yuck.
> 
> > Drawback will be larger code. 
> 
> Fix the per cpu area access instead?

That doesn't help on architectures that don't have r-m-w instructions
on memory (like all RISCSs)

-Andi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Better local_t implementation needed
  2007-04-20 17:01   ` Andi Kleen
@ 2007-04-20 17:05     ` Christoph Lameter
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2007-04-20 17:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-arch

On Fri, 20 Apr 2007, Andi Kleen wrote:

> > Is the per cpu access fixed on x86? Last I checked it was not atomic.
> 
> With upcomming patches per cpu can be directly referenced using %fs/%gs
> Then cpu_local_add() etc will be a single instruction that is atomic
> regarding interrupts.

Ahh.. Great! When is that feature going to be available? 2.6.22?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Better local_t implementation needed
  2007-04-20 10:56 Better local_t implementation needed Andi Kleen
  2007-04-20 16:10 ` Christoph Lameter
@ 2007-04-20 18:31 ` Luck, Tony
  2007-04-20 20:14   ` Andi Kleen
  2007-04-20 21:25 ` Roman Zippel
  2 siblings, 1 reply; 12+ messages in thread
From: Luck, Tony @ 2007-04-20 18:31 UTC (permalink / raw)
  To: Andi Kleen, linux-arch; +Cc: Christoph Lameter

> Using local_t for per cpu counters is nice because then
> one can use cpu_local_add() etc. and that generates very good 
> code at least on x86 and a few other architectures. That would
> then allow very cheap per CPU statistics, which are useful
> in a number of subsystems (like networking or MM code)

But be warned that aggregating the per-cpu counters can be very
expensive on high cpu count ia64 systems.  We've had issues with
/proc files that provide system counts that do this in a cache
hostile way.

-Tony

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Better local_t implementation needed
  2007-04-20 18:31 ` Luck, Tony
@ 2007-04-20 20:14   ` Andi Kleen
  2007-04-20 20:27     ` Christoph Lameter
  0 siblings, 1 reply; 12+ messages in thread
From: Andi Kleen @ 2007-04-20 20:14 UTC (permalink / raw)
  To: Luck, Tony; +Cc: linux-arch, Christoph Lameter

On Friday 20 April 2007 20:31:45 Luck, Tony wrote:
> > Using local_t for per cpu counters is nice because then
> > one can use cpu_local_add() etc. and that generates very good 
> > code at least on x86 and a few other architectures. That would
> > then allow very cheap per CPU statistics, which are useful
> > in a number of subsystems (like networking or MM code)
> 
> But be warned that aggregating the per-cpu counters can be very
> expensive on high cpu count ia64 systems.  We've had issues with
> /proc files that provide system counts that do this in a cache
> hostile way.

Can you expand? What cache hostile way?

-Andi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Better local_t implementation needed
  2007-04-20 20:14   ` Andi Kleen
@ 2007-04-20 20:27     ` Christoph Lameter
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2007-04-20 20:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Luck, Tony, linux-arch

On Fri, 20 Apr 2007, Andi Kleen wrote:

> > But be warned that aggregating the per-cpu counters can be very
> > expensive on high cpu count ia64 systems.  We've had issues with
> > /proc files that provide system counts that do this in a cache
> > hostile way.
> 
> Can you expand? What cache hostile way?

We have fixed these issues using the ZVCs AFAIK. The event counters still 
have the same scheme though. They are rarely ever aggregated.

Per cpu counters should not be in arrays indexed by cpu. Because that will
place counters used by multiple processors in a cache line causing false
aliasing. VM event counters are indexed by event item and placed in the
per cpu areas for each counter.

Here is a patch that makes the VM use local_t. Patch is good for x86_64 
(exploits atomicity of inc/dec effectively if Andy has fixed the atomicity 
issues) but bad for IA64 since we are going from regular integer 
operations to atomics. I would think that the patch is not acceptable as 
long as cpu_local causes a regression on IA64.

Index: linux-2.6.21-rc6/include/linux/vmstat.h
===================================================================
--- linux-2.6.21-rc6.orig/include/linux/vmstat.h	2007-04-20 13:09:54.000000000 -0700
+++ linux-2.6.21-rc6/include/linux/vmstat.h	2007-04-20 13:13:30.000000000 -0700
@@ -6,6 +6,7 @@
 #include <linux/mm.h>
 #include <linux/mmzone.h>
 #include <asm/atomic.h>
+#include <asm/local.h>
 
 #ifdef CONFIG_ZONE_DMA
 #define DMA_ZONE(xx) xx##_DMA,
@@ -51,32 +52,26 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
  * generated will simply be the increment of a global address.
  */
 
-struct vm_event_state {
-	unsigned long event[NR_VM_EVENT_ITEMS];
-};
-
-DECLARE_PER_CPU(struct vm_event_state, vm_event_states);
+DECLARE_PER_CPU(local_t, vm_event_states)[NR_VM_EVENT_ITEMS];
 
 static inline void __count_vm_event(enum vm_event_item item)
 {
-	__get_cpu_var(vm_event_states).event[item]++;
+	__cpu_local_inc(vm_event_states[item]);
 }
 
 static inline void count_vm_event(enum vm_event_item item)
 {
-	get_cpu_var(vm_event_states).event[item]++;
-	put_cpu();
+	cpu_local_inc(vm_event_states[item]);
 }
 
 static inline void __count_vm_events(enum vm_event_item item, long delta)
 {
-	__get_cpu_var(vm_event_states).event[item] += delta;
+	__cpu_local_add(delta, vm_event_states[item]);
 }
 
 static inline void count_vm_events(enum vm_event_item item, long delta)
 {
-	get_cpu_var(vm_event_states).event[item] += delta;
-	put_cpu();
+	cpu_local_add(delta, vm_event_states[item]);
 }
 
 extern void all_vm_events(unsigned long *);
Index: linux-2.6.21-rc6/mm/vmstat.c
===================================================================
--- linux-2.6.21-rc6.orig/mm/vmstat.c	2007-04-20 13:10:08.000000000 -0700
+++ linux-2.6.21-rc6/mm/vmstat.c	2007-04-20 13:18:53.000000000 -0700
@@ -14,7 +14,8 @@
 #include <linux/cpu.h>
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
-DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
+DEFINE_PER_CPU(local_t, vm_event_states)[NR_VM_EVENT_ITEMS] =
+						{ LOCAL_INIT(0) };
 EXPORT_PER_CPU_SYMBOL(vm_event_states);
 
 static void sum_vm_events(unsigned long *ret, cpumask_t *cpumask)
@@ -26,7 +27,7 @@ static void sum_vm_events(unsigned long 
 
 	cpu = first_cpu(*cpumask);
 	while (cpu < NR_CPUS) {
-		struct vm_event_state *this = &per_cpu(vm_event_states, cpu);
+		local_t *this = per_cpu(vm_event_states, cpu);
 
 		cpu = next_cpu(cpu, *cpumask);
 
@@ -35,7 +36,7 @@ static void sum_vm_events(unsigned long 
 
 
 		for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
-			ret[i] += this->event[i];
+			ret[i] += local_read(&this[i]);
 	}
 }
 
@@ -59,12 +60,12 @@ EXPORT_SYMBOL_GPL(all_vm_events);
  */
 void vm_events_fold_cpu(int cpu)
 {
-	struct vm_event_state *fold_state = &per_cpu(vm_event_states, cpu);
+	local_t *fold_state = per_cpu(vm_event_states, cpu);
 	int i;
 
 	for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
-		count_vm_events(i, fold_state->event[i]);
-		fold_state->event[i] = 0;
+		count_vm_events(i, local_read(&fold_state[i]));
+		local_set(&fold_state[i], 0);
 	}
 }
 #endif /* CONFIG_HOTPLUG */
@@ -586,7 +587,8 @@ static void *vmstat_start(struct seq_fil
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 	v = kmalloc(NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long)
-			+ sizeof(struct vm_event_state), GFP_KERNEL);
+			+ sizeof(local_t) * NR_VM_EVENT_ITEMS,
+			GFP_KERNEL);
 #else
 	v = kmalloc(NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long),
 			GFP_KERNEL);

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Better local_t implementation needed
       [not found] <617E1C2C70743745A92448908E030B2A015F2392@scsmsx411.amr.corp.intel.com>
@ 2007-04-20 20:38 ` Christoph Lameter
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2007-04-20 20:38 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Andi Kleen, linux-arch

On Fri, 20 Apr 2007, Luck, Tony wrote:

> Look for the !defined(CONFIG_IA64) in fs/proc/proc_misc.c.
> 
> 	for (i = 0; i < NR_IRQS; i++)
> 		seq_printf(p, " %u", kstat_irqs(i));
> 
> That loop took several seconds on a big SGI box.

I guess the sum_vm_events() in mm/vmstat.c does the right thing. It goes 
by cpu and adds up all event counter of one cpu area before going to the 
next.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Better local_t implementation needed
  2007-04-20 10:56 Better local_t implementation needed Andi Kleen
  2007-04-20 16:10 ` Christoph Lameter
  2007-04-20 18:31 ` Luck, Tony
@ 2007-04-20 21:25 ` Roman Zippel
  2007-04-20 22:39   ` David Miller
  2 siblings, 1 reply; 12+ messages in thread
From: Roman Zippel @ 2007-04-20 21:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-arch, Christoph Lameter

Hi,

On Friday 20 April 2007 12:56, Andi Kleen wrote:

> There used to be a portable implementation of local.h
> that instead defines local_t as a two value array
> indexed by in_interrupt(). I'm considering to add that back.

That doesn't really work in general, interrupts can interrupt other 
interrupts.

bye, Roman

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Better local_t implementation needed
  2007-04-20 21:25 ` Roman Zippel
@ 2007-04-20 22:39   ` David Miller
  2007-04-21  0:25     ` Christoph Lameter
  0 siblings, 1 reply; 12+ messages in thread
From: David Miller @ 2007-04-20 22:39 UTC (permalink / raw)
  To: zippel; +Cc: ak, linux-arch, clameter

From: Roman Zippel <zippel@linux-m68k.org>
Date: Fri, 20 Apr 2007 23:25:04 +0200

> On Friday 20 April 2007 12:56, Andi Kleen wrote:
> 
> > There used to be a portable implementation of local.h
> > that instead defines local_t as a two value array
> > indexed by in_interrupt(). I'm considering to add that back.
> 
> That doesn't really work in general, interrupts can interrupt other 
> interrupts.

That's correct for hardware interrupts.  It will, however, work for
soft interrupts and similar contexts.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Better local_t implementation needed
  2007-04-20 22:39   ` David Miller
@ 2007-04-21  0:25     ` Christoph Lameter
  2007-04-21  4:45       ` David Miller
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Lameter @ 2007-04-21  0:25 UTC (permalink / raw)
  To: David Miller; +Cc: zippel, ak, linux-arch

On Fri, 20 Apr 2007, David Miller wrote:

> That's correct for hardware interrupts.  It will, however, work for
> soft interrupts and similar contexts.

Is there really a significant gain? It seems that such logic would be more 
expensive than an atomic operation.

local_t is saving atomic overhead right? It does no good for cacheline 
contention etc. Adding another int will increase cache footprint. 
Another rat hole?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Better local_t implementation needed
  2007-04-21  0:25     ` Christoph Lameter
@ 2007-04-21  4:45       ` David Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2007-04-21  4:45 UTC (permalink / raw)
  To: clameter; +Cc: zippel, ak, linux-arch

From: Christoph Lameter <clameter@sgi.com>
Date: Fri, 20 Apr 2007 17:25:16 -0700 (PDT)

> On Fri, 20 Apr 2007, David Miller wrote:
> 
> > That's correct for hardware interrupts.  It will, however, work for
> > soft interrupts and similar contexts.
> 
> Is there really a significant gain? It seems that such logic would be more 
> expensive than an atomic operation.
> 
> local_t is saving atomic overhead right? It does no good for cacheline 
> contention etc. Adding another int will increase cache footprint. 
> Another rat hole?

We've been doing it for SNMP statistics in the networking for a long
time and I'm pretty sure it's better than an atomic at least on
sparc64 where the atomic is 40 cycles minimum on several processors.

I really don't want to start using local_t's all over the damn place
if they are implemented as atomics until they are proven to be faster
which I doubt they are on sparc64 for one.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2007-04-21  4:45 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-04-20 10:56 Better local_t implementation needed Andi Kleen
2007-04-20 16:10 ` Christoph Lameter
2007-04-20 17:01   ` Andi Kleen
2007-04-20 17:05     ` Christoph Lameter
2007-04-20 18:31 ` Luck, Tony
2007-04-20 20:14   ` Andi Kleen
2007-04-20 20:27     ` Christoph Lameter
2007-04-20 21:25 ` Roman Zippel
2007-04-20 22:39   ` David Miller
2007-04-21  0:25     ` Christoph Lameter
2007-04-21  4:45       ` David Miller
     [not found] <617E1C2C70743745A92448908E030B2A015F2392@scsmsx411.amr.corp.intel.com>
2007-04-20 20:38 ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).