* light weight counters: race free through local_t?
@ 2006-06-10 5:30 Christoph Lameter
2006-06-14 16:05 ` Zoltan Menyhart
0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2006-06-10 5:30 UTC (permalink / raw)
To: akpm, ak; +Cc: linux-kernel, linux-ia64
One way of making the light weight counters race free for x86_64 and
i386 is to use local_t. With that those two platforms are fine.
However, the others fall back to atomic operations.
Maybe we could deal with that on a per-platform basis? Some platforms may
want to switch the local_t implementation away from atomics to regular
integers if preemption is not configured. Most commercial Linux distros
ship with preempt off. So this would preserve the speed of light weight
counters, while holding off the worst races. But it still could allow
interrupts while the counter is being incremented and so it would not
be race free. This would no longer allow the use of local_t for
refcounts but I think those uses are not that performance critical
and we may just switch to atomic. Maybe I am just off in fantasyland.
Andi?
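The "regular integers when preemption is off" idea can be sketched in user
space. All names here (sketch_local_t, the plain SKETCH_PREEMPT macro) are
illustrative stand-ins, not the kernel's asm/local.h API:

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch: with preemption off, a "local" counter can be a plain long,
 * because nothing else runs on this CPU between the load and the store
 * -- except interrupts, which is the remaining race described above.
 * With preemption on, fall back to an atomic increment. */
#define SKETCH_PREEMPT 0

typedef struct {
#if SKETCH_PREEMPT
	atomic_long v;
#else
	long v;
#endif
} sketch_local_t;

static inline void sketch_local_inc(sketch_local_t *l)
{
#if SKETCH_PREEMPT
	atomic_fetch_add(&l->v, 1);	/* race free, but a locked op */
#else
	l->v++;				/* plain load/add/store: fast, but
					   an interrupt can slip in between */
#endif
}

static inline long sketch_local_read(sketch_local_t *l)
{
#if SKETCH_PREEMPT
	return atomic_load(&l->v);
#else
	return l->v;
#endif
}
```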
Another thing to investigate (at least on ia64) is how significant the
impact of a fetchadd instruction is if none of the results are used. Maybe
it is tolerable and we can stay with the existing implementation.
On IA64 we would trade an interrupt disable / load / add / store / interrupt
enable against one fetchadd instruction with this patch. How bad/good a
trade is that?
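The trade in question, as a rough portable C11 analogue: atomic_fetch_add_explicit
with release ordering stands in for ia64's single fetchadd4.rel, and since user
space cannot mask interrupts, those steps appear as comments only.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_long fetchadd_ctr;
static long plain_ctr;

static void inc_fetchadd(void)
{
	/* one instruction on ia64: fetchadd4.rel (release semantics) */
	atomic_fetch_add_explicit(&fetchadd_ctr, 1, memory_order_release);
}

static void inc_irq_off(void)
{
	/* local_irq_save(flags);    -- mov rx=psr ;; rsm 0x4000 */
	plain_ctr++;                 /* load / add / store */
	/* local_irq_restore(flags); -- ssm 0x4000 */
}

static long read_fetchadd(void) { return atomic_load(&fetchadd_ctr); }
static long read_plain(void)    { return plain_ctr; }
```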
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.17-rc6-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.17-rc6-mm1.orig/mm/page_alloc.c 2006-06-09 17:51:18.789713315 -0700
+++ linux-2.6.17-rc6-mm1/mm/page_alloc.c 2006-06-09 22:06:09.491210492 -0700
@@ -1583,7 +1583,7 @@ static void show_node(struct zone *zone)
#endif
#ifdef CONFIG_VM_EVENT_COUNTERS
-DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
+DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{LOCAL_INIT(0)}};
static void sum_vm_events(unsigned long *ret, cpumask_t *cpumask)
{
@@ -1603,7 +1603,7 @@ static void sum_vm_events(unsigned long
for (i=0; i< NR_VM_EVENT_ITEMS; i++)
- ret[i] += this->event[i];
+ ret[i] += local_read(&this->event[i]);
}
}
@@ -2881,7 +2881,7 @@ static void vm_events_fold_cpu(int cpu)
for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
- count_vm_events(i, fold_state->event[i]);
+ count_vm_events(i, local_read(&fold_state->event[i]));
- fold_state->event[i] = 0;
+ local_set(&fold_state->event[i], 0);
}
}
Index: linux-2.6.17-rc6-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.17-rc6-mm1.orig/include/linux/page-flags.h 2006-06-09 22:06:18.001424043 -0700
+++ linux-2.6.17-rc6-mm1/include/linux/page-flags.h 2006-06-09 22:29:48.581945624 -0700
@@ -8,7 +8,7 @@
#include <linux/percpu.h>
#include <linux/cache.h>
#include <linux/types.h>
-
+#include <asm/local.h>
#include <asm/pgtable.h>
/*
@@ -108,10 +108,6 @@
/*
* Light weight per cpu counter implementation.
*
- * Note that these can race. We do not bother to enable preemption
- * or care about interrupt races. All we care about is to have some
- * approximate count of events.
- *
* Counters should only be incremented and no critical kernel component
* should rely on the counter values.
*
@@ -134,24 +130,24 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
};
struct vm_event_state {
- unsigned long event[NR_VM_EVENT_ITEMS];
+ local_t event[NR_VM_EVENT_ITEMS];
};
DECLARE_PER_CPU(struct vm_event_state, vm_event_states);
static inline unsigned long get_cpu_vm_events(enum vm_event_item item)
{
- return __get_cpu_var(vm_event_states).event[item];
+ return cpu_local_read(vm_event_states.event[item]);
}
static inline void count_vm_event(enum vm_event_item item)
{
- __get_cpu_var(vm_event_states).event[item]++;
+ cpu_local_inc(vm_event_states.event[item]);
}
static inline void count_vm_events(enum vm_event_item item, long delta)
{
- __get_cpu_var(vm_event_states).event[item] += delta;
+ cpu_local_add(delta, vm_event_states.event[item]);
}
extern void all_vm_events(unsigned long *);
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: light weight counters: race free through local_t?
2006-06-10 5:30 light weight counters: race free through local_t? Christoph Lameter
@ 2006-06-14 16:05 ` Zoltan Menyhart
2006-06-14 16:33 ` Christoph Lameter
0 siblings, 1 reply; 9+ messages in thread
From: Zoltan Menyhart @ 2006-06-14 16:05 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, ak, linux-kernel, linux-ia64
Christoph Lameter wrote:
> One way of making the light weight counters race free for x86_64 and
> i386 is to use local_t. With that those two platforms are fine.
>
> However, the others fall back to atomic operations.
>
> Maybe we could deal with that on a per-platform basis? Some platforms may
> want to switch the local_t implementation away from atomics to regular
> integers if preemption is not configured. Most commercial Linux distros
> ship with preempt off. So this would preserve the speed of light weight
> counters, while holding off the worst races. But it still could allow
> interrupts while the counter is being incremented and so it would not
> be race free. This would no longer allow the use of local_t for
> refcounts but I think those uses are not that performance critical
> and we may just switch to atomic. Maybe I am just off in fantasyland.
>
> Andi?
>
> Another thing to investigate (at least on ia64) is how significant the
> impact of a fetchadd instruction is if none of the results are used. Maybe
> it is tolerable and we can stay with the existing implementation.
>
> On IA64 we would trade an interrupt disable / load / add / store / interrupt
> enable against one fetchadd instruction with this patch. How bad/good a
> trade is that?
On one hand, switching to local counters can be a good idea if they are not
evicted from the caches by the usual NRU/LRU cache replacement...
On the other hand, I do not think the ia64's fetchadd instruction is
expensive.
If your data is in L2, then it takes 11 clock cycles.
I do not think the counters have got much chance to stay in L1.
Anyway, L1 is write through, you'll need to copy the updated value
back into L2.
As the shortest L2 access takes 5 clock cycles...
You need 2 of them. (Assuming a counter is always in L2.)
And add interrupt disable/enable time...
Regards,
Zoltan Menyhart
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: light weight counters: race free through local_t?
2006-06-14 16:05 ` Zoltan Menyhart
@ 2006-06-14 16:33 ` Christoph Lameter
2006-06-15 12:22 ` Zoltan Menyhart
0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2006-06-14 16:33 UTC (permalink / raw)
To: Zoltan Menyhart; +Cc: akpm, ak, linux-kernel, linux-ia64
On Wed, 14 Jun 2006, Zoltan Menyhart wrote:
> > On IA64 we would trade an interrupt disable / load / add / store
> > /interrupt enable against one fetchadd instruction with this patch. How
> > bad/good a trade is that?
>
> On one hand, switching to local counters can be a good idea if they are not
> evicted from the caches by the usual NRU/LRU cache replacement...
>
> On the other hand, I do not think the ia64's fetchadd instruction is
> expensive.
> If your data is in L2, then it takes 11 clock cycles.
>
> I do not think the counters have got much chance to stay in L1.
> Anyway, L1 is write through, you'll need to copy the updated value
> back into L2.
> As the shortest L2 access takes 5 clock cycles...
> You need 2 of them. (Assuming a counter is always in L2.)
> And add interrupt disable/enable time...
Could you do a clock cycle comparison of an
atomic_inc(__get_per_cpu(var))
(the fallback of local_t on ia64)
vs.
local_irq_save(flags)
__get_per_cpu(var)++
local_irq_restore(flags)
(ZVC like implementation)
vs.
get_per_cpu(var)++
put_cpu()
(current light weight counters)
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: light weight counters: race free through local_t?
2006-06-14 16:33 ` Christoph Lameter
@ 2006-06-15 12:22 ` Zoltan Menyhart
2006-06-15 15:56 ` Christoph Lameter
2006-06-15 16:06 ` Christoph Lameter
0 siblings, 2 replies; 9+ messages in thread
From: Zoltan Menyhart @ 2006-06-15 12:22 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, ak, linux-kernel, linux-ia64
Christoph Lameter wrote:
> Could you do a clock cycle comparison of an
>
> atomic_inc(__get_per_cpu(var))
> (the fallback of local_t on ia64)
>
> vs.
>
> local_irq_save(flags)
> __get_per_cpu(var)++
> local_irq_restore(flags)
> (ZVC like implementation)
>
> vs.
>
> get_per_cpu(var)++
> put_cpu()
> (current light weight counters)
The only thing I have at hand is a small test for the 1st case:
#include <stdio.h>
#include <asm/atomic.h>
#define GET_ITC() \
({ \
unsigned long ia64_intri_res; \
\
asm volatile ("mov %0=ar.itc" : "=r"(ia64_intri_res)); \
ia64_intri_res; \
})
#define N (1000 * 1000 * 100L)
atomic_t data;
int main(int c, char *v[])
{
unsigned long cycles;
int i;
cycles = GET_ITC();
for (i = 0; i < N; i++)
ia64_fetchadd4_rel(&data, 1);
cycles = GET_ITC() - cycles;
printf("%ld %d\n", cycles / N, atomic_read(&data));
}
It gives 11 clock cycles.
(The loop organizing instructions are "absorbed".)
"atomic_inc(__get_per_cpu(var))" compiles into:
mov rx = 0xffffffffffffxxxx // &__get_per_cpu(var)
;;
fetchadd4.rel ry = [rx], 1
It _should_ take 11 clock cycles, too. (Assuming it is in L2.)
For the 2nd case:
With a bit of modification, I can measure what
"__get_per_cpu(var)++" costs: 7 or 10 clock cycles, depending on
if the chance to find the counter in L1 is 100% or 0%:
int data;
static inline void store(int *addr, int data){
asm volatile ("st4 [%1] = %0" :: "r"(data), "r"(addr) : "memory");
}
static inline int load_nt1(int *addr)
{
int tmp;
asm volatile ("ld4.nt1 %0=[%1]" : "=r"(tmp) : "r" (addr));
return tmp;
}
int main(int c, char *v[])
{
unsigned long cycles;
int i, d;
cycles = GET_ITC();
for (i = 0; i < N; i++)
// Avoid optimizing out the "st4"
store(&data, data + 1);
cycles = GET_ITC() - cycles;
printf("%ld %d\n", cycles / N, data);
cycles = GET_ITC();
for (i = 0; i < N; i++){
// Do not use L1
d = load_nt1(&data);
store(&data, d + 1);
}
cycles = GET_ITC() - cycles;
printf("%ld %d\n", cycles / N, data);
}
"local_irq_save(flags)" compiles into:
mov rx = psr ;; // 13 clock cycles
rsm 0x4000 ;; // 5 clock cycles
"local_irq_restore(flags)" compiles into (at least):
ssm 0x4000 // 5 clock cycles
For the 3rd case:
If CONFIG_PREEMPT, then you need to add 2 * 7 clock cycles
for inc_preempt_count() / dec_preempt_count() + some more
for preempt_check_resched().
My conclusion: let's stick to atomic counters.
Regards,
Zoltan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: light weight counters: race free through local_t?
2006-06-15 12:22 ` Zoltan Menyhart
@ 2006-06-15 15:56 ` Christoph Lameter
2006-06-15 16:46 ` Zoltan Menyhart
2006-06-15 16:06 ` Christoph Lameter
1 sibling, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2006-06-15 15:56 UTC (permalink / raw)
To: Zoltan Menyhart; +Cc: akpm, ak, linux-kernel, linux-ia64
On Thu, 15 Jun 2006, Zoltan Menyhart wrote:
> My conclusion: let's stick to atomic counters.
Good to know. But we would run into trouble if the atomic counters would
be contended.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: light weight counters: race free through local_t?
2006-06-15 15:56 ` Christoph Lameter
@ 2006-06-15 16:46 ` Zoltan Menyhart
2006-06-15 18:14 ` Christoph Lameter
0 siblings, 1 reply; 9+ messages in thread
From: Zoltan Menyhart @ 2006-06-15 16:46 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, ak, linux-kernel, linux-ia64
Christoph Lameter wrote:
> Good to know. But we would run into trouble if the atomic counters would
> be contended.
Why do you think an atomic operation is more expensive in a contended case?
Let's take the following assumption:
Every time a CPU wants to increment the counter, its cache line is
in someone else's cache.
Case of an atomic operation:
1. Issuing an atomic operation (apparently can be absorbed)
2. Goes down by the OzQ of the L2 (no L1 action)
(assuming the queue is "sufficiently empty")
3. L2 miss
4. Fetching data, obtaining exclusivity - very long
5. L2 is ready - do the atomic operation to the L2 entry - modified state
(min 5 clock cycles for normal L2 access + 5 for the atomic operation)
Case of __get_per_cpu(var)++:
1. Issuing an "ld4" can be absorbed
2. Goes down by the OzQ of the L2 (L1 miss)
(assuming the queue is "sufficiently empty")
3. L2 miss
4. Fetching data, shared state - very long
5. L2 is ready - copy to L1 - into rx
(min 5 clock cycle due to the L2)
6. Increment (1 cycle)
7. Post store to the OzQ of the L2 (L1 updated)
(assuming the queue is "sufficiently empty")
8. L2 hit - obtain exclusivity - moderately long
9. L2 is ready - update it - modified state
(min 5 clock cycles)
Please note
- the additional bus operation (address only) in step 8
- a 3rd CPU can kill our data between steps 5 - 8
(very low probability)
(In case of strong contention, efforts have to be made to avoid cache
line sharing with other frequently used data - as usual...)
> Hmm... What about side effects such as pipeline stalls? fetchadd is a
> semaphore operation. Typically we use acquire semantics for volatiles.
> Here the fetchadd has release semantics.
> If we would use release semantics then the fetchadd would require all
> prior accesses to be complete.
... or ".rel".
Yes, it is a one-direction barrier.
However, if there is not too much stuff in the OzQ, it does not have too much
impact.
(It is very difficult to fill up the OzQ: SW pipe-line, N independent
loads / stores without ";;" - not very frequent in the kernel)
> Acquire semantics may be easier. But the best would be a fetchadd without
> any serialization that would be like the inc/dec memory on i386, which
> does not exist in the IA64 instruction set.
A second thought about the case of __get_per_cpu(var)++:
If it is coded by hand, I can use "ld4.bias" to obtain immediately
the exclusivity - no need for step 8.
However, you cannot avoid the cost of the protection around this
increment.
I still prefer the atomic operations.
Regards,
Zoltan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: light weight counters: race free through local_t?
2006-06-15 16:46 ` Zoltan Menyhart
@ 2006-06-15 18:14 ` Christoph Lameter
2006-06-16 9:14 ` Zoltan Menyhart
0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2006-06-15 18:14 UTC (permalink / raw)
To: Zoltan Menyhart; +Cc: akpm, ak, linux-kernel, linux-ia64
Given what you just said it seems that an atomic op is better than first
reading the cacheline and then incrementing a value in it right? Because
if we first read then we acquire the cacheline in shared mode. Later when
we write to it we have to acquire it again in exclusive mode. The fetchadd
would acquire the cacheline immediately in exclusive mode and thus save
the acquisition in shared mode.
> Yes, it is a one-direction barrier.
> However, if there is not too much stuff in the OzQ, it does not have too much
> impact.
It serializes prior accesses and has a similar effect to an unlock
operation. The processor cannot freely reorder instructions around the
increment. That must have some sort of an impact on performance.
> I still prefer the atomic operations.
Noted. Andi: It seems that we can fall back on ia64 to an atomic
operation without concern for performance.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: light weight counters: race free through local_t?
2006-06-15 18:14 ` Christoph Lameter
@ 2006-06-16 9:14 ` Zoltan Menyhart
0 siblings, 0 replies; 9+ messages in thread
From: Zoltan Menyhart @ 2006-06-16 9:14 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, ak, linux-kernel, linux-ia64
Let's put it in another way:
Do the statistics need to be absolutely precise?
I guess they do not.
By the time you can display them, they have already been changed.
Let's take an example:
zone_statistics(struct zonelist *zonelist, struct zone *z)
{
...
// local_irq_save(flags); // No IRQ lock out
cpu = smp_processor_id(); // Can become another CPU
p = &z->pageset[cpu]; // Can count for someone else
if (pg == orig) {
stat_incr(&z->pageset[cpu].numa_hit); // Unsafe
} else {
...
}
// local_irq_restore(flags);
}
Where "stat_incr()" is arch. dependent and possibly unsafe routine.
For IA64:
// Unsafe statistics
static inline void stat_incr(int *addr){
int tmp;
// Obtain immediately the cache line exclusivity, do not touch L1
asm volatile ("ld4.bias.nt1 %0=[%1]" : "=r"(tmp) : "r" (addr));
tmp++;
asm volatile ("st4 [%1] = %0" :: "r"(tmp), "r"(addr) : "memory");
}
It takes 10 clock cycles.
Regards,
Zoltan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: light weight counters: race free through local_t?
2006-06-15 12:22 ` Zoltan Menyhart
2006-06-15 15:56 ` Christoph Lameter
@ 2006-06-15 16:06 ` Christoph Lameter
1 sibling, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2006-06-15 16:06 UTC (permalink / raw)
To: Zoltan Menyhart; +Cc: akpm, ak, linux-kernel, linux-ia64
Hmm... What about side effects such as pipeline stalls? fetchadd is a
semaphore operation. Typically we use acquire semantics for volatiles.
Here the fetchadd has release semantics.
If we would use release semantics then the fetchadd would require all
prior accesses to be complete.
Acquire semantics may be easier. But the best would be a fetchadd without
any serialization that would be like the inc/dec memory on i386, which
does not exist in the IA64 instruction set.
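As a portable aside, C11 can express exactly the unordered fetch-and-add asked
for here: memory_order_relaxed, i.e. the i386 "incl mem"-like counter that the
ia64 instruction set lacks. A minimal sketch:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_long relaxed_ctr;

static long inc_relaxed(void)
{
	/* no acquire or release ordering at all: the compiler and CPU
	 * may reorder freely around this, which is fine for a
	 * statistics counter */
	return atomic_fetch_add_explicit(&relaxed_ctr, 1,
					 memory_order_relaxed);
}
```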
^ permalink raw reply [flat|nested] 9+ messages in thread