All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
@ 2008-11-12 21:47 Frederic Weisbecker
  2008-11-12 22:15 ` Ingo Molnar
  2008-11-13 14:53 ` Andi Kleen
  0 siblings, 2 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2008-11-12 21:47 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Steven Rostedt, Linux Kernel

Impact: remove spinlocks and irq disabling in function return tracer.

I've tried to figure out all of the race condition that could happen when
the tracer pushes or pops a return address trace to/from the current
thread_info.

Theory:

_ One thread can only execute on one cpu at a time. So this code doesn't need
  to be SMP-safe. Just drop the spinlock.
_ The only race could happen between the current thread and an interrupt. If an
  interrupt is raised, it will increase the index of the return stack storage and
  then execute until the end of the tracing to finally free the index it used.
  We don't need to disable irqs.

This is theorical. In practice, I've tested it with a two-core SMP and had no
problem at all. Perhaps -tip testing could confirm it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/kernel/ftrace.c |   43 +++++--------------------------------------
 1 files changed, 5 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 16a571d..1db0e12 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -44,62 +44,37 @@ void ftrace_nmi_exit(void)
 	atomic_dec(&in_nmi);
 }
 
-/*
- * Synchronize accesses to return adresses stack with
- * interrupts.
- */
-static raw_spinlock_t ret_stack_lock;
-
 /* Add a function return address to the trace stack on thread info.*/
 static int push_return_trace(unsigned long ret, unsigned long long time,
 				unsigned long func)
 {
 	int index;
-	struct thread_info *ti;
-	unsigned long flags;
-	int err = 0;
-
-	raw_local_irq_save(flags);
-	__raw_spin_lock(&ret_stack_lock);
+	struct thread_info *ti = current_thread_info();
 
-	ti = current_thread_info();
 	/* The return trace stack is full */
-	if (ti->curr_ret_stack == FTRACE_RET_STACK_SIZE - 1) {
-		err = -EBUSY;
-		goto out;
-	}
+	if (ti->curr_ret_stack == FTRACE_RET_STACK_SIZE - 1)
+		return -EBUSY;
 
 	index = ++ti->curr_ret_stack;
 	ti->ret_stack[index].ret = ret;
 	ti->ret_stack[index].func = func;
 	ti->ret_stack[index].calltime = time;
 
-out:
-	__raw_spin_unlock(&ret_stack_lock);
-	raw_local_irq_restore(flags);
-	return err;
+	return 0;
 }
 
 /* Retrieve a function return address to the trace stack on thread info.*/
 static void pop_return_trace(unsigned long *ret, unsigned long long *time,
 				unsigned long *func)
 {
-	struct thread_info *ti;
 	int index;
-	unsigned long flags;
-
-	raw_local_irq_save(flags);
-	__raw_spin_lock(&ret_stack_lock);
 
-	ti = current_thread_info();
+	struct thread_info *ti = current_thread_info();
 	index = ti->curr_ret_stack;
 	*ret = ti->ret_stack[index].ret;
 	*func = ti->ret_stack[index].func;
 	*time = ti->ret_stack[index].calltime;
 	ti->curr_ret_stack--;
-
-	__raw_spin_unlock(&ret_stack_lock);
-	raw_local_irq_restore(flags);
 }
 
 /*
@@ -175,14 +150,6 @@ void prepare_ftrace_return(unsigned long *parent, unsigned long self_addr)
 		*parent = old;
 }
 
-static int __init init_ftrace_function_return(void)
-{
-	ret_stack_lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
-	return 0;
-}
-device_initcall(init_ftrace_function_return);
-
-
 #endif
 
 #ifdef CONFIG_DYNAMIC_FTRACE
-- 
1.5.2.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-12 21:47 [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless Frederic Weisbecker
@ 2008-11-12 22:15 ` Ingo Molnar
  2008-11-13  8:50   ` Frédéric Weisbecker
  2008-11-13 14:53 ` Andi Kleen
  1 sibling, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-11-12 22:15 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Steven Rostedt, Linux Kernel


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> Impact: remove spinlocks and irq disabling in function return tracer.
> 
> I've tried to figure out all of the race condition that could happen when
> the tracer pushes or pops a return address trace to/from the current
> thread_info.
> 
> Theory:
> 
> _ One thread can only execute on one cpu at a time. So this code doesn't need
>   to be SMP-safe. Just drop the spinlock.
> _ The only race could happen between the current thread and an interrupt. If an
>   interrupt is raised, it will increase the index of the return stack storage and
>   then execute until the end of the tracing to finally free the index it used.
>   We don't need to disable irqs.
> 
> This is theorical. In practice, I've tested it with a two-core SMP 
> and had no problem at all. Perhaps -tip testing could confirm it.
> 
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>

applied to tip/tracing/function-return-tracer, thanks Frederic!

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-12 22:15 ` Ingo Molnar
@ 2008-11-13  8:50   ` Frédéric Weisbecker
  2008-11-13  8:55     ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Frédéric Weisbecker @ 2008-11-13  8:50 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Steven Rostedt, Linux Kernel

2008/11/12 Ingo Molnar <mingo@elte.hu>:
>
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
>> Impact: remove spinlocks and irq disabling in function return tracer.
>>
>> I've tried to figure out all of the race condition that could happen when
>> the tracer pushes or pops a return address trace to/from the current
>> thread_info.
>>
>> Theory:
>>
>> _ One thread can only execute on one cpu at a time. So this code doesn't need
>>   to be SMP-safe. Just drop the spinlock.
>> _ The only race could happen between the current thread and an interrupt. If an
>>   interrupt is raised, it will increase the index of the return stack storage and
>>   then execute until the end of the tracing to finally free the index it used.
>>   We don't need to disable irqs.
>>
>> This is theorical. In practice, I've tested it with a two-core SMP
>> and had no problem at all. Perhaps -tip testing could confirm it.
>>
>> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>
> applied to tip/tracing/function-return-tracer, thanks Frederic!
>
>        Ingo
>


BTW I'm wondering about consistency in time capturing.
When I look into kernel/sched_clock.c I see this in introduction:

"The clock: sched_clock_cpu() is monotonic per cpu, and should be somewhat
consistent between cpus (never more than 2 jiffies difference)."

Two Jiffies, that could result in a lot of inconsistency in the way of
nanosec capturing. The current task
can be preempted between the call time and the return time and I'm
doing a cpu_clock(raw_smp_processor_id)
on these two times.
Should I keep the same processor_id for these two captures? But what
would happen if this cpu is shut
down between these two times?
One other solution would be to plan time capture in usec but I would
mostly lose the interest of function cost measuring....

What do you think?
Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13  8:50   ` Frédéric Weisbecker
@ 2008-11-13  8:55     ` Ingo Molnar
  2008-11-13  9:16       ` Frédéric Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-11-13  8:55 UTC (permalink / raw)
  To: Frédéric Weisbecker
  Cc: Steven Rostedt, Linux Kernel, Peter Zijlstra


* Frédéric Weisbecker <fweisbec@gmail.com> wrote:

> BTW I'm wondering about consistency in time capturing. When I look 
> into kernel/sched_clock.c I see this in introduction:
> 
> "The clock: sched_clock_cpu() is monotonic per cpu, and should be 
> somewhat consistent between cpus (never more than 2 jiffies 
> difference)."
> 
> Two Jiffies, that could result in a lot of inconsistency in the way 
> of nanosec capturing. The current task can be preempted between the 
> call time and the return time and I'm doing a 
> cpu_clock(raw_smp_processor_id) on these two times. Should I keep 
> the same processor_id for these two captures? But what would happen 
> if this cpu is shut down between these two times? One other solution 
> would be to plan time capture in usec but I would mostly lose the 
> interest of function cost measuring....
> 
> What do you think?

in practice the jitter is much lower - a couple of microseconds - up 
to a few dozen at most.

But it's a possibility, and i think the best solution is something 
that Steve suggested yesterday: a /debug/tracing/trace_options flag 
that turns on global ordering for tracing timestamps. Something like:

  echo global_timestamps > /debug/tracing/trace_options

tracers could also change the default of this flag. The function-cost 
tracer will probably want to default to globally synchronous 
timestamps, while the preempt and irqsoff tracers want to default to 
local timestamps only.

Would something like this work for you?

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13  8:55     ` Ingo Molnar
@ 2008-11-13  9:16       ` Frédéric Weisbecker
  2008-11-13  9:23         ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Frédéric Weisbecker @ 2008-11-13  9:16 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Steven Rostedt, Linux Kernel, Peter Zijlstra

2008/11/13 Ingo Molnar <mingo@elte.hu>:
>
> * Frédéric Weisbecker <fweisbec@gmail.com> wrote:
>
>> BTW I'm wondering about consistency in time capturing. When I look
>> into kernel/sched_clock.c I see this in introduction:
>>
>> "The clock: sched_clock_cpu() is monotonic per cpu, and should be
>> somewhat consistent between cpus (never more than 2 jiffies
>> difference)."
>>
>> Two Jiffies, that could result in a lot of inconsistency in the way
>> of nanosec capturing. The current task can be preempted between the
>> call time and the return time and I'm doing a
>> cpu_clock(raw_smp_processor_id) on these two times. Should I keep
>> the same processor_id for these two captures? But what would happen
>> if this cpu is shut down between these two times? One other solution
>> would be to plan time capture in usec but I would mostly lose the
>> interest of function cost measuring....
>>
>> What do you think?
>
> in practice the jitter is much lower - a couple of microseconds - up
> to a few dozen at most.
>
> But it's a possibility, and i think the best solution is something
> that Steve suggested yesterday: a /debug/tracing/trace_options flag
> that turns on global ordering for tracing timestamps. Something like:
>
>  echo global_timestamps > /debug/tracing/trace_options
>
> tracers could also change the default of this flag. The function-cost
> tracer will probably want to default to globally synchronous
> timestamps, while the preempt and irqsoff tracers want to default to
> local timestamps only.
>
> Would something like this work for you?
>
>        Ingo
>


But I guess this flag would apply on the timestamp inserted by the
ring-buffer. Unfortunately I can't use it
since I have to capture the clock for two times and not only during
insertion in the ring-buffer.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13  9:16       ` Frédéric Weisbecker
@ 2008-11-13  9:23         ` Ingo Molnar
  2008-11-13  9:27           ` Frédéric Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-11-13  9:23 UTC (permalink / raw)
  To: Frédéric Weisbecker
  Cc: Steven Rostedt, Linux Kernel, Peter Zijlstra


* Frédéric Weisbecker <fweisbec@gmail.com> wrote:

> 2008/11/13 Ingo Molnar <mingo@elte.hu>:
> >
> > * Frédéric Weisbecker <fweisbec@gmail.com> wrote:
> >
> >> BTW I'm wondering about consistency in time capturing. When I look
> >> into kernel/sched_clock.c I see this in introduction:
> >>
> >> "The clock: sched_clock_cpu() is monotonic per cpu, and should be
> >> somewhat consistent between cpus (never more than 2 jiffies
> >> difference)."
> >>
> >> Two Jiffies, that could result in a lot of inconsistency in the way
> >> of nanosec capturing. The current task can be preempted between the
> >> call time and the return time and I'm doing a
> >> cpu_clock(raw_smp_processor_id) on these two times. Should I keep
> >> the same processor_id for these two captures? But what would happen
> >> if this cpu is shut down between these two times? One other solution
> >> would be to plan time capture in usec but I would mostly lose the
> >> interest of function cost measuring....
> >>
> >> What do you think?
> >
> > in practice the jitter is much lower - a couple of microseconds - up
> > to a few dozen at most.
> >
> > But it's a possibility, and i think the best solution is something
> > that Steve suggested yesterday: a /debug/tracing/trace_options flag
> > that turns on global ordering for tracing timestamps. Something like:
> >
> >  echo global_timestamps > /debug/tracing/trace_options
> >
> > tracers could also change the default of this flag. The function-cost
> > tracer will probably want to default to globally synchronous
> > timestamps, while the preempt and irqsoff tracers want to default to
> > local timestamps only.
> >
> > Would something like this work for you?
> >
> >        Ingo
> >
> 
> 
> But I guess this flag would apply on the timestamp inserted by the 
> ring-buffer. Unfortunately I can't use it since I have to capture 
> the clock for two times and not only during insertion in the 
> ring-buffer.

i think the clock should be a property of the tracer, not of the ring 
buffer. Hence if a tracer has the option set, it will get coherent 
timestamps - including ringbuffer insertion timestamps.

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13  9:23         ` Ingo Molnar
@ 2008-11-13  9:27           ` Frédéric Weisbecker
  2008-11-13  9:40             ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Frédéric Weisbecker @ 2008-11-13  9:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Steven Rostedt, Linux Kernel, Peter Zijlstra

2008/11/13 Ingo Molnar <mingo@elte.hu>:
>
> * Frédéric Weisbecker <fweisbec@gmail.com> wrote:
>
>> 2008/11/13 Ingo Molnar <mingo@elte.hu>:
>> >
>> > * Frédéric Weisbecker <fweisbec@gmail.com> wrote:
>> >
>> >> BTW I'm wondering about consistency in time capturing. When I look
>> >> into kernel/sched_clock.c I see this in introduction:
>> >>
>> >> "The clock: sched_clock_cpu() is monotonic per cpu, and should be
>> >> somewhat consistent between cpus (never more than 2 jiffies
>> >> difference)."
>> >>
>> >> Two Jiffies, that could result in a lot of inconsistency in the way
>> >> of nanosec capturing. The current task can be preempted between the
>> >> call time and the return time and I'm doing a
>> >> cpu_clock(raw_smp_processor_id) on these two times. Should I keep
>> >> the same processor_id for these two captures? But what would happen
>> >> if this cpu is shut down between these two times? One other solution
>> >> would be to plan time capture in usec but I would mostly lose the
>> >> interest of function cost measuring....
>> >>
>> >> What do you think?
>> >
>> > in practice the jitter is much lower - a couple of microseconds - up
>> > to a few dozen at most.
>> >
>> > But it's a possibility, and i think the best solution is something
>> > that Steve suggested yesterday: a /debug/tracing/trace_options flag
>> > that turns on global ordering for tracing timestamps. Something like:
>> >
>> >  echo global_timestamps > /debug/tracing/trace_options
>> >
>> > tracers could also change the default of this flag. The function-cost
>> > tracer will probably want to default to globally synchronous
>> > timestamps, while the preempt and irqsoff tracers want to default to
>> > local timestamps only.
>> >
>> > Would something like this work for you?
>> >
>> >        Ingo
>> >
>>
>>
>> But I guess this flag would apply on the timestamp inserted by the
>> ring-buffer. Unfortunately I can't use it since I have to capture
>> the clock for two times and not only during insertion in the
>> ring-buffer.
>
> i think the clock should be a property of the tracer, not of the ring
> buffer. Hence if a tracer has the option set, it will get coherent
> timestamps - including ringbuffer insertion timestamps.
>
>        Ingo
>



If so that would be suitable. And for his purpose, the ring-buffer
would propose a function for time snapshot
adapted to the current flags and that rely on sched_clock?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13  9:27           ` Frédéric Weisbecker
@ 2008-11-13  9:40             ` Ingo Molnar
  2008-11-13 12:36               ` Frédéric Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-11-13  9:40 UTC (permalink / raw)
  To: Frédéric Weisbecker
  Cc: Steven Rostedt, Linux Kernel, Peter Zijlstra


* Frédéric Weisbecker <fweisbec@gmail.com> wrote:

> If so that would be suitable. And for his purpose, the ring-buffer 
> would propose a function for time snapshot adapted to the current 
> flags and that rely on sched_clock?

yeah - and it would be all means by a global entity in the beginning - 
i.e. we'd just generalize the code around ring_buffer_time_stamp() to 
listen to the "globally coherent" flag, and allow it to be used for 
callgraph cost measurement code too.

If the "globally coherent" flag is set, then the implementation would 
be something like:

A simple "last global timestamp" value combined with a "last local 
timestamp" value, and the global timestamp is only ever moved forward. 
It is updated via cmpxchg loop. This gives coherency and a monotonic 
clock. The local timestamp would be taken from cpu_clock(cpu), and a 
global timestamp would be constructed out of it. Or something like 
that.

Would that work? [ Would you be interested in sending patches? :-) ]

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13  9:40             ` Ingo Molnar
@ 2008-11-13 12:36               ` Frédéric Weisbecker
  2008-11-13 12:54                 ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Frédéric Weisbecker @ 2008-11-13 12:36 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Steven Rostedt, Linux Kernel, Peter Zijlstra

2008/11/13 Ingo Molnar <mingo@elte.hu>:
> yeah - and it would be all means by a global entity in the beginning -
> i.e. we'd just generalize the code around ring_buffer_time_stamp() to
> listen to the "globally coherent" flag, and allow it to be used for
> callgraph cost measurement code too.
>
> If the "globally coherent" flag is set, then the implementation would
> be something like:
>
> A simple "last global timestamp" value combined with a "last local
> timestamp" value, and the global timestamp is only ever moved forward.
> It is updated via cmpxchg loop. This gives coherency and a monotonic
> clock. The local timestamp would be taken from cpu_clock(cpu), and a
> global timestamp would be constructed out of it. Or something like
> that.
>
> Would that work? [ Would you be interested in sending patches? :-) ]
>
>        Ingo
>


Ok, so correct me if I'm wrong.
Global timestamp would be captured by using sched_clock().
That's what is done currently in ring_buffer_time_stamp()
And the global timestamp would be combination of a last global
timestamp and a relative position from now to this last at
each insertion in the ring-buffer (or tracing time capture). Am I right?
I don't really understand why you want to update with a cmpxchg loop...

And the local timestamp would be built through cpu_clock() with
absolute values at each captures? Because we can't consider
relative values since this is loosing sense between cpu. Unless we
have per_cpu relative values.....

Then, depending of the current_tracer, ring_buffer_time_stamp would
act as a wrapper:
if (current_tracer->time_flag == TIME_GLOBAL_COHERENT)
    return ring_buffer_global_timestamp();
else
    return ring_buffer_local_timestamp();

Or more efficient with a function pointer set at the same we change
the time flag for the current tracer.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 12:36               ` Frédéric Weisbecker
@ 2008-11-13 12:54                 ` Ingo Molnar
  2008-11-13 12:59                   ` Peter Zijlstra
  2008-11-13 17:26                   ` Frédéric Weisbecker
  0 siblings, 2 replies; 27+ messages in thread
From: Ingo Molnar @ 2008-11-13 12:54 UTC (permalink / raw)
  To: Frédéric Weisbecker
  Cc: Steven Rostedt, Linux Kernel, Peter Zijlstra


* Frédéric Weisbecker <fweisbec@gmail.com> wrote:

> Ok, so correct me if I'm wrong. Global timestamp would be captured 
> by using sched_clock(). That's what is done currently in 
> ring_buffer_time_stamp() And the global timestamp would be 
> combination of a last global timestamp and a relative position from 
> now to this last at each insertion in the ring-buffer (or tracing 
> time capture). Am I right? I don't really understand why you want to 
> update with a cmpxchg loop...

the cmpxchg loop would be needed to ensure timestamp monotonicity: 
every new "global time" is cmpxchg-ed with the "previous global time" 
(and is first monotonicity checked).

"prev_global_time" also acts as a global serializer: it ensures that 
events are timestamped in a monotonic and ordered way.

i.e. something like this (pseudocode, without the cmpxchg):

 u64 prev_global_time;

 DEFINE_PER_CPU(prev_local_time);

 u64 global_time()
 {
	u64 now, delta, now_global;

	prev_global = prev_global_time;
	now = sched_clock();
	delta = now - per_cpu(prev_local_time, this_cpu);
	per_cpu(prev_local_time, this_cpu) = now;

	now_global = prev_global + delta;
	prev_global = now_global;

	return now_global;
 }

note how we build "global time" out of "local time".

The cmpxchg would be used to put the above one into a loop, and 
instead of updating the global time in a racy way:

	prev_global = now_global;

We'd update it via the cmpxchg:

	atomic64_t prev_global_time;

	...

	while (atomic64_cmpxchg(&prev_global_time,
				 prev_global, now_global) != prev_global) {
		[...]
	}

To make sure the global time goes monotonic. (this way we also avoid a 
spinlock - locks are fragile for instrumentation)

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 12:54                 ` Ingo Molnar
@ 2008-11-13 12:59                   ` Peter Zijlstra
  2008-11-13 13:02                     ` Ingo Molnar
  2008-11-13 17:26                   ` Frédéric Weisbecker
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2008-11-13 12:59 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Frédéric Weisbecker, Steven Rostedt, Linux Kernel

On Thu, 2008-11-13 at 13:54 +0100, Ingo Molnar wrote:
> * Frédéric Weisbecker <fweisbec@gmail.com> wrote:
> 
> > Ok, so correct me if I'm wrong. Global timestamp would be captured 
> > by using sched_clock(). That's what is done currently in 
> > ring_buffer_time_stamp() And the global timestamp would be 
> > combination of a last global timestamp and a relative position from 
> > now to this last at each insertion in the ring-buffer (or tracing 
> > time capture). Am I right? I don't really understand why you want to 
> > update with a cmpxchg loop...
> 
> the cmpxchg loop would be needed to ensure timestamp monotonicity: 
> every new "global time" is cmpxchg-ed with the "previous global time" 
> (and is first monotonicity checked).
> 
> "prev_global_time" also acts as a global serializer: it ensures that 
> events are timestamped in a monotonic and ordered way.
> 
> i.e. something like this (pseudocode, without the cmpxchg):
> 
>  u64 prev_global_time;
> 
>  DEFINE_PER_CPU(prev_local_time);
> 
>  u64 global_time()
>  {
> 	u64 now, delta, now_global;
> 
> 	prev_global = prev_global_time;
> 	now = sched_clock();
> 	delta = now - per_cpu(prev_local_time, this_cpu);
> 	per_cpu(prev_local_time, this_cpu) = now;
> 
> 	now_global = prev_global + delta;
> 	prev_global = now_global;
> 
> 	return now_global;
>  }
> 
> note how we build "global time" out of "local time".

This goes down shit-creek real fast if the TSC goes funny and jumps fwd
or something.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 12:59                   ` Peter Zijlstra
@ 2008-11-13 13:02                     ` Ingo Molnar
  0 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2008-11-13 13:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frédéric Weisbecker, Steven Rostedt, Linux Kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2008-11-13 at 13:54 +0100, Ingo Molnar wrote:
> > * Frédéric Weisbecker <fweisbec@gmail.com> wrote:
> > 
> > > Ok, so correct me if I'm wrong. Global timestamp would be captured 
> > > by using sched_clock(). That's what is done currently in 
> > > ring_buffer_time_stamp() And the global timestamp would be 
> > > combination of a last global timestamp and a relative position from 
> > > now to this last at each insertion in the ring-buffer (or tracing 
> > > time capture). Am I right? I don't really understand why you want to 
> > > update with a cmpxchg loop...
> > 
> > the cmpxchg loop would be needed to ensure timestamp monotonicity: 
> > every new "global time" is cmpxchg-ed with the "previous global time" 
> > (and is first monotonicity checked).
> > 
> > "prev_global_time" also acts as a global serializer: it ensures that 
> > events are timestamped in a monotonic and ordered way.
> > 
> > i.e. something like this (pseudocode, without the cmpxchg):
> > 
> >  u64 prev_global_time;
> > 
> >  DEFINE_PER_CPU(prev_local_time);
> > 
> >  u64 global_time()
> >  {
> > 	u64 now, delta, now_global;
> > 
> > 	prev_global = prev_global_time;
> > 	now = sched_clock();
> > 	delta = now - per_cpu(prev_local_time, this_cpu);
> > 	per_cpu(prev_local_time, this_cpu) = now;
> > 
> > 	now_global = prev_global + delta;
> > 	prev_global = now_global;
> > 
> > 	return now_global;
> >  }
> > 
> > note how we build "global time" out of "local time".
> 
> This goes down shit-creek real fast if the TSC goes funny and jumps 
> fwd or something.

yes, it was simplified pseudocode: instead of raw sched_clock() it 
should use a more reliable source like cpu_clock(this_cpu).

the important bit is how to build a global clock out of the local 
clock.

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-12 21:47 [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless Frederic Weisbecker
  2008-11-12 22:15 ` Ingo Molnar
@ 2008-11-13 14:53 ` Andi Kleen
  2008-11-13 17:01   ` Frédéric Weisbecker
  1 sibling, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2008-11-13 14:53 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Ingo Molnar, Steven Rostedt, Linux Kernel

Frederic Weisbecker <fweisbec@gmail.com> writes:
O

> _ The only race could happen between the current thread and an interrupt. If an
>   interrupt is raised, it will increase the index of the return stack storage and
>   then execute until the end of the tracing to finally free the index it used.
>   We don't need to disable irqs.
>
> This is theorical. In practice, I've tested it with a two-core SMP and had no
> problem at all. Perhaps -tip testing could confirm it.

The problem I think is that you assume the ++ is atomic against
interrupts, which is not guaranteed by the C compiler. e.g.
it would be perfectly legal for the compiler to generate code like

local register i
i = index;
write to index'ed array using i
                            <--------- interrupt here would overwrite data
...
index = i + 1;

You would need to convert the index access to a "local_add_return()" and 
possibly also add memory barriers.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 14:53 ` Andi Kleen
@ 2008-11-13 17:01   ` Frédéric Weisbecker
  2008-11-13 17:12     ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Frédéric Weisbecker @ 2008-11-13 17:01 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ingo Molnar, Steven Rostedt, Linux Kernel

2008/11/13 Andi Kleen <andi@firstfloor.org>:
> The problem I think is that you assume the ++ is atomic against
> interrupts, which is not guaranteed by the C compiler. e.g.
> it would be perfectly legal for the compiler to generate code like
>
> local register i
> i = index;
> write to index'ed array using i
>                            <--------- interrupt here would overwrite data
> ...
> index = i + 1;


Yes in the common case that would be a danger. But here, if an
interrupt is raised, it will increment
the counter and then decrement it at return time without dropping the
cpu. So after the interrupt, the
value will remain the same...

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 17:01   ` Frédéric Weisbecker
@ 2008-11-13 17:12     ` Andi Kleen
  2008-11-13 17:21       ` Steven Rostedt
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2008-11-13 17:12 UTC (permalink / raw)
  To: Frédéric Weisbecker
  Cc: Andi Kleen, Ingo Molnar, Steven Rostedt, Linux Kernel

> > local register i
> > i = index;
> > write to index'ed array using i
> >                            <--------- interrupt here would overwrite data
> > ...
> > index = i + 1;
> 
> 
> Yes in the common case that would be a danger. But here, if an
> interrupt is raised, it will increment
> the counter and then decrement it at return time without dropping the
> cpu. So after the interrupt, the
> value will remain the same...

The buffer contents will not be necessarily
the same. See the scenario above. The interrupt would use the 
same i as the current function and would overwrite the
partially written entry.

-Andi

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 17:12     ` Andi Kleen
@ 2008-11-13 17:21       ` Steven Rostedt
  2008-11-13 17:34         ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2008-11-13 17:21 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Frédéric Weisbecker, Ingo Molnar, Linux Kernel


On Thu, 13 Nov 2008, Andi Kleen wrote:

> > > local register i
> > > i = index;
> > > write to index'ed array using i
> > >                            <--------- interrupt here would overwrite data
> > > ...
> > > index = i + 1;
> > 
> > 
> > Yes in the common case that would be a danger. But here, if an
> > interrupt is raised, it will increment
> > the counter and then decrement it at return time without dropping the
> > cpu. So after the interrupt, the
> > value will remain the same...
> 
> The buffer contents will not be necessarily
> the same. See the scenario above. The interrupt would use the 
> same i as the current function and would overwrite the
> partially written entry.


So the answer to this is:

  i = index++;
  barrier();
  write to index i (not index);

-- Steve



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 12:54                 ` Ingo Molnar
  2008-11-13 12:59                   ` Peter Zijlstra
@ 2008-11-13 17:26                   ` Frédéric Weisbecker
  2008-11-13 18:57                     ` Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: Frédéric Weisbecker @ 2008-11-13 17:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Steven Rostedt, Linux Kernel, Peter Zijlstra

2008/11/13 Ingo Molnar <mingo@elte.hu>:
> "prev_global_time" also acts as a global serializer: it ensures that
> events are timestamped in a monotonic and ordered way.
>
> i.e. something like this (pseudocode, without the cmpxchg):
>
>  u64 prev_global_time;
>
>  DEFINE_PER_CPU(prev_local_time);
>
>  u64 global_time()
>  {
>        u64 now, delta, now_global;
>
>        prev_global = prev_global_time;
>        now = sched_clock();
>        delta = now - per_cpu(prev_local_time, this_cpu);
>        per_cpu(prev_local_time, this_cpu) = now;
>
>        now_global = prev_global + delta;
>        prev_global = now_global;
>
>        return now_global;
>  }
>
> note how we build "global time" out of "local time".
>
> The cmpxchg would be used to put the above one into a loop, and
> instead of updating the global time in a racy way:
>
>        prev_global = now_global;
>
> We'd update it via the cmpxchg:
>
>        atomic64_t prev_global_time;
>
>        ...
>
>        while (atomic64_cmpxchg(&prev_global_time,
>                                 prev_global, now_global) != prev_global) {
>                [...]
>        }
>
> To make sure the global time goes monotonic. (this way we also avoid a
> spinlock - locks are fragile for instrumentation)

Ok, I understand better.
But consider the following:

 u64 global_time()
 {
       u64 now, delta, now_global;
       prev_global = prev_global_time;

       while (atomic64_cmpxchg(&prev_global_time,
                                 prev_global, now_global) != prev_global) {

           now = sched_clock();
           delta = now - per_cpu(prev_local_time, this_cpu);
           per_cpu(prev_local_time, this_cpu) = now;
           now_global = prev_global + delta;
           prev_global = now_global;
       }
       return now_global;
 }

Sarting with prev_global_time = 0
If we have two cpu and the above function is executed 5 times on the first cpu.
We couldl have per_cpu(prev_local_time) = 50 for example. And so
prev_global_time will be equal to 50.

Just after that, almost at the same time, cpu2 calls global_time()

delta will be equal to 50 (sched_clock() - per_cpu(prev_local_time)
which is 0) and prev_global_time will be 50 + 50 = 100.
This is not consistent.
I don't know where but I'm pretty sure I missed something....

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 17:34         ` Andi Kleen
@ 2008-11-13 17:32           ` Steven Rostedt
  2008-11-13 17:46             ` Frédéric Weisbecker
  2008-11-13 18:29             ` Andi Kleen
  0 siblings, 2 replies; 27+ messages in thread
From: Steven Rostedt @ 2008-11-13 17:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Frédéric Weisbecker, Ingo Molnar, Linux Kernel


On Thu, 13 Nov 2008, Andi Kleen wrote:

> > So the answer to this is:
> > 
> >   i = index++;
> >   barrier();
> >   write to index i (not index);
> 
> That was my first thought when I wrote the original email,
> but the disadvantage is that barrier() is a big hammer
> that flushes everything and can make the code much worse.
> That is why I suggested local_add_return() instead.

barrier() is a compiler barrier, does nothing with the caches, and is 
quite cheap. We only need a compiler barrier because we are only 
protecting ourselves from things that happen on the current CPU. No other 
devices or other CPUs are involved.

-- Steve


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 17:21       ` Steven Rostedt
@ 2008-11-13 17:34         ` Andi Kleen
  2008-11-13 17:32           ` Steven Rostedt
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2008-11-13 17:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Frédéric Weisbecker, Ingo Molnar,
	Linux Kernel

> So the answer to this is:
> 
>   i = index++;
>   barrier();
>   write to index i (not index);

That was my first thought when I wrote the original email,
but the disadvantage is that barrier() is a big hammer
that flushes everything and can make the code much worse.
That is why I suggested local_add_return() instead.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 17:32           ` Steven Rostedt
@ 2008-11-13 17:46             ` Frédéric Weisbecker
  2008-11-13 18:29             ` Andi Kleen
  1 sibling, 0 replies; 27+ messages in thread
From: Frédéric Weisbecker @ 2008-11-13 17:46 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Andi Kleen, Ingo Molnar, Linux Kernel

2008/11/13 Steven Rostedt <rostedt@goodmis.org>:
>
> On Thu, 13 Nov 2008, Andi Kleen wrote:
>
>> > So the answer to this is:
>> >
>> >   i = index++;
>> >   barrier();
>> >   write to index i (not index);
>>
>> That was my first thought when I wrote the original email,
>> but the disadvantage is that barrier() is a big hammer
>> that flushes everything and can make the code much worse.
>> That is why I suggested local_add_return() instead.
>
> barrier() is a compiler barrier, does nothing with the caches, and is
> quite cheap. We only need a compiler barrier because we are only
> protecting ourselves from things that happen on the current CPU. No other
> devices or other CPUs are involved.


Oh I see the issue now. The value of the index could have been
incremented in a register and not yet
in the memory...
So yes, a barrier() to make these operations flushed in memory before
using the index.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 17:32           ` Steven Rostedt
  2008-11-13 17:46             ` Frédéric Weisbecker
@ 2008-11-13 18:29             ` Andi Kleen
  2008-11-13 18:35               ` Steven Rostedt
  1 sibling, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2008-11-13 18:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Frédéric Weisbecker, Ingo Molnar,
	Linux Kernel

On Thu, Nov 13, 2008 at 12:32:23PM -0500, Steven Rostedt wrote:
> 
> On Thu, 13 Nov 2008, Andi Kleen wrote:
> 
> > > So the answer to this is:
> > > 
> > >   i = index++;
> > >   barrier();
> > >   write to index i (not index);
> > 
> > That was my first thought when I wrote the original email,
> > but the disadvantage is that barrier() is a big hammer
> > that flushes everything and can make the code much worse.
> > That is why I suggested local_add_return() instead.
> 
> barrier() is a compiler barrier, does nothing with the caches, and is 
> quite cheap. We only need a compiler barrier because we are only 

I did not refer to CPU caches, but the compiler's register allocation
[ok if you want the registers are the "level 0 cache"]. A memory barrier
all messes it up. That is why it is better to only clobber specific
memory regions, which is what local_* does.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 18:29             ` Andi Kleen
@ 2008-11-13 18:35               ` Steven Rostedt
  2008-11-13 18:55                 ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2008-11-13 18:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Frédéric Weisbecker, Ingo Molnar, Linux Kernel


On Thu, 13 Nov 2008, Andi Kleen wrote:

> On Thu, Nov 13, 2008 at 12:32:23PM -0500, Steven Rostedt wrote:
> > 
> > On Thu, 13 Nov 2008, Andi Kleen wrote:
> > 
> > > > So the answer to this is:
> > > > 
> > > >   i = index++;
> > > >   barrier();
> > > >   write to index i (not index);
> > > 
> > > That was my first thought when I wrote the original email,
> > > but the disadvantage is that barrier() is a big hammer
> > > that flushes everything and can make the code much worse.
> > > That is why I suggested local_add_return() instead.
> > 
> > barrier() is a compiler barrier, does nothing with the caches, and is 
> > quite cheap. We only need a compiler barrier because we are only 
> 
> I did not refer to CPU caches, but the compiler's register allocation
> [ok if you want the registers are the "level 0 cache"]. A memory barrier
> all messes it up. That is why it is better to only clobber specific
> memory regions, which is what local_* does.
> 


#define barrier() __asm__ __volatile__("": : :"memory")

static inline long local_add_return(long i, local_t *l)
{
	long __i;
#ifdef CONFIG_M386
	unsigned long flags;
	if (unlikely(boot_cpu_data.x86 <= 3))
		goto no_xadd;
#endif
	/* Modern 486+ processor */
	__i = i;
	asm volatile(_ASM_XADD "%0, %1;"
		     : "+r" (i), "+m" (l->a.counter)
		     : : "memory");
	return i + __i;

#ifdef CONFIG_M386
no_xadd: /* Legacy 386 processor */
	local_irq_save(flags);
	__i = local_read(l);
	local_set(l, i + __i);
	local_irq_restore(flags);
	return i + __i;
#endif
}



Now tell me again how local_* is more efficient than barrier?

Not to mention, if this is ever used on other archs with load-linked and 
store-conditional, it gets even worse.

-- Steve


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 18:35               ` Steven Rostedt
@ 2008-11-13 18:55                 ` Andi Kleen
  2008-11-13 19:03                   ` Steven Rostedt
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2008-11-13 18:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Frédéric Weisbecker, Ingo Molnar,
	Linux Kernel

> Now tell me again how local_* is more efficient than barrier?

Look at the generated assembler, not the source. The M386 check is two instructions
with the i386 stuff just moved away somewhere out of line that is
about never taken. The rest is short and straight forward in asm too.
And a lot of kernels are not compiled with 386 support anyways.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 17:26                   ` Frédéric Weisbecker
@ 2008-11-13 18:57                     ` Ingo Molnar
  2008-11-13 20:06                       ` Frédéric Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2008-11-13 18:57 UTC (permalink / raw)
  To: Frédéric Weisbecker
  Cc: Steven Rostedt, Linux Kernel, Peter Zijlstra


* Frédéric Weisbecker <fweisbec@gmail.com> wrote:

> 2008/11/13 Ingo Molnar <mingo@elte.hu>:
> > "prev_global_time" also acts as a global serializer: it ensures that
> > events are timestamped in a monotonic and ordered way.
> >
> > i.e. something like this (pseudocode, without the cmpxchg):
> >
> >  u64 prev_global_time;
> >
> >  DEFINE_PER_CPU(prev_local_time);
> >
> >  u64 global_time()
> >  {
> >        u64 now, delta, now_global;
> >
> >        prev_global = prev_global_time;
> >        now = sched_clock();
> >        delta = now - per_cpu(prev_local_time, this_cpu);
> >        per_cpu(prev_local_time, this_cpu) = now;
> >
> >        now_global = prev_global + delta;
> >        prev_global = now_global;
> >
> >        return now_global;
> >  }
> >
> > note how we build "global time" out of "local time".
> >
> > The cmpxchg would be used to put the above one into a loop, and
> > instead of updating the global time in a racy way:
> >
> >        prev_global = now_global;
> >
> > We'd update it via the cmpxchg:
> >
> >        atomic64_t prev_global_time;
> >
> >        ...
> >
> >        while (atomic64_cmpxchg(&prev_global_time,
> >                                 prev_global, now_global) != prev_global) {
> >                [...]
> >        }
> >
> > To make sure the global time goes monotonic. (this way we also avoid a
> > spinlock - locks are fragile for instrumentation)
> 
> Ok, I understand better.
> But consider the following:
> 
>  u64 global_time()
>  {
>        u64 now, delta, now_global;
>        prev_global = prev_global_time;
> 
>        while (atomic64_cmpxchg(&prev_global_time,
>                                  prev_global, now_global) != prev_global) {
> 
>            now = sched_clock();
>            delta = now - per_cpu(prev_local_time, this_cpu);
>            per_cpu(prev_local_time, this_cpu) = now;
>            now_global = prev_global + delta;
>            prev_global = now_global;
>        }
>        return now_global;
>  }
> 
> Sarting with prev_global_time = 0 If we have two cpu and the above 
> function is executed 5 times on the first cpu. We couldl have 
> per_cpu(prev_local_time) = 50 for example. And so prev_global_time 
> will be equal to 50.
> 
> Just after that, almost at the same time, cpu2 calls global_time()
> 
> delta will be equal to 50 (sched_clock() - per_cpu(prev_local_time) 
> which is 0) and prev_global_time will be 50 + 50 = 100. This is not 
> consistent. I don't know where but I'm pretty sure I missed 
> something....

you are right - it needs a bit more logic.

I think the simplest would be something like this:

 atomic64_t global_clock = INIT_ATOMIC64(0);

 u64 global_time()
 {
	u64 now, delta, now_global, prev_global;

	do {
		prev_global = atomic64_read(&global_clock);
		now = cpu_clock(raw_smp_processor_id());

		if ((s64)(now - prev_global) < 0) {
			now = prev_global;
			break;
		}
	} while (atomic64_cmpxchg(&global_clock,
				   prev_global, now) != prev_global);
 
	return now;
 }

This is the simplest way of implementing monotonic time: we only allow 
global_clock to go forwards. If all cpu_clock()s are perfectly in 
sync, we've got no problem: then "now - prev_global" will never be 
negative and we can return the local clock as the latest global time.

But if one of the CPU clocks is "behind", the function returns the 
latest global time up until the local clock catches up. Time wont be 
allowed to jump around by going back. If the clock is behind for a 
long time, then we get a lot of timestamps with the same value - that 
will be very visible in the trace and we'll then work in improving the 
cpu_clock() implementation.

So i think we could start with this simplest approach, and see how 
often we get the same timestamp for a long time (indication of the 
clocks being not perfectly in sync).

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 18:55                 ` Andi Kleen
@ 2008-11-13 19:03                   ` Steven Rostedt
  2008-11-13 19:19                     ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2008-11-13 19:03 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Frédéric Weisbecker, Ingo Molnar, Linux Kernel


On Thu, 13 Nov 2008, Andi Kleen wrote:
>>> 
>>> I did not refer to CPU caches, but the compiler's register allocation
>>> [ok if you want the registers are the "level 0 cache"]. A memory barrier
>>> all messes it up. That is why it is better to only clobber specific
>>> memory regions, which is what local_* does.
>>> 

>> Now tell me again how local_* is more efficient than barrier?
> 
> Look at the generated assembler, not the source. The M386 check is two instructions
> with the i386 stuff just moved away somewhere out of line that is
> about never taken. The rest is short and straight forward in asm too.
> And a lot of kernels are not compiled with 386 support anyways.
> 

OK, lets look at just the asm.

barrier:
	__asm__ __volatile__("": : :"memory")

local_add_return:
	asm volatile(_ASM_XADD "%0, %1;"
		     : "+r" (i), "+m" (l->a.counter)
		     : : "memory");


Your argument was that barrier clobbers memory, but it looks like the 
local_add_return does the same.

Lets go look at the code that we are talking about:

 	index = ++ti->curr_ret_stack;
 	ti->ret_stack[index].ret = ret;
 	ti->ret_stack[index].func = func;
 	ti->ret_stack[index].calltime = time;

Which as you stated was incorrect, and it is. But you suggested that we 
should switch to using local_add_return which would cause us to change a 
lot of code to handle the type change. When this is ported to other archs, 
they will copy that too, and then be using a lesser efficient algorithm.

All that is needed is to change the pointer to always point to the next 
item, and then do:

        index = ti->curr_ret_stack++;
	barrier();
        ti->ret_stack[index].ret = ret;
        ti->ret_stack[index].func = func;
        ti->ret_stack[index].calltime = time;

I don't see any major gain in switching to local_add_return.

-- Steve

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 19:03                   ` Steven Rostedt
@ 2008-11-13 19:19                     ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2008-11-13 19:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Frédéric Weisbecker, Ingo Molnar,
	Linux Kernel

> barrier:
> 	__asm__ __volatile__("": : :"memory")
> 
> local_add_return:
> 	asm volatile(_ASM_XADD "%0, %1;"
> 		     : "+r" (i), "+m" (l->a.counter)
> 		     : : "memory");
> 
> 
> Your argument was that barrier clobbers memory, but it looks like the 
> local_add_return does the same.

That's a fair point. I'm not sure why the local_add_* has a memory
barrier at all, imho it shouldn't need it. I bet it's just
a hold over from the atomic code this was based one.

> I don't see any major gain in switching to local_add_return.

That's true with the current implementation. It would
be a good idea to consider dropping that explicit 
barrier though, then it would be likely a slight win.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless
  2008-11-13 18:57                     ` Ingo Molnar
@ 2008-11-13 20:06                       ` Frédéric Weisbecker
  0 siblings, 0 replies; 27+ messages in thread
From: Frédéric Weisbecker @ 2008-11-13 20:06 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Steven Rostedt, Linux Kernel, Peter Zijlstra

2008/11/13 Ingo Molnar <mingo@elte.hu>:
> you are right - it needs a bit more logic.
>
> I think the simplest would be something like this:
>
>  atomic64_t global_clock = INIT_ATOMIC64(0);
>
>  u64 global_time()
>  {
>        u64 now, delta, now_global, prev_global;
>
>        do {
>                prev_global = atomic64_read(&global_clock);
>                now = cpu_clock(raw_smp_processor_id());
>
>                if ((s64)(now - prev_global) < 0) {
>                        now = prev_global;
>                        break;
>                }
>        } while (atomic64_cmpxchg(&global_clock,
>                                   prev_global, now) != prev_global);
>
>        return now;
>  }
>
> This is the simplest way of implementing monotonic time: we only allow
> global_clock to go forwards. If all cpu_clock()s are perfectly in
> sync, we've got no problem: then "now - prev_global" will never be
> negative and we can return the local clock as the latest global time.
>
> But if one of the CPU clocks is "behind", the function returns the
> latest global time up until the local clock catches up. Time wont be
> allowed to jump around by going back. If the clock is behind for a
> long time, then we get a lot of timestamps with the same value -


Ok! I understand now this approach.
So, if global ordering flag is set, we return this kind of protected value,
and on the opposite, we return the normal cpu_clock() local value.

> Would that work? [ Would you be interested in sending patches? :-) ]

Yes :-)

> that
> will be very visible in the trace and we'll then work in improving the
> cpu_clock() implementation.
>
> So i think we could start with this simplest approach, and see how
> often we get the same timestamp for a long time (indication of the
> clocks being not perfectly in sync).


Ok, good idea.
Thanks for the explanations!

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2008-11-13 20:06 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-11-12 21:47 [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless Frederic Weisbecker
2008-11-12 22:15 ` Ingo Molnar
2008-11-13  8:50   ` Frédéric Weisbecker
2008-11-13  8:55     ` Ingo Molnar
2008-11-13  9:16       ` Frédéric Weisbecker
2008-11-13  9:23         ` Ingo Molnar
2008-11-13  9:27           ` Frédéric Weisbecker
2008-11-13  9:40             ` Ingo Molnar
2008-11-13 12:36               ` Frédéric Weisbecker
2008-11-13 12:54                 ` Ingo Molnar
2008-11-13 12:59                   ` Peter Zijlstra
2008-11-13 13:02                     ` Ingo Molnar
2008-11-13 17:26                   ` Frédéric Weisbecker
2008-11-13 18:57                     ` Ingo Molnar
2008-11-13 20:06                       ` Frédéric Weisbecker
2008-11-13 14:53 ` Andi Kleen
2008-11-13 17:01   ` Frédéric Weisbecker
2008-11-13 17:12     ` Andi Kleen
2008-11-13 17:21       ` Steven Rostedt
2008-11-13 17:34         ` Andi Kleen
2008-11-13 17:32           ` Steven Rostedt
2008-11-13 17:46             ` Frédéric Weisbecker
2008-11-13 18:29             ` Andi Kleen
2008-11-13 18:35               ` Steven Rostedt
2008-11-13 18:55                 ` Andi Kleen
2008-11-13 19:03                   ` Steven Rostedt
2008-11-13 19:19                     ` Andi Kleen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.