[RFC] PyTimechart

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] PyTimechart
@ 2010-05-11 21:10 Pierre Tardy
  2010-05-11 21:36 ` Frederic Weisbecker
  0 siblings, 1 reply; 34+ messages in thread
From: Pierre Tardy @ 2010-05-11 21:10 UTC (permalink / raw)
  To: linux-kernel, rostedt, mathieu.desnoyers, arjan, ziga.mahkovec

Hello,

PyTimechart is another implementation of two very useful tools
available for the linux community:
perf-timechart ( http://blog.fenrus.org/?p=5 ) and bootchart (
http://www.bootchart.org/ )

The two tools share a common idea of making their output to SVG files.
While it is a very good idea for small traces, the generated SVG can
be very heavy, and turns out to be good stress tests for inkscape
developers...

PyTimechart is a tool that parses ftrace text traces, and display them
with the help of a very powerful dynamic plot framework, Chaco (
http://code.enthought.com/chaco/ )
The GUI makes the best it can to ease the browsing of huge traces.

The best is to look at those two 30s screencasts, to figure out how that work.

a look at mplayer startup:
http://tardyp.free.fr/pytimechart/mplayer_start.mp4
a look at ubuntu boot:
http://tardyp.free.fr/pytimechart/boot.mp4

This tool still is in the state of a prototype, I dont know if it
worth to continue to improve it or to implement it ideas in LTTV or
Kernel Shark.
It is actually very useful in its current form (
http://elinux.org/images/0/07/Effect_of_wakeups_on_Moorestown_power.pdf
), and will work without recompiling the kernel of recent distros.

the source code with build instruction is located here
http://gitorious.org/pytimechart

What do you think?

Regards,
-- 
Pierre

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] PyTimechart
  2010-05-11 21:10 [RFC] PyTimechart Pierre Tardy
@ 2010-05-11 21:36 ` Frederic Weisbecker
  2010-05-12 13:37   ` Pierre Tardy
  0 siblings, 1 reply; 34+ messages in thread
From: Frederic Weisbecker @ 2010-05-11 21:36 UTC (permalink / raw)
  To: Pierre Tardy, Ingo Molnar, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Tom Zanussi, Paul Mackerras
  Cc: linux-kernel, rostedt, mathieu.desnoyers, arjan, ziga.mahkovec

On Tue, May 11, 2010 at 11:10:36PM +0200, Pierre Tardy wrote:
> Hello,
> 
> PyTimechart is another implementation of two very useful tools
> available for the linux community:
> perf-timechart ( http://blog.fenrus.org/?p=5 ) and bootchart (
> http://www.bootchart.org/ )
> 
> The two tools share a common idea of making their output to SVG files.
> While it is a very good idea for small traces, the generated SVG can
> be very heavy, and turns out to be good stress tests for inkscape
> developers...
> 
> PyTimechart is a tool that parses ftrace text traces, and display them
> with the help of a very powerful dynamic plot framework, Chaco (
> http://code.enthought.com/chaco/ )
> The GUI makes the best it can to ease the browsing of huge traces.
> 
> The best is to look at those two 30s screencasts, to figure out how that work.
> 
> a look at mplayer startup:
> http://tardyp.free.fr/pytimechart/mplayer_start.mp4
> a look at ubuntu boot:
> http://tardyp.free.fr/pytimechart/boot.mp4
> 
> This tool still is in the state of a prototype, I dont know if it
> worth to continue to improve it or to implement it ideas in LTTV or
> Kernel Shark.
> It is actually very useful in its current form (
> http://elinux.org/images/0/07/Effect_of_wakeups_on_Moorestown_power.pdf
> ), and will work without recompiling the kernel of recent distros.
> 
> the source code with build instruction is located here
> http://gitorious.org/pytimechart
> 
> What do you think?


I think I've been dreaming about this several times.
I've often used perf timechart lately and I'm really frustrated by
its inherent slowness due to its huge svg files. It's barely usable
with a small trace on two cpus, and it's impossible to go further,
which is really a pity for such a powerful tool.

IMO, this GUI is exactly what we want.

If you could plug it to the perf scripting facilities, I would
be very happy to see it merged in perf tools.

Plugging to the scripting API is really easy, run:

	$ perf timechart record
	$ sudo ./perf trace -g python
	generated Python script: perf-trace.py

Now look at ./perf-trace.py, you'll find all the necessary hooks
generated with a default behaviour (displaying traces), which
you can observe by launching:

	$ perf trace -s perf-trace.py

You just need to change it to plug your app on it.

(Adding more Cc).


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] PyTimechart
  2010-05-11 21:36 ` Frederic Weisbecker
@ 2010-05-12 13:37   ` Pierre Tardy
  2010-05-12 14:48     ` Frederic Weisbecker
  0 siblings, 1 reply; 34+ messages in thread
From: Pierre Tardy @ 2010-05-12 13:37 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Tom Zanussi, Paul Mackerras, linux-kernel, rostedt,
	mathieu.desnoyers, arjan, ziga.mahkovec

> Plugging to the scripting API is really easy, run:
>
>        $ perf timechart record
>        $ sudo ./perf trace -g python
>        generated Python script: perf-trace.py

This is something that I want to try since I saw the perf scripting
patch, thanks for the recipe, I'll look at it in the next few days.

The next problem is that perf timechart record is frozen on the number
of event it records.
Pytimechart is able to display irq:* and workqueues:* events, as well
as trace_printks ( I dont know if perf is able to dump those )

I use trace_printk to mark the big traces with events I'm trying to debug.
Being able to look at ISR and workqueues is very useful, and there is
a special feature that warns when an ISR lasts more than 1ms.

For inclusion in mainline, this might take more time. The quality of
some portion of the code is far from being linux kernel's standards
:-}

Regards,
Pierre

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] PyTimechart
  2010-05-12 13:37   ` Pierre Tardy
@ 2010-05-12 14:48     ` Frederic Weisbecker
  2010-05-12 15:36       ` Steven Rostedt
  2010-05-12 17:13       ` Pierre Tardy
  0 siblings, 2 replies; 34+ messages in thread
From: Frederic Weisbecker @ 2010-05-12 14:48 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Tom Zanussi, Paul Mackerras, linux-kernel, rostedt,
	mathieu.desnoyers, arjan, ziga.mahkovec

On Wed, May 12, 2010 at 03:37:27PM +0200, Pierre Tardy wrote:
> > Plugging to the scripting API is really easy, run:
> >
> >        $ perf timechart record
> >        $ sudo ./perf trace -g python
> >        generated Python script: perf-trace.py
> 
> This is something that I want to try since I saw the perf scripting
> patch, thanks for the recipe, I'll look at it in the next few days.
> 
> The next problem is that perf timechart record is frozen on the number
> of event it records.



What do you mean? That it can't do incremental records or so?
It's supported by perf record, may be not by perf timechart but
that would be a matter of a little patch.


> Pytimechart is able to display irq:* and workqueues:* events, as well
> as trace_printks ( I dont know if perf is able to dump those )
> 
> I use trace_printk to mark the big traces with events I'm trying to debug.
> Being able to look at ISR and workqueues is very useful, and there is
> a special feature that warns when an ISR lasts more than 1ms.


No problem with the irq and workqueues, if that's really useful, we
can change perf timechart to handle this.

But we don't yet support trace_printk in perf. May be we could wrap
them in trace events.

Why not starting with a release that does the same things
than timechart and trying to integrate the rest incrementally?


> For inclusion in mainline, this might take more time. The quality of
> some portion of the code is far from being linux kernel's standards
> :-}


The code I'm looking at here doesn't look dirty to me at a glance:
http://gitorious.org/pytimechart/pytimechart/trees/master


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] PyTimechart
  2010-05-12 14:48     ` Frederic Weisbecker
@ 2010-05-12 15:36       ` Steven Rostedt
  2010-05-12 16:46         ` Perf and ftrace [was Re: PyTimechart] Frederic Weisbecker
  2010-05-12 16:59         ` [RFC] PyTimechart Ingo Molnar
  2010-05-12 17:13       ` Pierre Tardy
  1 sibling, 2 replies; 34+ messages in thread
From: Steven Rostedt @ 2010-05-12 15:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Pierre Tardy, Ingo Molnar, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Tom Zanussi, Paul Mackerras, linux-kernel,
	mathieu.desnoyers, arjan, ziga.mahkovec

On Wed, 2010-05-12 at 16:48 +0200, Frederic Weisbecker wrote:
> On Wed, May 12, 2010 at 03:37:27PM +0200, Pierre Tardy wrote:

> But we don't yet support trace_printk in perf. May be we could wrap
> them in trace events.

Hmm, do we really want to do that?

We really need to get the perf and ftrace trace buffers combined. I
understand why perf chose to do the mmap buffers for the counting, but
for live streaming, it is very inefficient compared to splice.

I would hate to add more duplicate code to have perf support
trace_printk().

-- Steve




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Perf and ftrace [was Re: PyTimechart]
  2010-05-12 15:36       ` Steven Rostedt
@ 2010-05-12 16:46         ` Frederic Weisbecker
  2010-05-12 17:00           ` Peter Zijlstra
  2010-05-12 16:59         ` [RFC] PyTimechart Ingo Molnar
  1 sibling, 1 reply; 34+ messages in thread
From: Frederic Weisbecker @ 2010-05-12 16:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Pierre Tardy, Ingo Molnar, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Tom Zanussi, Paul Mackerras, linux-kernel,
	mathieu.desnoyers, arjan, ziga.mahkovec

On Wed, May 12, 2010 at 11:36:36AM -0400, Steven Rostedt wrote:
> On Wed, 2010-05-12 at 16:48 +0200, Frederic Weisbecker wrote:
> > On Wed, May 12, 2010 at 03:37:27PM +0200, Pierre Tardy wrote:
> 
> > But we don't yet support trace_printk in perf. May be we could wrap
> > them in trace events.
> 
> Hmm, do we really want to do that?
> 
> We really need to get the perf and ftrace trace buffers combined. I
> understand why perf chose to do the mmap buffers for the counting

I don't think that's the reason. I mean that's the reason for
every perf tools that live record and analyse events as they come
(perf top, perf stat).

But there is no strong reason for perf record not to use splice,
a part the fact that perf doesn't support splice.

> but
> for live streaming, it is very inefficient compared to splice.

Yeah, totally agreed.

I'm looking forward the day we'll have a ring buffer that can be
either lockless per-cpu or support contention, and that can be
spliced, mmap'ed and read, and that supports overwriting mode.
So that we can unify all this mess between perf and ftrace.

But note splice is only part of the problem, eventually not
the biggest one for now (but it is one important):

perf starts to show its weaknesses now that we are playing with
lock events (by nature high freq events).
This is mostly due to the fact we are doing a round pass on all
per cpu mmap'ed buffers. The time you handle an event buffer, you've
already lost a lot of events from another one.

trace-cmd is certainly much more efficient in this regard (one thread
per cpu splicing one file per cpu), atlhough less convenient for
cross analysis as you need to handle several files.

perf record works well with every events but lock ones.

I plan to try something like a perf multiplex: one thread per
cpu that reads the mmap'ed buffers and write in its own file,
and in the end you gather the whole in a single one.

This will solve the first and problem: this scheme will probably catch up
with 80% of trace-cmd efficiency, until we get a true splice support.

In fact, I hope trace-cmd will come to be merged in tools/, I'm not
worried anymore about having two different tools that do the same
things wrt tracing, because I think they will eventually get
merged together step by step: the format parsing API, kernelshark,
sched/lock/timechart/kmem/etc... tools.

And sharing the same buffer will probably announce the final merge
between both, with a single and strong tracing tool set.

> I would hate to add more duplicate code to have perf support
> trace_printk().

No, having trace_printk() implemented on top on trace events is
a win on both sides: we can toggle their activation, filter, have
their format, etc...

The duplication would only reside in the tracing callback, and
as a temporary thing like the others until we finally have this
common buffer.

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] PyTimechart
  2010-05-12 15:36       ` Steven Rostedt
  2010-05-12 16:46         ` Perf and ftrace [was Re: PyTimechart] Frederic Weisbecker
@ 2010-05-12 16:59         ` Ingo Molnar
  2010-05-12 17:15           ` Steven Rostedt
  1 sibling, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2010-05-12 16:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frederic Weisbecker, Pierre Tardy, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Tom Zanussi, Paul Mackerras, linux-kernel,
	mathieu.desnoyers, arjan, ziga.mahkovec


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 2010-05-12 at 16:48 +0200, Frederic Weisbecker wrote:
> > On Wed, May 12, 2010 at 03:37:27PM +0200, Pierre Tardy wrote:
> 
> > But we don't yet support trace_printk in perf. May be we could wrap
> > them in trace events.
> 
> Hmm, do we really want to do that?
> 
> We really need to get the perf and ftrace trace buffers combined. I 
> understand why perf chose to do the mmap buffers for the counting, but for 
> live streaming, it is very inefficient compared to splice.

The thing is that for a very long time ftrace didnt have splice support and 
survived just fine. Even today most of the ftrace usage isnt utilizing splice.

Yes, splice might help in some situations but on average it's an independent 
speedup on the order of magnitude of a few percents, not a 'must have' item.

So please keep these issues separate.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 16:46         ` Perf and ftrace [was Re: PyTimechart] Frederic Weisbecker
@ 2010-05-12 17:00           ` Peter Zijlstra
  2010-05-12 17:07             ` Mathieu Desnoyers
  2010-05-12 17:11             ` Ingo Molnar
  0 siblings, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2010-05-12 17:00 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Steven Rostedt, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, mathieu.desnoyers, arjan, ziga.mahkovec, davem

On Wed, 2010-05-12 at 18:46 +0200, Frederic Weisbecker wrote:
> But there is no strong reason for perf record not to use splice,
> a part the fact that perf doesn't support splice.
> 
Its mostly an interface/api question. You cannot easily splice() a
mmap()'ed buffer on machines that have address constraints like sparc.

The thing I was thinking about is adding a new syscall that creates a
single buffer of specified size and provides a fd. Then use
PERF_EVENT_IOC_SET_OUTPUT, to connect an event to that fd/buffer and use
splice() on that fd.

It could reuse most of the perf buffer code, but simply not map it into
userspace and therefore not have the restriction on the vaddr.

Once you have that, a .splice_read implementation shouldn't be too hard.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 17:00           ` Peter Zijlstra
@ 2010-05-12 17:07             ` Mathieu Desnoyers
  2010-05-12 17:47               ` Peter Zijlstra
  2010-05-12 17:11             ` Ingo Molnar
  1 sibling, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2010-05-12 17:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Steven Rostedt, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2010-05-12 at 18:46 +0200, Frederic Weisbecker wrote:
> > But there is no strong reason for perf record not to use splice,
> > a part the fact that perf doesn't support splice.
> > 
> Its mostly an interface/api question. You cannot easily splice() a
> mmap()'ed buffer on machines that have address constraints like sparc.

Ah ? Can you explain this issue a bit more ? There is possibly a concern I don't
quite see here.

Thanks,

Mathieu

> 
> The thing I was thinking about is adding a new syscall that creates a
> single buffer of specified size and provides a fd. Then use
> PERF_EVENT_IOC_SET_OUTPUT, to connect an event to that fd/buffer and use
> splice() on that fd.
> 
> It could reuse most of the perf buffer code, but simply not map it into
> userspace and therefore not have the restriction on the vaddr.
> 
> Once you have that, a .splice_read implementation shouldn't be too hard.
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 17:00           ` Peter Zijlstra
  2010-05-12 17:07             ` Mathieu Desnoyers
@ 2010-05-12 17:11             ` Ingo Molnar
  1 sibling, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2010-05-12 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Steven Rostedt, Pierre Tardy,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, mathieu.desnoyers, arjan, ziga.mahkovec, davem


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, 2010-05-12 at 18:46 +0200, Frederic Weisbecker wrote:
> > But there is no strong reason for perf record not to use splice,
> > a part the fact that perf doesn't support splice.
> 
> Its mostly an interface/api question. You cannot easily splice() a mmap()'ed 
> buffer on machines that have address constraints like sparc.
> 
> The thing I was thinking about is adding a new syscall that creates a single 
> buffer of specified size and provides a fd. Then use 
> PERF_EVENT_IOC_SET_OUTPUT, to connect an event to that fd/buffer and use 
> splice() on that fd.
> 
> It could reuse most of the perf buffer code, but simply not map it into 
> userspace and therefore not have the restriction on the vaddr.
> 
> Once you have that, a .splice_read implementation shouldn't be too hard.

That would be a really cool approach. Since most of the performance-sensitive 
streaming happens in the likes of perf record, the actual interface can be 
enhanced with no effect to the end user.

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] PyTimechart
  2010-05-12 14:48     ` Frederic Weisbecker
  2010-05-12 15:36       ` Steven Rostedt
@ 2010-05-12 17:13       ` Pierre Tardy
  2010-05-13 15:02         ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 34+ messages in thread
From: Pierre Tardy @ 2010-05-12 17:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Tom Zanussi, Paul Mackerras, linux-kernel, rostedt,
	mathieu.desnoyers, arjan, ziga.mahkovec

On Wed, May 12, 2010 at 4:48 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Wed, May 12, 2010 at 03:37:27PM +0200, Pierre Tardy wrote:
>> > Plugging to the scripting API is really easy, run:
>> >
>> >        $ perf timechart record
>> >        $ sudo ./perf trace -g python
>> >        generated Python script: perf-trace.py
>>
>> This is something that I want to try since I saw the perf scripting
>> patch, thanks for the recipe, I'll look at it in the next few days.
>>
>> The next problem is that perf timechart record is frozen on the number
>> of event it records.
>
>
>
> What do you mean? That it can't do incremental records or so?
> It's supported by perf record, may be not by perf timechart but
> that would be a matter of a little patch.

I mean that perf timechart record is implemented as an hardcoded perf
record call.
Typically, user wants to add is own events for pytimechart to display
them in a generic manner.
I agree this is a trivial change.


>
> Why not starting with a release that does the same things
> than timechart and trying to integrate the rest incrementally?
Yes, I'll see what I can do.

> The code I'm looking at here doesn't look dirty to me at a glance:
> http://gitorious.org/pytimechart/pytimechart/trees/master
That's because you did not read between the lines ;-)
The display core needs to be simplified.


Regards,
Pierre

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] PyTimechart
  2010-05-12 16:59         ` [RFC] PyTimechart Ingo Molnar
@ 2010-05-12 17:15           ` Steven Rostedt
  2010-05-12 18:23             ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2010-05-12 17:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, Pierre Tardy, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Tom Zanussi, Paul Mackerras, linux-kernel,
	mathieu.desnoyers, arjan, ziga.mahkovec

On Wed, 2010-05-12 at 18:59 +0200, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Wed, 2010-05-12 at 16:48 +0200, Frederic Weisbecker wrote:
> > > On Wed, May 12, 2010 at 03:37:27PM +0200, Pierre Tardy wrote:
> > 
> > > But we don't yet support trace_printk in perf. May be we could wrap
> > > them in trace events.
> > 
> > Hmm, do we really want to do that?
> > 
> > We really need to get the perf and ftrace trace buffers combined. I 
> > understand why perf chose to do the mmap buffers for the counting, but for 
> > live streaming, it is very inefficient compared to splice.
> 
> The thing is that for a very long time ftrace didnt have splice support and 
> survived just fine. Even today most of the ftrace usage isnt utilizing splice.

Actually, trace-cmd implements the splice interface and is used by
several people. I find myself using trace-cmd 90% of the time that I use
ftrace, specifically because of this speedup.

> 
> Yes, splice might help in some situations but on average it's an independent 
> speedup on the order of magnitude of a few percents, not a 'must have' item.

I'll have start running benchmarks to see what the actual speed up is.
I'm guessing it may be more than a few percent. It allows for zero copy
overhead and reuse of the data page.

-- Steve



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 17:07             ` Mathieu Desnoyers
@ 2010-05-12 17:47               ` Peter Zijlstra
  2010-05-12 17:53                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2010-05-12 17:47 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Steven Rostedt, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

On Wed, 2010-05-12 at 13:07 -0400, Mathieu Desnoyers wrote:
> > Its mostly an interface/api question. You cannot easily splice() a
> > mmap()'ed buffer on machines that have address constraints like sparc.
> 
> Ah ? Can you explain this issue a bit more ? There is possibly a concern I don't
> quite see here. 

IIRC Sparc has virtually tagged D-caches on the lower 9 bits of the pfn,
so to avoid cache aliasing both mappings (kernel and user) need to be
aligned.

Since splice needs to swap pages it needs to allocate replacement pages
with the exact right alignment, which is rather expensive (in either
time or space).



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 17:47               ` Peter Zijlstra
@ 2010-05-12 17:53                 ` Mathieu Desnoyers
  2010-05-12 18:00                   ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2010-05-12 17:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Steven Rostedt, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2010-05-12 at 13:07 -0400, Mathieu Desnoyers wrote:
> > > Its mostly an interface/api question. You cannot easily splice() a
> > > mmap()'ed buffer on machines that have address constraints like sparc.
> > 
> > Ah ? Can you explain this issue a bit more ? There is possibly a concern I don't
> > quite see here. 
> 
> IIRC Sparc has virtually tagged D-caches on the lower 9 bits of the pfn,
> so to avoid cache aliasing both mappings (kernel and user) need to be
> aligned.

I see.

> 
> Since splice needs to swap pages it needs to allocate replacement pages
> with the exact right alignment, which is rather expensive (in either
> time or space).
> 

Hrm ? Why does splice() _need_ to swap pages exactly ? Or is it purely just the
way the current Ftrace implementation happens to work ? ;)

Would you be fine with an alternative implementation and API that support both
mmap() and splice(), without any need to swap pages ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 17:53                 ` Mathieu Desnoyers
@ 2010-05-12 18:00                   ` Peter Zijlstra
  2010-05-12 18:04                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2010-05-12 18:00 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Steven Rostedt, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

On Wed, 2010-05-12 at 13:53 -0400, Mathieu Desnoyers wrote:
> 
> Would you be fine with an alternative implementation and API that support both
> mmap() and splice(), without any need to swap pages ? 

Sure, what did you have in mind?

You basically give you page to splice, never to see it again, you'd need
to fill that hole in your buffer again.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 18:00                   ` Peter Zijlstra
@ 2010-05-12 18:04                     ` Mathieu Desnoyers
  2010-05-12 18:08                       ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2010-05-12 18:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Steven Rostedt, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2010-05-12 at 13:53 -0400, Mathieu Desnoyers wrote:
> > 
> > Would you be fine with an alternative implementation and API that support both
> > mmap() and splice(), without any need to swap pages ? 
> 
> Sure, what did you have in mind?
> 
> You basically give you page to splice, never to see it again, you'd need
> to fill that hole in your buffer again.
> 

Can't we keep multiple references to each page ? (shared page) so it's still in
the buffer, also accessed by mmap(), and in addition accessed by splice.

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 18:04                     ` Mathieu Desnoyers
@ 2010-05-12 18:08                       ` Peter Zijlstra
  2010-05-12 18:37                         ` Mathieu Desnoyers
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2010-05-12 18:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Steven Rostedt, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

On Wed, 2010-05-12 at 14:04 -0400, Mathieu Desnoyers wrote:
> Can't we keep multiple references to each page ? (shared page) so it's still in
> the buffer, also accessed by mmap(), and in addition accessed by splice.

I'm not sure, the problem seems to be that a splice-consumer might want
to inject the page into a whole different address-space, over-writing
page->mapping/->index etc.




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] PyTimechart
  2010-05-12 17:15           ` Steven Rostedt
@ 2010-05-12 18:23             ` Ingo Molnar
  0 siblings, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2010-05-12 18:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frederic Weisbecker, Pierre Tardy, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Tom Zanussi, Paul Mackerras, linux-kernel,
	mathieu.desnoyers, arjan, ziga.mahkovec


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 2010-05-12 at 18:59 +0200, Ingo Molnar wrote:
> > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > On Wed, 2010-05-12 at 16:48 +0200, Frederic Weisbecker wrote:
> > > > On Wed, May 12, 2010 at 03:37:27PM +0200, Pierre Tardy wrote:
> > > 
> > > > But we don't yet support trace_printk in perf. May be we could wrap
> > > > them in trace events.
> > > 
> > > Hmm, do we really want to do that?
> > > 
> > > We really need to get the perf and ftrace trace buffers combined. I 
> > > understand why perf chose to do the mmap buffers for the counting, but 
> > > for live streaming, it is very inefficient compared to splice.
> > 
> > The thing is that for a very long time ftrace didnt have splice support 
> > and survived just fine. Even today most of the ftrace usage isnt utilizing 
> > splice.
> 
> Actually, trace-cmd implements the splice interface and is used by several 
> people. I find myself using trace-cmd 90% of the time that I use ftrace, 
> specifically because of this speedup.

i know, but most people still use /debug/tracing/ bits not trace-cmd.

> > Yes, splice might help in some situations but on average it's an 
> > independent speedup on the order of magnitude of a few percents, not a 
> > 'must have' item.
> 
> I'll have start running benchmarks to see what the actual speed up is. I'm 
> guessing it may be more than a few percent. It allows for zero copy overhead 
> and reuse of the data page.

Make sure you measure it in the context of a full app like PyTimechart.

You can measure the overhead using perf stat ;-)

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 18:08                       ` Peter Zijlstra
@ 2010-05-12 18:37                         ` Mathieu Desnoyers
  2010-05-12 18:46                           ` Steven Rostedt
  2010-05-12 18:49                           ` Peter Zijlstra
  0 siblings, 2 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2010-05-12 18:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Steven Rostedt, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2010-05-12 at 14:04 -0400, Mathieu Desnoyers wrote:
> > Can't we keep multiple references to each page ? (shared page) so it's still in
> > the buffer, also accessed by mmap(), and in addition accessed by splice.
> 
> I'm not sure, the problem seems to be that a splice-consumer might want
> to inject the page into a whole different address-space, over-writing
> page->mapping/->index etc.

OK, I see. In LTTng, I dropped the mmap() support when I integrated splice(). In
both case, I can share the pages between the "output" (mmap or splice) and the
ring buffer because my ring buffer does not care about
page->mapping/->index/etc, so I never have to swap them.

However, doing mmap() and splice() at the same time on the same pages seems
problematic for the reason you point out here (and not very useful anyway).
But I think restrictions could be done more transparently than what you propose,
e.g.:

1) create buffer -> return fd
   (perform pfn alignment for the architecture worse-case, e.g. support mmap()
    on sparc)

2a) mmap(fd)
    return -EBUSY if any of the pages has non-NULL mapping.
3a) munmap(fd)

2b) splice(fd)
    return -EBUSY if any of the pages has non-NULL mapping.

2c) read(fd)
    Could probably be done concurrently with splice() or mmap().

This way we would ensure that only mmap or splice is used on the buffer at a
given time without crippling the API.

Thoughts ?

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 18:37                         ` Mathieu Desnoyers
@ 2010-05-12 18:46                           ` Steven Rostedt
  2010-05-12 20:27                             ` Mathieu Desnoyers
  2010-05-12 18:49                           ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2010-05-12 18:46 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

On Wed, 2010-05-12 at 14:37 -0400, Mathieu Desnoyers wrote:

> OK, I see. In LTTng, I dropped the mmap() support when I integrated splice(). In
> both case, I can share the pages between the "output" (mmap or splice) and the
> ring buffer because my ring buffer does not care about
> page->mapping/->index/etc, so I never have to swap them.

I'm curious, how do you handle the overwrite mode without swapping?

-- Steve



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 18:37                         ` Mathieu Desnoyers
  2010-05-12 18:46                           ` Steven Rostedt
@ 2010-05-12 18:49                           ` Peter Zijlstra
  2010-05-12 18:51                             ` Mathieu Desnoyers
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2010-05-12 18:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Steven Rostedt, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

On Wed, 2010-05-12 at 14:37 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Wed, 2010-05-12 at 14:04 -0400, Mathieu Desnoyers wrote:
> > > Can't we keep multiple references to each page ? (shared page) so it's still in
> > > the buffer, also accessed by mmap(), and in addition accessed by splice.
> > 
> > I'm not sure, the problem seems to be that a splice-consumer might want
> > to inject the page into a whole different address-space, over-writing
> > page->mapping/->index etc.
> 
> OK, I see. In LTTng, I dropped the mmap() support when I integrated splice(). In
> both case, I can share the pages between the "output" (mmap or splice) and the
> ring buffer because my ring buffer does not care about
> page->mapping/->index/etc, so I never have to swap them.
> 
> However, doing mmap() and splice() at the same time on the same pages seems
> problematic for the reason you point out here (and not very useful anyway).
> But I think restrictions could be done more transparently than what you propose,
> e.g.:
> 
> 1) create buffer -> return fd
>    (perform pfn alignment for the architecture worse-case, e.g. support mmap()
>     on sparc)
> 
> 2a) mmap(fd)
>     return -EBUSY if any of the pages has non-NULL mapping.
> 3a) munmap(fd)
> 
> 2b) splice(fd)
>     return -EBUSY if any of the pages has non-NULL mapping.
> 
> 2c) read(fd)
>     Could probably be done concurrently with splice() or mmap().
> 
> This way we would ensure that only mmap or splice is used on the buffer at a
> given time without crippling the API.
> 
> Thoughts ?

Right, so the problem is that we now use mmap() to size the buffer. I
guess we could go adding a size attribute to perf_event_attr, but I
think its makes more sense to separate the actual event and the output
buffer objects.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 18:49                           ` Peter Zijlstra
@ 2010-05-12 18:51                             ` Mathieu Desnoyers
  2010-05-12 19:01                               ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2010-05-12 18:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Steven Rostedt, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2010-05-12 at 14:37 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Wed, 2010-05-12 at 14:04 -0400, Mathieu Desnoyers wrote:
> > > > Can't we keep multiple references to each page ? (shared page) so it's still in
> > > > the buffer, also accessed by mmap(), and in addition accessed by splice.
> > > 
> > > I'm not sure, the problem seems to be that a splice-consumer might want
> > > to inject the page into a whole different address-space, over-writing
> > > page->mapping/->index etc.
> > 
> > OK, I see. In LTTng, I dropped the mmap() support when I integrated splice(). In
> > both case, I can share the pages between the "output" (mmap or splice) and the
> > ring buffer because my ring buffer does not care about
> > page->mapping/->index/etc, so I never have to swap them.
> > 
> > However, doing mmap() and splice() at the same time on the same pages seems
> > problematic for the reason you point out here (and not very useful anyway).
> > But I think restrictions could be done more transparently than what you propose,
> > e.g.:
> > 
> > 1) create buffer -> return fd
> >    (perform pfn alignment for the architecture worse-case, e.g. support mmap()
> >     on sparc)
> > 
> > 2a) mmap(fd)
> >     return -EBUSY if any of the pages has non-NULL mapping.
> > 3a) munmap(fd)
> > 
> > 2b) splice(fd)
> >     return -EBUSY if any of the pages has non-NULL mapping.
> > 
> > 2c) read(fd)
> >     Could probably be done concurrently with splice() or mmap().
> > 
> > This way we would ensure that only mmap or splice is used on the buffer at a
> > given time without crippling the API.
> > 
> > Thoughts ?
> 
> Right, so the problem is that we now use mmap() to size the buffer. I
> guess we could go adding a size attribute to perf_event_attr, but I
> think its makes more sense to separate the actual event and the output
> buffer objects.

It makes it hard to use splice() or read() if you don't specify the buffer size
at creation time. That alone seems like a pretty good argument for fixing the
size before the mmap() call.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 18:51                             ` Mathieu Desnoyers
@ 2010-05-12 19:01                               ` Peter Zijlstra
  2010-05-12 20:52                                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2010-05-12 19:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Steven Rostedt, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

On Wed, 2010-05-12 at 14:51 -0400, Mathieu Desnoyers wrote:
> 
> It makes it hard to use splice() or read() if you don't specify the buffer size
> at creation time. That alone seems like a pretty good argument for fixing the
> size before the mmap() call. 

Our read() doesn't actually provide the same data splice()/mmap() would.
But yeah, hence my proposal to create a separate syscall that creates
buffer objects.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 18:46                           ` Steven Rostedt
@ 2010-05-12 20:27                             ` Mathieu Desnoyers
  2010-05-12 22:21                               ` Steven Rostedt
  0 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2010-05-12 20:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Wed, 2010-05-12 at 14:37 -0400, Mathieu Desnoyers wrote:
> 
> > OK, I see. In LTTng, I dropped the mmap() support when I integrated splice(). In
> > both case, I can share the pages between the "output" (mmap or splice) and the
> > ring buffer because my ring buffer does not care about
> > page->mapping/->index/etc, so I never have to swap them.
> 
> I'm curious, how do you handle the overwrite mode without swapping?

Explanation extracted from:

http://www.lttng.org/pub/thesis/desnoyers-dissertation-2009-12.pdf

5.4 Atomic Buffering Scheme
5.4.3 Algorithms

"This is achieved by adding a supplementary sub-buffer, owned by the reader. A
table with pointers to the sub-buffers being used by the writer allows the
reader to change the reference to each sub-buffer atomically. The
ReadGetSubbuf() algorithm is responsible for atomically exchanging the reference
to the sub-buffer about to be read with the sub-buffer currently owned by the
reader. If the CAS operation fails, the reader does not get access to the buffer
for reading."

I know your mother tongue is C, not English, so I just prepared a git repo with
the current state of my work (please note that I'm currently in the process of
cleaning up this code).

http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-ringbuffer.git

Interesting bits below.

Thanks,

Mathieu

Note: The "frontend" refers to the buffer writer/reader synchronization
algorithm. The "backend" deals with allocation of the memory buffers. This
frontend/backend separation permits to use the same ring buffer synchronization
code to write data to kernel pages, to video memory, to serial ports, etc etc,
without having to deal with different synchronization schemes.

Where the reader grabs the sub-buffer :

kernel/trace/ring_buffer_frontend.c: ring_buffer_get_subbuf()

396         ret = update_read_sb_index(&buf->backend, &chan->backend, consumed_idx);
397         if (ret)
398                 return ret;

and releases it:

kernel/trace/ring_buffer_frontend.c: ring_buffer_put_subbuf()

415         RCHAN_SB_SET_NOREF(buf->backend.buf_rsb.pages);

The writer clears the "noref" flag when it starts writing to a subbuffer, and
clears that flag when it has fully committed a subbuffer.

The primitives used by the "synchronization frontend" are declared in the
backend here:

kernel/trace/ring_buffer_page_backend_internal.h:

Interesting definitions and data structures for our current discussions:

17 #define RCHAN_SB_IS_NOREF(x)    ((unsigned long)(x) & RCHAN_NOREF_FLAG)
18 #define RCHAN_SB_SET_NOREF(x)   \
19         (x = (struct ring_buffer_backend_page *) \
20                 ((unsigned long)(x) | RCHAN_NOREF_FLAG))
21 #define RCHAN_SB_CLEAR_NOREF(x) \
22         (x = (struct ring_buffer_backend_page *) \
23                 ((unsigned long)(x) & ~RCHAN_NOREF_FLAG))
24
25 struct ring_buffer_backend_page {
26         void *virt;                     /* page virtual address (cached) */
27         struct page *page;              /* pointer to page structure */
28 };
29
30 struct ring_buffer_backend_subbuffer {
31         /* Pointer to backend pages for subbuf */
32         struct ring_buffer_backend_page *pages;
33 };

...

41 struct ring_buffer_backend {
42         /* Array of chanbuf_sb for writer */
43         struct ring_buffer_backend_subbuffer *buf_wsb;
44         /* chanbuf_sb for reader */
45         struct ring_buffer_backend_subbuffer buf_rsb;

...

97 /**
98  * ring_buffer_clear_noref_flag - Clear the noref subbuffer flag, for writer.
99  */
100 static __inline__
101 void ring_buffer_clear_noref_flag(struct ring_buffer_backend *bufb,
102                                   unsigned long idx)
103 {
104         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
105
106         sb_pages = bufb->buf_wsb[idx].pages;
107         for (;;) {
108                 if (!RCHAN_SB_IS_NOREF(sb_pages))
109                         return; /* Already writing to this buffer */
110                 new_sb_pages = sb_pages;
111                 RCHAN_SB_CLEAR_NOREF(new_sb_pages);
112                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
113                         sb_pages, new_sb_pages);
114                 if (likely(new_sb_pages == sb_pages))
115                         break;
116                 sb_pages = new_sb_pages;
117         }
118 }
119
120 /**
121  * ring_buffer_set_noref_flag - Set the noref subbuffer flag, for writer.
122  */
123 static __inline__
124 void ring_buffer_set_noref_flag(struct ring_buffer_backend *bufb,
125                                 unsigned long idx)
126 {
127         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
128
129         sb_pages = bufb->buf_wsb[idx].pages;
130         for (;;) {
131                 if (RCHAN_SB_IS_NOREF(sb_pages))
132                         return; /* Already set */
133                 new_sb_pages = sb_pages;
134                 RCHAN_SB_SET_NOREF(new_sb_pages);
135                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
136                         sb_pages, new_sb_pages);
137                 if (likely(new_sb_pages == sb_pages))
138                         break;
139                 sb_pages = new_sb_pages;
140         }
141 }
142
143 /**
144  * update_read_sb_index - Read-side subbuffer index update.
145  */
146 static __inline__
147 int update_read_sb_index(struct ring_buffer_backend *bufb,
148                          struct channel_backend *chanb,
149                          unsigned long consumed_idx)
150 {
151         struct ring_buffer_backend_page *old_wpage, *new_wpage;
152
153         if (unlikely(chanb->extra_reader_sb)) {
154                 /*
155                  * Exchange the target writer subbuffer with our own unused
156                  * subbuffer.
157                  */
158                 old_wpage = bufb->buf_wsb[consumed_idx].pages;
159                 if (unlikely(!RCHAN_SB_IS_NOREF(old_wpage)))
160                         return -EAGAIN;
161                 WARN_ON_ONCE(!RCHAN_SB_IS_NOREF(bufb->buf_rsb.pages));
162                 new_wpage = cmpxchg(&bufb->buf_wsb[consumed_idx].pages,
163                                 old_wpage,
164                                 bufb->buf_rsb.pages);
165                 if (unlikely(old_wpage != new_wpage))
166                         return -EAGAIN;
167                 bufb->buf_rsb.pages = new_wpage;
168                 RCHAN_SB_CLEAR_NOREF(bufb->buf_rsb.pages);
169         } else {
170                 /* No page exchange, use the writer page directly */
171                 bufb->buf_rsb.pages = bufb->buf_wsb[consumed_idx].pages;
172                 RCHAN_SB_CLEAR_NOREF(bufb->buf_rsb.pages);
173         }
174         return 0;
175 }


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 19:01                               ` Peter Zijlstra
@ 2010-05-12 20:52                                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2010-05-12 20:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Steven Rostedt, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2010-05-12 at 14:51 -0400, Mathieu Desnoyers wrote:
> > 
> > It makes it hard to use splice() or read() if you don't specify the buffer size
> > at creation time. That alone seems like a pretty good argument for fixing the
> > size before the mmap() call. 
> 
> Our read() doesn't actually provide the same data splice()/mmap() would.
> But yeah, hence my proposal to create a separate syscall that creates
> buffer objects.
> 

Well, it's a bit strange to have read() vs splice() outputting different data,
isn't it ?

I would think it's rather more natural to have read() and splice()
implementations outputting the exact same data, and mmap() just mmapping the
ring buffer in userspace. If you really want to export that information in text
format, then a different file descriptor could be used for that, with read()
and splice() support.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 20:27                             ` Mathieu Desnoyers
@ 2010-05-12 22:21                               ` Steven Rostedt
  2010-05-13 13:20                                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2010-05-12 22:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

On Wed, 2010-05-12 at 16:27 -0400, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > On Wed, 2010-05-12 at 14:37 -0400, Mathieu Desnoyers wrote:
> > 
> > > OK, I see. In LTTng, I dropped the mmap() support when I integrated splice(). In
> > > both case, I can share the pages between the "output" (mmap or splice) and the
> > > ring buffer because my ring buffer does not care about
> > > page->mapping/->index/etc, so I never have to swap them.
> > 
> > I'm curious, how do you handle the overwrite mode without swapping?
> 
> Explanation extracted from:
> 
> http://www.lttng.org/pub/thesis/desnoyers-dissertation-2009-12.pdf
> 
> 5.4 Atomic Buffering Scheme
> 5.4.3 Algorithms
> 
> "This is achieved by adding a supplementary sub-buffer, owned by the reader. A
> table with pointers to the sub-buffers being used by the writer allows the
> reader to change the reference to each sub-buffer atomically. The
> ReadGetSubbuf() algorithm is responsible for atomically exchanging the reference
> to the sub-buffer about to be read with the sub-buffer currently owned by the
> reader.

AKA - swapping

As I asked, this seems to do exactly what my ring buffer does, except
you use a table where I swap out the list.  But this is still swapping.


>  If the CAS operation fails, the reader does not get access to the buffer
> for reading."
> 
> I know your mother tongue is C, not English, so I just prepared a git repo with
> the current state of my work (please note that I'm currently in the process of
> cleaning up this code).
> 
> http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-ringbuffer.git
> 
> Interesting bits below.
> 
> Thanks,
> 
> Mathieu
> 
> Note: The "frontend" refers to the buffer writer/reader synchronization
> algorithm. The "backend" deals with allocation of the memory buffers. This
> frontend/backend separation permits to use the same ring buffer synchronization
> code to write data to kernel pages, to video memory, to serial ports, etc etc,
> without having to deal with different synchronization schemes.

OK

> 
> Where the reader grabs the sub-buffer :
> 
> kernel/trace/ring_buffer_frontend.c: ring_buffer_get_subbuf()
> 
> 396         ret = update_read_sb_index(&buf->backend, &chan->backend, consumed_idx);
> 397         if (ret)
> 398                 return ret;
> 
> and releases it:
> 
> kernel/trace/ring_buffer_frontend.c: ring_buffer_put_subbuf()
> 
> 415         RCHAN_SB_SET_NOREF(buf->backend.buf_rsb.pages);
> 
> The writer clears the "noref" flag when it starts writing to a subbuffer, and
> clears that flag when it has fully committed a subbuffer.

Should one of the "clears" above be a set?

> 
> The primitives used by the "synchronization frontend" are declared in the
> backend here:
> 
> kernel/trace/ring_buffer_page_backend_internal.h:
> 
> Interesting definitions and data structures for our current discussions:
> 
> 17 #define RCHAN_SB_IS_NOREF(x)    ((unsigned long)(x) & RCHAN_NOREF_FLAG)
> 18 #define RCHAN_SB_SET_NOREF(x)   \

I really hate caps, even for macros. If it acts like a function, keep it
lowercase. Caps are for constants not functions.

Linux convention has always had lowercase for macros that act like
functions. Heck, why not just make these static inlines?


> 19         (x = (struct ring_buffer_backend_page *) \
> 20                 ((unsigned long)(x) | RCHAN_NOREF_FLAG))
> 21 #define RCHAN_SB_CLEAR_NOREF(x) \
> 22         (x = (struct ring_buffer_backend_page *) \
> 23                 ((unsigned long)(x) & ~RCHAN_NOREF_FLAG))
> 24
> 25 struct ring_buffer_backend_page {
> 26         void *virt;                     /* page virtual address (cached) */
> 27         struct page *page;              /* pointer to page structure */
> 28 };
> 29
> 30 struct ring_buffer_backend_subbuffer {
> 31         /* Pointer to backend pages for subbuf */
> 32         struct ring_buffer_backend_page *pages;
> 33 };
> 
> ...
> 
> 41 struct ring_buffer_backend {
> 42         /* Array of chanbuf_sb for writer */
> 43         struct ring_buffer_backend_subbuffer *buf_wsb;
> 44         /* chanbuf_sb for reader */
> 45         struct ring_buffer_backend_subbuffer buf_rsb;

So this is equivalent to my reader_page?

> 
> ...
> 
> 97 /**
> 98  * ring_buffer_clear_noref_flag - Clear the noref subbuffer flag, for writer.
> 99  */
> 100 static __inline__
> 101 void ring_buffer_clear_noref_flag(struct ring_buffer_backend *bufb,
> 102                                   unsigned long idx)
> 103 {
> 104         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
> 105
> 106         sb_pages = bufb->buf_wsb[idx].pages;
> 107         for (;;) {
> 108                 if (!RCHAN_SB_IS_NOREF(sb_pages))
> 109                         return; /* Already writing to this buffer */
> 110                 new_sb_pages = sb_pages;
> 111                 RCHAN_SB_CLEAR_NOREF(new_sb_pages);
> 112                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
> 113                         sb_pages, new_sb_pages);
> 114                 if (likely(new_sb_pages == sb_pages))
> 115                         break;
> 116                 sb_pages = new_sb_pages;

The writer calls this??

> 117         }
> 118 }
> 119
> 120 /**
> 121  * ring_buffer_set_noref_flag - Set the noref subbuffer flag, for writer.
> 122  */
> 123 static __inline__
> 124 void ring_buffer_set_noref_flag(struct ring_buffer_backend *bufb,
> 125                                 unsigned long idx)
> 126 {
> 127         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
> 128
> 129         sb_pages = bufb->buf_wsb[idx].pages;
> 130         for (;;) {
> 131                 if (RCHAN_SB_IS_NOREF(sb_pages))
> 132                         return; /* Already set */
> 133                 new_sb_pages = sb_pages;
> 134                 RCHAN_SB_SET_NOREF(new_sb_pages);
> 135                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
> 136                         sb_pages, new_sb_pages);
> 137                 if (likely(new_sb_pages == sb_pages))
> 138                         break;
> 139                 sb_pages = new_sb_pages;

Again, the writer calls this??

> 140         }
> 141 }
> 142
> 143 /**
> 144  * update_read_sb_index - Read-side subbuffer index update.
> 145  */
> 146 static __inline__
> 147 int update_read_sb_index(struct ring_buffer_backend *bufb,
> 148                          struct channel_backend *chanb,
> 149                          unsigned long consumed_idx)
> 150 {
> 151         struct ring_buffer_backend_page *old_wpage, *new_wpage;
> 152
> 153         if (unlikely(chanb->extra_reader_sb)) {
> 154                 /*
> 155                  * Exchange the target writer subbuffer with our own unused
> 156                  * subbuffer.
> 157                  */
> 158                 old_wpage = bufb->buf_wsb[consumed_idx].pages;
> 159                 if (unlikely(!RCHAN_SB_IS_NOREF(old_wpage)))
> 160                         return -EAGAIN;
> 161                 WARN_ON_ONCE(!RCHAN_SB_IS_NOREF(bufb->buf_rsb.pages));
> 162                 new_wpage = cmpxchg(&bufb->buf_wsb[consumed_idx].pages,
> 163                                 old_wpage,
> 164                                 bufb->buf_rsb.pages);

This looks just like the swap with reader_page that I do, except you use
a table and I use the list.  How do you replenish the buf_rsb.pages if
the splice keeps the page you just received active?

-- Steve


> 165                 if (unlikely(old_wpage != new_wpage))
> 166                         return -EAGAIN;
> 167                 bufb->buf_rsb.pages = new_wpage;
> 168                 RCHAN_SB_CLEAR_NOREF(bufb->buf_rsb.pages);
> 169         } else {
> 170                 /* No page exchange, use the writer page directly */
> 171                 bufb->buf_rsb.pages = bufb->buf_wsb[consumed_idx].pages;
> 172                 RCHAN_SB_CLEAR_NOREF(bufb->buf_rsb.pages);
> 173         }
> 174         return 0;
> 175 }
> 
> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-12 22:21                               ` Steven Rostedt
@ 2010-05-13 13:20                                 ` Mathieu Desnoyers
  2010-05-13 15:42                                   ` Steven Rostedt
  0 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2010-05-13 13:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Wed, 2010-05-12 at 16:27 -0400, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > On Wed, 2010-05-12 at 14:37 -0400, Mathieu Desnoyers wrote:
> > > 
> > > > OK, I see. In LTTng, I dropped the mmap() support when I integrated splice(). In
> > > > both case, I can share the pages between the "output" (mmap or splice) and the
> > > > ring buffer because my ring buffer does not care about
> > > > page->mapping/->index/etc, so I never have to swap them.
> > > 
> > > I'm curious, how do you handle the overwrite mode without swapping?
> > 
> > Explanation extracted from:
> > 
> > http://www.lttng.org/pub/thesis/desnoyers-dissertation-2009-12.pdf
> > 
> > 5.4 Atomic Buffering Scheme
> > 5.4.3 Algorithms
> > 
> > "This is achieved by adding a supplementary sub-buffer, owned by the reader. A
> > table with pointers to the sub-buffers being used by the writer allows the
> > reader to change the reference to each sub-buffer atomically. The
> > ReadGetSubbuf() algorithm is responsible for atomically exchanging the reference
> > to the sub-buffer about to be read with the sub-buffer currently owned by the
> > reader.
> 
> AKA - swapping
> 
> As I asked, this seems to do exactly what my ring buffer does, except
> you use a table where I swap out the list.  But this is still swapping.

Yes, we could use the word swapping to explain this scheme I guess. Yes, it is
in some sense similar, with the distinction that here the ring buffer
reserve/commit (reader/writer synchronization) is all performed in the frontend,
thus independent from this page swapping.

When the buffer is in non-overwrite mode, I simply don't allocate a separate
subbuffer for the reader and don't need to perform swapping: the
producer/consumer offsets deal with reader/writer concurrency by mutually
excluding readers from the writer offset range and vice-versa.

> 
> 
> >  If the CAS operation fails, the reader does not get access to the buffer
> > for reading."
> > 
> > I know your mother tongue is C, not English, so I just prepared a git repo with
> > the current state of my work (please note that I'm currently in the process of
> > cleaning up this code).
> > 
> > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-ringbuffer.git
> > 
> > Interesting bits below.
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> > Note: The "frontend" refers to the buffer writer/reader synchronization
> > algorithm. The "backend" deals with allocation of the memory buffers. This
> > frontend/backend separation permits to use the same ring buffer synchronization
> > code to write data to kernel pages, to video memory, to serial ports, etc etc,
> > without having to deal with different synchronization schemes.
> 
> OK
> 
> > 
> > Where the reader grabs the sub-buffer :
> > 
> > kernel/trace/ring_buffer_frontend.c: ring_buffer_get_subbuf()
> > 
> > 396         ret = update_read_sb_index(&buf->backend, &chan->backend, consumed_idx);
> > 397         if (ret)
> > 398                 return ret;
> > 
> > and releases it:
> > 
> > kernel/trace/ring_buffer_frontend.c: ring_buffer_put_subbuf()
> > 
> > 415         RCHAN_SB_SET_NOREF(buf->backend.buf_rsb.pages);
> > 
> > The writer clears the "noref" flag when it starts writing to a subbuffer, and
> > clears that flag when it has fully committed a subbuffer.
> 
> Should one of the "clears" above be a set?

Yes, the second "clears" in my explanation is indeed a "set".

> 
> > 
> > The primitives used by the "synchronization frontend" are declared in the
> > backend here:
> > 
> > kernel/trace/ring_buffer_page_backend_internal.h:
> > 
> > Interesting definitions and data structures for our current discussions:
> > 
> > 17 #define RCHAN_SB_IS_NOREF(x)    ((unsigned long)(x) & RCHAN_NOREF_FLAG)
> > 18 #define RCHAN_SB_SET_NOREF(x)   \
> 
> I really hate caps, even for macros. If it acts like a function, keep it
> lowercase. Caps are for constants not functions.
> 
> Linux convention has always had lowercase for macros that act like
> functions. Heck, why not just make these static inlines?

Will do. Good point!

> 
> 
> > 19         (x = (struct ring_buffer_backend_page *) \
> > 20                 ((unsigned long)(x) | RCHAN_NOREF_FLAG))
> > 21 #define RCHAN_SB_CLEAR_NOREF(x) \
> > 22         (x = (struct ring_buffer_backend_page *) \
> > 23                 ((unsigned long)(x) & ~RCHAN_NOREF_FLAG))
> > 24
> > 25 struct ring_buffer_backend_page {
> > 26         void *virt;                     /* page virtual address (cached) */
> > 27         struct page *page;              /* pointer to page structure */
> > 28 };
> > 29
> > 30 struct ring_buffer_backend_subbuffer {
> > 31         /* Pointer to backend pages for subbuf */
> > 32         struct ring_buffer_backend_page *pages;
> > 33 };
> > 
> > ...
> > 
> > 41 struct ring_buffer_backend {
> > 42         /* Array of chanbuf_sb for writer */
> > 43         struct ring_buffer_backend_subbuffer *buf_wsb;
> > 44         /* chanbuf_sb for reader */
> > 45         struct ring_buffer_backend_subbuffer buf_rsb;
> 
> So this is equivalent to my reader_page?

Yes. But in this case, it is a reader "subbuffer", which is an array of pages.

> 
> > 
> > ...
> > 
> > 97 /**
> > 98  * ring_buffer_clear_noref_flag - Clear the noref subbuffer flag, for writer.
> > 99  */
> > 100 static __inline__
> > 101 void ring_buffer_clear_noref_flag(struct ring_buffer_backend *bufb,
> > 102                                   unsigned long idx)
> > 103 {
> > 104         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
> > 105
> > 106         sb_pages = bufb->buf_wsb[idx].pages;
> > 107         for (;;) {
> > 108                 if (!RCHAN_SB_IS_NOREF(sb_pages))
> > 109                         return; /* Already writing to this buffer */
> > 110                 new_sb_pages = sb_pages;
> > 111                 RCHAN_SB_CLEAR_NOREF(new_sb_pages);
> > 112                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
> > 113                         sb_pages, new_sb_pages);
> > 114                 if (likely(new_sb_pages == sb_pages))
> > 115                         break;
> > 116                 sb_pages = new_sb_pages;
> 
> The writer calls this??

Yes. But the common case (for each event) is simply a
"if (!RCHAN_SB_IS_NOREF(sb_pages))" test that returns. The cmpxchg() is only
performed at subbuffer boundary.

Will update the comment above to:

/**
 * ring_buffer_clear_noref - Clear the noref subbuffer flag, called by writer.
 */
static __inline__
void ring_buffer_clear_noref(struct ring_buffer_backend *bufb,
                             unsigned long idx)


> 
> > 117         }
> > 118 }
> > 119
> > 120 /**
> > 121  * ring_buffer_set_noref_flag - Set the noref subbuffer flag, for writer.
> > 122  */
> > 123 static __inline__
> > 124 void ring_buffer_set_noref_flag(struct ring_buffer_backend *bufb,
> > 125                                 unsigned long idx)
> > 126 {
> > 127         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
> > 128
> > 129         sb_pages = bufb->buf_wsb[idx].pages;
> > 130         for (;;) {
> > 131                 if (RCHAN_SB_IS_NOREF(sb_pages))
> > 132                         return; /* Already set */
> > 133                 new_sb_pages = sb_pages;
> > 134                 RCHAN_SB_SET_NOREF(new_sb_pages);
> > 135                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
> > 136                         sb_pages, new_sb_pages);
> > 137                 if (likely(new_sb_pages == sb_pages))
> > 138                         break;
> > 139                 sb_pages = new_sb_pages;
> 
> Again, the writer calls this??

Yep.

> 
> > 140         }
> > 141 }
> > 142
> > 143 /**
> > 144  * update_read_sb_index - Read-side subbuffer index update.
> > 145  */
> > 146 static __inline__
> > 147 int update_read_sb_index(struct ring_buffer_backend *bufb,
> > 148                          struct channel_backend *chanb,
> > 149                          unsigned long consumed_idx)
> > 150 {
> > 151         struct ring_buffer_backend_page *old_wpage, *new_wpage;
> > 152
> > 153         if (unlikely(chanb->extra_reader_sb)) {
> > 154                 /*
> > 155                  * Exchange the target writer subbuffer with our own unused
> > 156                  * subbuffer.
> > 157                  */
> > 158                 old_wpage = bufb->buf_wsb[consumed_idx].pages;
> > 159                 if (unlikely(!RCHAN_SB_IS_NOREF(old_wpage)))
> > 160                         return -EAGAIN;
> > 161                 WARN_ON_ONCE(!RCHAN_SB_IS_NOREF(bufb->buf_rsb.pages));
> > 162                 new_wpage = cmpxchg(&bufb->buf_wsb[consumed_idx].pages,
> > 163                                 old_wpage,
> > 164                                 bufb->buf_rsb.pages);
> 
> This looks just like the swap with reader_page that I do, except you use
> a table and I use the list.  How do you replenish the buf_rsb.pages if
> the splice keeps the page you just received active?

I don't allow other reads to proceed as long as splice is holding pages that
belong to the reader-owned subbuffer. The read semantic is basically:

ring_buffer_open_read() /* only one reader at a time can open a ring buffer */
get_subbuf_size()
while (buffer is not finalized and empty) {
  poll()
  ret = ring_buffer_get_subbuf()
  if (!ret)
    continue;
  /* The splice ops below can be performed in multiple calls, e.g. first splice
   * only a portion of a subbuffer to a pipe, then splice to the disk/network,
   * and move to the next subbuffer portion until all the subbuffer is sent.
   */
  splice one subbuffer worth of data to a pipe
  splice the data from pipe to disk/network
  ring_buffer_put_subbuf()
}
ring_buffer_close_read()

The reader code above works both with flight recorder and non-overwrite mode.

The code above assumes that upon return from the splice() to disk/network,
splice() is not using the pages anymore (I assume that splice() performs the
transfer synchronously with the call).

The VFS interface I use for get_subbuf_size(), ring_buffer_get_subbuf() and
ring_buffer_put_subbuf() are new ioctls. Note that these can be used for both
splice() and mmap() types of backend access, as they only call into the
frontend.

Thanks,

Mathieu


> 
> -- Steve
> 
> 
> > 165                 if (unlikely(old_wpage != new_wpage))
> > 166                         return -EAGAIN;
> > 167                 bufb->buf_rsb.pages = new_wpage;
> > 168                 RCHAN_SB_CLEAR_NOREF(bufb->buf_rsb.pages);
> > 169         } else {
> > 170                 /* No page exchange, use the writer page directly */
> > 171                 bufb->buf_rsb.pages = bufb->buf_wsb[consumed_idx].pages;
> > 172                 RCHAN_SB_CLEAR_NOREF(bufb->buf_rsb.pages);
> > 173         }
> > 174         return 0;
> > 175 }
> > 
> > 
> 
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] PyTimechart
  2010-05-12 17:13       ` Pierre Tardy
@ 2010-05-13 15:02         ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 34+ messages in thread
From: Arnaldo Carvalho de Melo @ 2010-05-13 15:02 UTC (permalink / raw)
  To: Pierre Tardy
  Cc: Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, Tom Zanussi,
	Paul Mackerras, linux-kernel, rostedt, mathieu.desnoyers, arjan,
	ziga.mahkovec

Em Wed, May 12, 2010 at 07:13:00PM +0200, Pierre Tardy escreveu:
> > The code I'm looking at here doesn't look dirty to me at a glance:
> > http://gitorious.org/pytimechart/pytimechart/trees/master

> That's because you did not read between the lines ;-)
> The display core needs to be simplified.

Not even having looked at the code, I like your attitude! :-)

- Arnaldo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-13 13:20                                 ` Mathieu Desnoyers
@ 2010-05-13 15:42                                   ` Steven Rostedt
  2010-05-13 16:31                                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2010-05-13 15:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

On Thu, 2010-05-13 at 09:20 -0400, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > On Wed, 2010-05-12 at 16:27 -0400, Mathieu Desnoyers wrote:
> > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > On Wed, 2010-05-12 at 14:37 -0400, Mathieu Desnoyers wrote:
> > > > 
> > > > > OK, I see. In LTTng, I dropped the mmap() support when I integrated splice(). In
> > > > > both case, I can share the pages between the "output" (mmap or splice) and the
> > > > > ring buffer because my ring buffer does not care about
> > > > > page->mapping/->index/etc, so I never have to swap them.
> > > > 
> > > > I'm curious, how do you handle the overwrite mode without swapping?
> > > 
> > > Explanation extracted from:
> > > 
> > > http://www.lttng.org/pub/thesis/desnoyers-dissertation-2009-12.pdf
> > > 
> > > 5.4 Atomic Buffering Scheme
> > > 5.4.3 Algorithms
> > > 
> > > "This is achieved by adding a supplementary sub-buffer, owned by the reader. A
> > > table with pointers to the sub-buffers being used by the writer allows the
> > > reader to change the reference to each sub-buffer atomically. The
> > > ReadGetSubbuf() algorithm is responsible for atomically exchanging the reference
> > > to the sub-buffer about to be read with the sub-buffer currently owned by the
> > > reader.
> > 
> > AKA - swapping
> > 
> > As I asked, this seems to do exactly what my ring buffer does, except
> > you use a table where I swap out the list.  But this is still swapping.
> 
> Yes, we could use the word swapping to explain this scheme I guess. Yes, it is
> in some sense similar, with the distinction that here the ring buffer
> reserve/commit (reader/writer synchronization) is all performed in the frontend,
> thus independent from this page swapping.


The difference is only in the semantics.

> 
> When the buffer is in non-overwrite mode, I simply don't allocate a separate
> subbuffer for the reader and don't need to perform swapping: the
> producer/consumer offsets deal with reader/writer concurrency by mutually
> excluding readers from the writer offset range and vice-versa.

Yeah, I thought about doing the same, but 1) I didn't want to add
different code, 2) ftrace has yet had a need to use non-overwrite mode.

> 
> > 
> > 
> > >  If the CAS operation fails, the reader does not get access to the buffer
> > > for reading."
> > > 
> > > I know your mother tongue is C, not English, so I just prepared a git repo with
> > > the current state of my work (please note that I'm currently in the process of
> > > cleaning up this code).
> > > 
> > > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-ringbuffer.git
> > > 
> > > Interesting bits below.
> > > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > > Note: The "frontend" refers to the buffer writer/reader synchronization
> > > algorithm. The "backend" deals with allocation of the memory buffers. This
> > > frontend/backend separation permits to use the same ring buffer synchronization
> > > code to write data to kernel pages, to video memory, to serial ports, etc etc,
> > > without having to deal with different synchronization schemes.
> > 
> > OK
> > 
> > > 
> > > Where the reader grabs the sub-buffer :
> > > 
> > > kernel/trace/ring_buffer_frontend.c: ring_buffer_get_subbuf()
> > > 
> > > 396         ret = update_read_sb_index(&buf->backend, &chan->backend, consumed_idx);
> > > 397         if (ret)
> > > 398                 return ret;
> > > 
> > > and releases it:
> > > 
> > > kernel/trace/ring_buffer_frontend.c: ring_buffer_put_subbuf()
> > > 
> > > 415         RCHAN_SB_SET_NOREF(buf->backend.buf_rsb.pages);
> > > 
> > > The writer clears the "noref" flag when it starts writing to a subbuffer, and
> > > clears that flag when it has fully committed a subbuffer.
> > 
> > Should one of the "clears" above be a set?
> 
> Yes, the second "clears" in my explanation is indeed a "set".
> 
> > 
> > > 
> > > The primitives used by the "synchronization frontend" are declared in the
> > > backend here:
> > > 
> > > kernel/trace/ring_buffer_page_backend_internal.h:
> > > 
> > > Interesting definitions and data structures for our current discussions:
> > > 
> > > 17 #define RCHAN_SB_IS_NOREF(x)    ((unsigned long)(x) & RCHAN_NOREF_FLAG)
> > > 18 #define RCHAN_SB_SET_NOREF(x)   \
> > 
> > I really hate caps, even for macros. If it acts like a function, keep it
> > lowercase. Caps are for constants not functions.
> > 
> > Linux convention has always had lowercase for macros that act like
> > functions. Heck, why not just make these static inlines?
> 
> Will do. Good point!
> 
> > 
> > 
> > > 19         (x = (struct ring_buffer_backend_page *) \
> > > 20                 ((unsigned long)(x) | RCHAN_NOREF_FLAG))
> > > 21 #define RCHAN_SB_CLEAR_NOREF(x) \
> > > 22         (x = (struct ring_buffer_backend_page *) \
> > > 23                 ((unsigned long)(x) & ~RCHAN_NOREF_FLAG))
> > > 24
> > > 25 struct ring_buffer_backend_page {
> > > 26         void *virt;                     /* page virtual address (cached) */
> > > 27         struct page *page;              /* pointer to page structure */
> > > 28 };
> > > 29
> > > 30 struct ring_buffer_backend_subbuffer {
> > > 31         /* Pointer to backend pages for subbuf */
> > > 32         struct ring_buffer_backend_page *pages;
> > > 33 };
> > > 
> > > ...
> > > 
> > > 41 struct ring_buffer_backend {
> > > 42         /* Array of chanbuf_sb for writer */
> > > 43         struct ring_buffer_backend_subbuffer *buf_wsb;
> > > 44         /* chanbuf_sb for reader */
> > > 45         struct ring_buffer_backend_subbuffer buf_rsb;
> > 
> > So this is equivalent to my reader_page?
> 
> Yes. But in this case, it is a reader "subbuffer", which is an array of pages.

It's still a subbuffer. Again, difference in semantics.


> 
> > 
> > > 
> > > ...
> > > 
> > > 97 /**
> > > 98  * ring_buffer_clear_noref_flag - Clear the noref subbuffer flag, for writer.
> > > 99  */
> > > 100 static __inline__
> > > 101 void ring_buffer_clear_noref_flag(struct ring_buffer_backend *bufb,
> > > 102                                   unsigned long idx)
> > > 103 {
> > > 104         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
> > > 105
> > > 106         sb_pages = bufb->buf_wsb[idx].pages;
> > > 107         for (;;) {
> > > 108                 if (!RCHAN_SB_IS_NOREF(sb_pages))
> > > 109                         return; /* Already writing to this buffer */
> > > 110                 new_sb_pages = sb_pages;
> > > 111                 RCHAN_SB_CLEAR_NOREF(new_sb_pages);
> > > 112                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
> > > 113                         sb_pages, new_sb_pages);
> > > 114                 if (likely(new_sb_pages == sb_pages))
> > > 115                         break;
> > > 116                 sb_pages = new_sb_pages;
> > 
> > The writer calls this??
> 
> Yes. But the common case (for each event) is simply a
> "if (!RCHAN_SB_IS_NOREF(sb_pages))" test that returns. The cmpxchg() is only
> performed at subbuffer boundary.

Is the cmpxchg only contending with other writers?

> 
> Will update the comment above to:
> 
> /**
>  * ring_buffer_clear_noref - Clear the noref subbuffer flag, called by writer.
>  */
> static __inline__
> void ring_buffer_clear_noref(struct ring_buffer_backend *bufb,
>                              unsigned long idx)
> 
> 
> > 
> > > 117         }
> > > 118 }
> > > 119
> > > 120 /**
> > > 121  * ring_buffer_set_noref_flag - Set the noref subbuffer flag, for writer.
> > > 122  */
> > > 123 static __inline__
> > > 124 void ring_buffer_set_noref_flag(struct ring_buffer_backend *bufb,
> > > 125                                 unsigned long idx)
> > > 126 {
> > > 127         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
> > > 128
> > > 129         sb_pages = bufb->buf_wsb[idx].pages;
> > > 130         for (;;) {
> > > 131                 if (RCHAN_SB_IS_NOREF(sb_pages))
> > > 132                         return; /* Already set */
> > > 133                 new_sb_pages = sb_pages;
> > > 134                 RCHAN_SB_SET_NOREF(new_sb_pages);
> > > 135                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
> > > 136                         sb_pages, new_sb_pages);
> > > 137                 if (likely(new_sb_pages == sb_pages))
> > > 138                         break;
> > > 139                 sb_pages = new_sb_pages;
> > 
> > Again, the writer calls this??
> 
> Yep.
> 
> > 
> > > 140         }
> > > 141 }
> > > 142
> > > 143 /**
> > > 144  * update_read_sb_index - Read-side subbuffer index update.
> > > 145  */
> > > 146 static __inline__
> > > 147 int update_read_sb_index(struct ring_buffer_backend *bufb,
> > > 148                          struct channel_backend *chanb,
> > > 149                          unsigned long consumed_idx)
> > > 150 {
> > > 151         struct ring_buffer_backend_page *old_wpage, *new_wpage;
> > > 152
> > > 153         if (unlikely(chanb->extra_reader_sb)) {
> > > 154                 /*
> > > 155                  * Exchange the target writer subbuffer with our own unused
> > > 156                  * subbuffer.
> > > 157                  */
> > > 158                 old_wpage = bufb->buf_wsb[consumed_idx].pages;
> > > 159                 if (unlikely(!RCHAN_SB_IS_NOREF(old_wpage)))
> > > 160                         return -EAGAIN;
> > > 161                 WARN_ON_ONCE(!RCHAN_SB_IS_NOREF(bufb->buf_rsb.pages));
> > > 162                 new_wpage = cmpxchg(&bufb->buf_wsb[consumed_idx].pages,
> > > 163                                 old_wpage,
> > > 164                                 bufb->buf_rsb.pages);
> > 
> > This looks just like the swap with reader_page that I do, except you use
> > a table and I use the list.  How do you replenish the buf_rsb.pages if
> > the splice keeps the page you just received active?
> 
> I don't allow other reads to proceed as long as splice is holding pages that
> belong to the reader-owned subbuffer. The read semantic is basically:
> 
> ring_buffer_open_read() /* only one reader at a time can open a ring buffer */
> get_subbuf_size()
> while (buffer is not finalized and empty) {
>   poll()
>   ret = ring_buffer_get_subbuf()
>   if (!ret)
>     continue;
>   /* The splice ops below can be performed in multiple calls, e.g. first splice
>    * only a portion of a subbuffer to a pipe, then splice to the disk/network,
>    * and move to the next subbuffer portion until all the subbuffer is sent.
>    */
>   splice one subbuffer worth of data to a pipe
>   splice the data from pipe to disk/network
>   ring_buffer_put_subbuf()
> }
> ring_buffer_close_read()
> 
> The reader code above works both with flight recorder and non-overwrite mode.
> 
> The code above assumes that upon return from the splice() to disk/network,
> splice() is not using the pages anymore (I assume that splice() performs the
> transfer synchronously with the call).
> 
> The VFS interface I use for get_subbuf_size(), ring_buffer_get_subbuf() and
> ring_buffer_put_subbuf() are new ioctls. Note that these can be used for both
> splice() and mmap() types of backend access, as they only call into the
> frontend.

Hmm, so basically you lose pages until they are returned. I guess I can
trivially add the same thing now to the current ring buffer.

-- Steve

> > 
> > > 165                 if (unlikely(old_wpage != new_wpage))
> > > 166                         return -EAGAIN;
> > > 167                 bufb->buf_rsb.pages = new_wpage;
> > > 168                 RCHAN_SB_CLEAR_NOREF(bufb->buf_rsb.pages);
> > > 169         } else {
> > > 170                 /* No page exchange, use the writer page directly */
> > > 171                 bufb->buf_rsb.pages = bufb->buf_wsb[consumed_idx].pages;
> > > 172                 RCHAN_SB_CLEAR_NOREF(bufb->buf_rsb.pages);
> > > 173         }
> > > 174         return 0;
> > > 175 }
> > > 
> > > 
> > 
> > 
> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-13 15:42                                   ` Steven Rostedt
@ 2010-05-13 16:31                                     ` Mathieu Desnoyers
  2010-05-13 16:46                                       ` Steven Rostedt
  2010-05-14  7:53                                       ` Peter Zijlstra
  0 siblings, 2 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2010-05-13 16:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Thu, 2010-05-13 at 09:20 -0400, Mathieu Desnoyers wrote:
[...]
> > > > 
> > > > ...
> > > > 
> > > > 97 /**
> > > > 98  * ring_buffer_clear_noref_flag - Clear the noref subbuffer flag, for writer.
> > > > 99  */
> > > > 100 static __inline__
> > > > 101 void ring_buffer_clear_noref_flag(struct ring_buffer_backend *bufb,
> > > > 102                                   unsigned long idx)
> > > > 103 {
> > > > 104         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
> > > > 105
> > > > 106         sb_pages = bufb->buf_wsb[idx].pages;
> > > > 107         for (;;) {
> > > > 108                 if (!RCHAN_SB_IS_NOREF(sb_pages))
> > > > 109                         return; /* Already writing to this buffer */
> > > > 110                 new_sb_pages = sb_pages;
> > > > 111                 RCHAN_SB_CLEAR_NOREF(new_sb_pages);
> > > > 112                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
> > > > 113                         sb_pages, new_sb_pages);
> > > > 114                 if (likely(new_sb_pages == sb_pages))
> > > > 115                         break;
> > > > 116                 sb_pages = new_sb_pages;
> > > 
> > > The writer calls this??
> > 
> > Yes. But the common case (for each event) is simply a
> > "if (!RCHAN_SB_IS_NOREF(sb_pages))" test that returns. The cmpxchg() is only
> > performed at subbuffer boundary.
> 
> Is the cmpxchg only contending with other writers?

No. Would have this been the case, I would have used a cmpxchg_local(). This
cmpxchg used to deal with subbuffer swap is touching the subbuffer "pages"
pointer, which can be updated concurrently by other writers as well as readers.

The writer clears the noref flags when starting to write in a subbuffers, and
sets it when delivering the subbuffer (when it is fully committed).

The reader can only ever swap the subbuffer with the one it owns if the noref
flag is set. The reader uses a cmpxchg() too to perform the swap.

[...]

> > > 
> > > This looks just like the swap with reader_page that I do, except you use
> > > a table and I use the list.  How do you replenish the buf_rsb.pages if
> > > the splice keeps the page you just received active?
> > 
> > I don't allow other reads to proceed as long as splice is holding pages that
> > belong to the reader-owned subbuffer. The read semantic is basically:
> > 
> > ring_buffer_open_read() /* only one reader at a time can open a ring buffer */
> > get_subbuf_size()
> > while (buffer is not finalized and empty) {
> >   poll()
> >   ret = ring_buffer_get_subbuf()
> >   if (!ret)
> >     continue;
> >   /* The splice ops below can be performed in multiple calls, e.g. first splice
> >    * only a portion of a subbuffer to a pipe, then splice to the disk/network,
> >    * and move to the next subbuffer portion until all the subbuffer is sent.
> >    */
> >   splice one subbuffer worth of data to a pipe
> >   splice the data from pipe to disk/network
> >   ring_buffer_put_subbuf()
> > }
> > ring_buffer_close_read()
> > 
> > The reader code above works both with flight recorder and non-overwrite mode.
> > 
> > The code above assumes that upon return from the splice() to disk/network,
> > splice() is not using the pages anymore (I assume that splice() performs the
> > transfer synchronously with the call).
> > 
> > The VFS interface I use for get_subbuf_size(), ring_buffer_get_subbuf() and
> > ring_buffer_put_subbuf() are new ioctls. Note that these can be used for both
> > splice() and mmap() types of backend access, as they only call into the
> > frontend.
> 
> Hmm, so basically you lose pages until they are returned. I guess I can
> trivially add the same thing now to the current ring buffer.

Yep. Having the ability to keep an array of pages (rather that just a single
page at a time) allows splice() to move many pages at once efficiently, while
permitting this "pages owned by the readers, lend to splice() until it returns"
simplification. I also never have to allocate pages while tracing: all the pages
I need are allocated when the buffer is created (and at the special case of cpu
hotplug, but this is expected for per-cpu buffers).

In addition, this would play well with mmap() too: we can simply add a
ring_buffer_get_mmap_offset() method to the backend (exported through another
ioctl) that would let user-space know the start of the mmap'd buffer range
currently owned by the reader. So we can inform user-space of the currently
owned page range without even changing the underlying memory map.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-13 16:31                                     ` Mathieu Desnoyers
@ 2010-05-13 16:46                                       ` Steven Rostedt
  2010-05-13 17:10                                         ` Mathieu Desnoyers
  2010-05-14  7:53                                       ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2010-05-13 16:46 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

On Thu, 2010-05-13 at 12:31 -0400, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > On Thu, 2010-05-13 at 09:20 -0400, Mathieu Desnoyers wrote:
> [...]
> > > > > 
> > > > > ...
> > > > > 
> > > > > 97 /**
> > > > > 98  * ring_buffer_clear_noref_flag - Clear the noref subbuffer flag, for writer.
> > > > > 99  */
> > > > > 100 static __inline__
> > > > > 101 void ring_buffer_clear_noref_flag(struct ring_buffer_backend *bufb,
> > > > > 102                                   unsigned long idx)
> > > > > 103 {
> > > > > 104         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
> > > > > 105
> > > > > 106         sb_pages = bufb->buf_wsb[idx].pages;
> > > > > 107         for (;;) {
> > > > > 108                 if (!RCHAN_SB_IS_NOREF(sb_pages))
> > > > > 109                         return; /* Already writing to this buffer */
> > > > > 110                 new_sb_pages = sb_pages;
> > > > > 111                 RCHAN_SB_CLEAR_NOREF(new_sb_pages);
> > > > > 112                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
> > > > > 113                         sb_pages, new_sb_pages);
> > > > > 114                 if (likely(new_sb_pages == sb_pages))
> > > > > 115                         break;
> > > > > 116                 sb_pages = new_sb_pages;
> > > > 
> > > > The writer calls this??
> > > 
> > > Yes. But the common case (for each event) is simply a
> > > "if (!RCHAN_SB_IS_NOREF(sb_pages))" test that returns. The cmpxchg() is only
> > > performed at subbuffer boundary.
> > 
> > Is the cmpxchg only contending with other writers?
> 
> No. Would have this been the case, I would have used a cmpxchg_local(). This
> cmpxchg used to deal with subbuffer swap is touching the subbuffer "pages"
> pointer, which can be updated concurrently by other writers as well as readers.
> 
> The writer clears the noref flags when starting to write in a subbuffers, and
> sets it when delivering the subbuffer (when it is fully committed).
> 
> The reader can only ever swap the subbuffer with the one it owns if the noref
> flag is set. The reader uses a cmpxchg() too to perform the swap.
> 

So the writer can loop because of a reader? And theoretically, a
aggressive reader that is constantly swapping pages can halt a writer?

I thought you said you did not have loops in the writers?

> [...]
> 
> > > > 
> > > > This looks just like the swap with reader_page that I do, except you use
> > > > a table and I use the list.  How do you replenish the buf_rsb.pages if
> > > > the splice keeps the page you just received active?
> > > 
> > > I don't allow other reads to proceed as long as splice is holding pages that
> > > belong to the reader-owned subbuffer. The read semantic is basically:
> > > 
> > > ring_buffer_open_read() /* only one reader at a time can open a ring buffer */
> > > get_subbuf_size()
> > > while (buffer is not finalized and empty) {
> > >   poll()
> > >   ret = ring_buffer_get_subbuf()
> > >   if (!ret)
> > >     continue;
> > >   /* The splice ops below can be performed in multiple calls, e.g. first splice
> > >    * only a portion of a subbuffer to a pipe, then splice to the disk/network,
> > >    * and move to the next subbuffer portion until all the subbuffer is sent.
> > >    */
> > >   splice one subbuffer worth of data to a pipe
> > >   splice the data from pipe to disk/network
> > >   ring_buffer_put_subbuf()
> > > }
> > > ring_buffer_close_read()
> > > 
> > > The reader code above works both with flight recorder and non-overwrite mode.
> > > 
> > > The code above assumes that upon return from the splice() to disk/network,
> > > splice() is not using the pages anymore (I assume that splice() performs the
> > > transfer synchronously with the call).
> > > 
> > > The VFS interface I use for get_subbuf_size(), ring_buffer_get_subbuf() and
> > > ring_buffer_put_subbuf() are new ioctls. Note that these can be used for both
> > > splice() and mmap() types of backend access, as they only call into the
> > > frontend.
> > 
> > Hmm, so basically you lose pages until they are returned. I guess I can
> > trivially add the same thing now to the current ring buffer.
> 
> Yep. Having the ability to keep an array of pages (rather that just a single
> page at a time) allows splice() to move many pages at once efficiently, while
> permitting this "pages owned by the readers, lend to splice() until it returns"
> simplification. I also never have to allocate pages while tracing: all the pages
> I need are allocated when the buffer is created (and at the special case of cpu
> hotplug, but this is expected for per-cpu buffers).

Yep, I just wrote the changes for my buffer. It was rather trivial ;-)
Although, I just set a hard coded 16 pages max of cached pages.

-- Steve

> 
> In addition, this would play well with mmap() too: we can simply add a
> ring_buffer_get_mmap_offset() method to the backend (exported through another
> ioctl) that would let user-space know the start of the mmap'd buffer range
> currently owned by the reader. So we can inform user-space of the currently
> owned page range without even changing the underlying memory map.
> 
> Thanks,
> 
> Mathieu
> 
> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-13 16:46                                       ` Steven Rostedt
@ 2010-05-13 17:10                                         ` Mathieu Desnoyers
  2010-05-13 18:33                                           ` Steven Rostedt
  0 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2010-05-13 17:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Thu, 2010-05-13 at 12:31 -0400, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > On Thu, 2010-05-13 at 09:20 -0400, Mathieu Desnoyers wrote:
> > [...]
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > > 97 /**
> > > > > > 98  * ring_buffer_clear_noref_flag - Clear the noref subbuffer flag, for writer.
> > > > > > 99  */
> > > > > > 100 static __inline__
> > > > > > 101 void ring_buffer_clear_noref_flag(struct ring_buffer_backend *bufb,
> > > > > > 102                                   unsigned long idx)
> > > > > > 103 {
> > > > > > 104         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
> > > > > > 105
> > > > > > 106         sb_pages = bufb->buf_wsb[idx].pages;
> > > > > > 107         for (;;) {
> > > > > > 108                 if (!RCHAN_SB_IS_NOREF(sb_pages))
> > > > > > 109                         return; /* Already writing to this buffer */
> > > > > > 110                 new_sb_pages = sb_pages;
> > > > > > 111                 RCHAN_SB_CLEAR_NOREF(new_sb_pages);
> > > > > > 112                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
> > > > > > 113                         sb_pages, new_sb_pages);
> > > > > > 114                 if (likely(new_sb_pages == sb_pages))
> > > > > > 115                         break;
> > > > > > 116                 sb_pages = new_sb_pages;
> > > > > 
> > > > > The writer calls this??
> > > > 
> > > > Yes. But the common case (for each event) is simply a
> > > > "if (!RCHAN_SB_IS_NOREF(sb_pages))" test that returns. The cmpxchg() is only
> > > > performed at subbuffer boundary.
> > > 
> > > Is the cmpxchg only contending with other writers?
> > 
> > No. Would have this been the case, I would have used a cmpxchg_local(). This
> > cmpxchg used to deal with subbuffer swap is touching the subbuffer "pages"
> > pointer, which can be updated concurrently by other writers as well as readers.
> > 
> > The writer clears the noref flags when starting to write in a subbuffers, and
> > sets it when delivering the subbuffer (when it is fully committed).
> > 
> > The reader can only ever swap the subbuffer with the one it owns if the noref
> > flag is set. The reader uses a cmpxchg() too to perform the swap.
> > 
> 
> So the writer can loop because of a reader? And theoretically, a
> aggressive reader that is constantly swapping pages can halt a writer?

Hehe, I'm glad you ask :) This has been one of my major concerns with this
algorithm.

Taken from my thesis:

8.6 Formal Verification of LTTng
8.6.3 Real-Time Impact

"The second synchronization added by the overwrite mode involves letting the
reader thread exchange individual sub-buffers with its own extra sub-buffer
before reading them. This exchange is performed with a CAS instruction which
verified if the use flag is set by the writer. If the sub-buffer is in use, the
reader will retry later.  On the writer side, a CAS instruction is used in a
busy-loop to set and clear the use flag (for portability, because an atomic
instruction to set a mask is not available for all architecture in the Linux
kernel). Given that the reader can only ever exchange a sub-buffer and cause the
writer to restart once (further accesses to that sub-buffers are impossible,
because the reader will block on the writer position), the reader thread cannot
starve the writer."

IOW, if the reader is trying to get subbuffers like crazy, it will maybe cause
one loop restart when taking the subbuffer the writer was about to write to, but
after that, the reader will be forced to exchange the next subbuffer, and so on
until it wraps-around and comes back to the subbuffer preceding the writer
position, as stop there. Note that during all this execution, the reader could
only cause a single retry of the writer. In summary, the frontend range
restrictions guarantee that the writer will retry at most once.

Note that we could also use an atomic "set bit/clear bit" for the writer, along
with write memory barriers, when available on the architecture, and that would
remove all need for loops. However, architectures that need to emulate set
bit/clear bit with a cmpxchg() would behave pretty much in the same way as my
current code does.

> 
> I thought you said you did not have loops in the writers?

Well, while technically a loop, it is a bounded single-retry, so it's OK as far
as wait-free guarantees are concerned. We could even put a WARN_ON() there to
ensure that we catch double retry that should never happen, but that would be at
the cost of a runtime veritication. A DEBUG_TRACER_RT config option could select
this though.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-13 17:10                                         ` Mathieu Desnoyers
@ 2010-05-13 18:33                                           ` Steven Rostedt
  0 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2010-05-13 18:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

On Thu, 2010-05-13 at 13:10 -0400, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:

> IOW, if the reader is trying to get subbuffers like crazy, it will maybe cause
> one loop restart when taking the subbuffer the writer was about to write to, but
> after that, the reader will be forced to exchange the next subbuffer, and so on
> until it wraps-around and comes back to the subbuffer preceding the writer
> position, as stop there. Note that during all this execution, the reader could
> only cause a single retry of the writer. In summary, the frontend range
> restrictions guarantee that the writer will retry at most once.
> 

And I take it that this is all verified in your proof.

> Note that we could also use an atomic "set bit/clear bit" for the writer, along
> with write memory barriers, when available on the architecture, and that would
> remove all need for loops. However, architectures that need to emulate set
> bit/clear bit with a cmpxchg() would behave pretty much in the same way as my
> current code does.
> 
> > 
> > I thought you said you did not have loops in the writers?

Just a note, that the current ring buffer has no such loops.

> 
> Well, while technically a loop, it is a bounded single-retry, so it's OK as far
> as wait-free guarantees are concerned. We could even put a WARN_ON() there to
> ensure that we catch double retry that should never happen, but that would be at
> the cost of a runtime veritication. A DEBUG_TRACER_RT config option could select
> this though.

No, no one would ever track that. As with todays ring buffer, we keep
all checks enabled. Sorry if it slows it down, but the kernel is at such
a flex that we must have these checks on all the time.

-- Steve



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Perf and ftrace [was Re: PyTimechart]
  2010-05-13 16:31                                     ` Mathieu Desnoyers
  2010-05-13 16:46                                       ` Steven Rostedt
@ 2010-05-14  7:53                                       ` Peter Zijlstra
  1 sibling, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2010-05-14  7:53 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem

On Thu, 2010-05-13 at 12:31 -0400, Mathieu Desnoyers wrote:
> 
> In addition, this would play well with mmap() too: we can simply add a
> ring_buffer_get_mmap_offset() method to the backend (exported through another
> ioctl) that would let user-space know the start of the mmap'd buffer range
> currently owned by the reader. So we can inform user-space of the currently
> owned page range without even changing the underlying memory map. 

I still think keeping refs to splice pages is tricky at best. Suppose
they're spliced into the pagecache of a file, it could stay there for a
long time under some conditions.

Also, the splice-client (say the pagecache) and the mmap will both want
the pageframe to contain different information.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2010-05-14  7:54 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-05-11 21:10 [RFC] PyTimechart Pierre Tardy
2010-05-11 21:36 ` Frederic Weisbecker
2010-05-12 13:37   ` Pierre Tardy
2010-05-12 14:48     ` Frederic Weisbecker
2010-05-12 15:36       ` Steven Rostedt
2010-05-12 16:46         ` Perf and ftrace [was Re: PyTimechart] Frederic Weisbecker
2010-05-12 17:00           ` Peter Zijlstra
2010-05-12 17:07             ` Mathieu Desnoyers
2010-05-12 17:47               ` Peter Zijlstra
2010-05-12 17:53                 ` Mathieu Desnoyers
2010-05-12 18:00                   ` Peter Zijlstra
2010-05-12 18:04                     ` Mathieu Desnoyers
2010-05-12 18:08                       ` Peter Zijlstra
2010-05-12 18:37                         ` Mathieu Desnoyers
2010-05-12 18:46                           ` Steven Rostedt
2010-05-12 20:27                             ` Mathieu Desnoyers
2010-05-12 22:21                               ` Steven Rostedt
2010-05-13 13:20                                 ` Mathieu Desnoyers
2010-05-13 15:42                                   ` Steven Rostedt
2010-05-13 16:31                                     ` Mathieu Desnoyers
2010-05-13 16:46                                       ` Steven Rostedt
2010-05-13 17:10                                         ` Mathieu Desnoyers
2010-05-13 18:33                                           ` Steven Rostedt
2010-05-14  7:53                                       ` Peter Zijlstra
2010-05-12 18:49                           ` Peter Zijlstra
2010-05-12 18:51                             ` Mathieu Desnoyers
2010-05-12 19:01                               ` Peter Zijlstra
2010-05-12 20:52                                 ` Mathieu Desnoyers
2010-05-12 17:11             ` Ingo Molnar
2010-05-12 16:59         ` [RFC] PyTimechart Ingo Molnar
2010-05-12 17:15           ` Steven Rostedt
2010-05-12 18:23             ` Ingo Molnar
2010-05-12 17:13       ` Pierre Tardy
2010-05-13 15:02         ` Arnaldo Carvalho de Melo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).