linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Ingo Molnar <mingo@elte.hu>, LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Steven Rostedt <rostedt@rostedt.homelinux.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Christoph Hellwig <hch@lst.de>, Li Zefan <lizf@cn.fujitsu.com>,
	Lai Jiangshan <laijs@cn.fujitsu.com>,
	Johannes Berg <johannes.berg@intel.com>,
	Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>,
	Arnaldo Carvalho de Melo <acme@infradead.org>,
	Tom Zanussi <tzanussi@gmail.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Andi Kleen <andi@firstfloor.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Jeremy Fitzhardinge <jeremy@goop.org>,
	"Frank Ch. Eigler" <fche@redhat.com>,
	Tejun Heo <htejun@gmail.com>
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe
Date: Wed, 4 Aug 2010 10:45:39 -0400	[thread overview]
Message-ID: <20100804144539.GA4617@Krystal> (raw)
In-Reply-To: <1280904410.1923.700.camel@laptop>

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2010-08-03 at 14:25 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Thu, 2010-07-15 at 12:26 -0400, Mathieu Desnoyers wrote:
> > > 
> > > > I was more thinking along the lines of making sure a ring buffer has the proper
> > > > support for your use-case. It shares a lot of requirements with a standard ring
> > > > buffer:
> > > > 
> > > > - Need to be lock-less
> > > > - Need to reserve space, write data in a buffer
> > > > 
> > > > By configuring a ring buffer with 4k sub-buffer size (that's configurable
> > > > dynamically), 
> > > 
> > > FWIW I really utterly detest the whole concept of sub-buffers.
> > 
> > This reluctance against splitting a buffer into sub-buffers might contribute to
> > explain the poor performance experienced with the Perf ring buffer.
> 
> That's just unsubstantiated FUD.

Extracted from:
http://lkml.org/lkml/2010/7/9/368

(executive summary)

* Throughput

   * Flight recorder mode

Ring Buffer Library        83 ns/entry (512kB sub-buffers, no reader)
                           89 ns/entry (512kB sub-buffers: read 0.3M entries/s)


Ftrace Ring Buffer:       103 ns/entry (no reader)
                          187 ns/entry (read by event:     read 0.4M entries/s)

Perf record               (flight recorder mode unavailable)


   * Discard mode

Ring Buffer Library:      96 ns/entry discarded
                          257 ns/entry written (read: 2.8M entries/s)

Perf Ring Buffer:         423 ns/entry written (read: 2.3M entries/s)
(Note that this number is based on the perf event approximation output (based on
a 24 bytes/entry estimation) rather than the benchmark module count due its
inaccuracy, which is caused by perf not letting the benchmark module know about
discarded events.)

It is really hard to get a clear picture of the data write overhead with perf,
because you _need_ to consume data. Making perf support flight recorder mode
would really help getting benchmarks that are easier to compare.

> 
> >  These
> > "sub-buffers" are really nothing new: these are called "periods" in the audio
> > world. They help lowering the ring buffer performance overhead because:
> > 
> > 1) They allow writing into the ring buffer without SMP-safe synchronization
> > primitives and memory barriers for each record. Synchronization is only needed
> > across sub-buffer boundaries, which amortizes the cost over a large number of
> > events.
> 
> The only SMP barrier we (should) have is when we update the user visible
> head pointer. The buffer code itself uses local{,64}_t for all other
> atomic ops.
> 
> If you want to amortize that barrier, simply hold off the head update
> for a while, no need to introduce sub-buffers.

I understand your point about amortized synchronization. However I still don't
see how you can achieve flight recorder mode, efficient seek on multi-GB traces
without reading the whole event stream, and live streaming without sub-buffers
(and, ideally, without much headhaches involved). ;)

> 
> > 2) They are much more splice (and, in general, page-exchange) friendly, because
> > records written after a synchronization point start at the beginning of a page.
> > This removes the need for extra copies.
> 
> This just doesn't make any sense at all, I could splice full pages just
> fine, splice keeps page order so these synchronization points aren't
> critical in any way.

If you need to read non-filled pages, then you need to splice pages piece-wise.
This does not fit well with flight recorder tracing, for which the solution
Steven and I have found is to atomically exchange pages (for Ftrace) or
sub-buffers (for the generic ring buffer library) between the reader and writer.

> 
> The only problem I have with splice atm is that we don't have a buffer
> interface without mmap() and we cannot splice pages out from under
> mmap() on all architectures in a sane manner.

The problem Perf has is probably more with flight recorder (overwrite) tracing
support than splice() per se, in this you are right.

> 
> > So I have to ask: do you detest the sub-buffer concept only because you are tied
> > to the current Perf userspace ABI which cannot support this without an ABI
> > change ?
> 
> No because I don't see the point.

OK, good to know you are open to ABI changes if I present convincing arguments.

> 
> > I'm trying to help out here, but it does not make the task easy if we have both
> > hands tied in our back because we have to keep backward ABI compatibility for a
> > tool (perf) forever, even considering its sources are shipped with the kernel.
> 
> Dude, its a published user<->kernel ABI, also you're not saying why you
> would want to break it. In your other email you allude to things like
> flight recorder mode, that could be done with the current set-up, no
> need to break the ABI at all. All you need to do is track the tail
> pointer and publish it.

How do you plan to read the data concurrently with the writer overwriting the
data while you are reading it without corruption ?

> 
> > Nope. I'm thinking that we can use a buffer just to save the stack as we call
> > functions and return, e.g.
> 
> We don't have a callback on function entry, and I'm not going to use
> mcount for that, that's simply insane.

OK, now I get a clearer picture of what Frederic is trying to do.

> 
> > call X -> reserve space to save "X" and arguments.
> > call Y -> same for Y.
> > call Z -> same for Z.
> > return -> discard event for Z.
> > return -> discard event for Y.
> > 
> > if we grab the buffer content at that point, then we have X and its arguments,
> > which is the function currently executed. That would require the ability to
> > uncommit and unreserve an event, which is not a problem as long as we have not
> > committed a full sub-buffer.
> 
> Again, I'm not really seeing the point of using sub-buffers at all.

This part of the email is unrelated to sub-buffers.

> 
> Also, what happens when we write an event after Y? Then the discard must
> fail or turn Y into a NOP, leaving a hole in the buffer.

Given that this buffer is simply used to dump the stack unwind result then I
think my scenario above was simply mislead.

> 
> > I thought that this buffer was chasing the function entry/exits rather than
> > doing a stack unwind, but I might be wrong. Perhaps Frederic could tell us more
> > about his use-case ?
> 
> No, its a pure stack unwind from NMI context. When we get an event (PMI,
> tracepoint, whatever) we write out event, if the consumer asked for a
> stacktrace with each event, we unwind the stack for him.

So why the copy ? Frederic seems to put the stack unwind in a special temporary
buffer. Why is it not saved directly into the trace buffers ?

> > > Additionally, if you have multiple consumers you can simply copy the
> > > stacktrace again, avoiding the whole pointer chase exercise. While you
> > > could conceivably copy from one ringbuffer into another that will result
> > > in very nasty serialization issues.
> > 
> > Assuming Frederic is saving information to this stack-like ring buffer at each
> > function entry and discarding at each function return, then we don't have the
> > pointer chase.
> > 
> > What I am proposing does not even involve a copy: when we want to take a
> > snapshot, we just have to force a sub-buffer switch on the ring buffer. The
> > "returns" happening at the beginning of the next (empty) sub-buffer would
> > clearly fail to discard records (expecting non-existing entry records). We would
> > then have to save a small record saying that a function return occurred. The
> > current stack frame at the end of the next sub-buffer could be deduced from the
> > complete collection of stack frame samples.
> 
> And suppose the stack-trace was all of 16 entries (not uncommon for a
> kernel stack), then you waste a whole page for 128 bytes (assuming your
> sub-buffer is page sized). I'll take the memcopy, thank you.

Well, now that I understand what you are trying to achieve, I retract my
proposal of using a stack-like ring buffer for this. I think that the stack dump
should simply be saved directly to the ring buffer, without copy. The
dump_stack() functions might have to be extended so they don't just save text
dumbly, but can also be used to save events into the trace in binary format,
perhaps with the continuation cookie Linus was proposing.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

  parent reply	other threads:[~2010-08-04 14:45 UTC|newest]

Thread overview: 163+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-07-14 15:49 [patch 0/2] x86: NMI-safe trap handlers Mathieu Desnoyers
2010-07-14 15:49 ` [patch 1/2] x86_64 page fault NMI-safe Mathieu Desnoyers
2010-07-14 16:28   ` Linus Torvalds
2010-07-14 17:06     ` Mathieu Desnoyers
2010-07-14 18:10       ` Linus Torvalds
2010-07-14 18:46         ` Ingo Molnar
2010-07-14 19:14           ` Linus Torvalds
2010-07-14 19:36             ` Frederic Weisbecker
2010-07-14 19:54               ` Linus Torvalds
2010-07-14 20:17                 ` Mathieu Desnoyers
2010-07-14 20:55                   ` Linus Torvalds
2010-07-14 21:18                     ` Ingo Molnar
2010-07-14 22:14                 ` Frederic Weisbecker
2010-07-14 22:31                   ` Mathieu Desnoyers
2010-07-14 22:48                     ` Frederic Weisbecker
2010-07-14 23:11                       ` Mathieu Desnoyers
2010-07-14 23:38                         ` Frederic Weisbecker
2010-07-15 16:26                           ` Mathieu Desnoyers
2010-08-03 17:18                             ` Peter Zijlstra
2010-08-03 18:25                               ` Mathieu Desnoyers
2010-08-04  6:46                                 ` Peter Zijlstra
2010-08-04  7:14                                   ` Ingo Molnar
2010-08-04 14:45                                   ` Mathieu Desnoyers [this message]
2010-08-04 14:56                                     ` Peter Zijlstra
2010-08-06  1:49                                       ` Mathieu Desnoyers
2010-08-06  9:51                                         ` Peter Zijlstra
2010-08-06 13:46                                           ` Mathieu Desnoyers
2010-08-06  6:18                                       ` Masami Hiramatsu
2010-08-06  9:50                                         ` Peter Zijlstra
2010-08-06 13:37                                           ` Mathieu Desnoyers
2010-08-07  9:51                                           ` Masami Hiramatsu
2010-08-09 16:53                                           ` Frederic Weisbecker
2010-08-03 18:56                               ` Linus Torvalds
2010-08-03 19:45                                 ` Mathieu Desnoyers
2010-08-03 20:02                                   ` Linus Torvalds
2010-08-03 20:10                                     ` Ingo Molnar
2010-08-03 20:21                                       ` Ingo Molnar
2010-08-03 21:16                                         ` Mathieu Desnoyers
2010-08-03 20:54                                     ` Mathieu Desnoyers
2010-08-04  6:27                                 ` Peter Zijlstra
2010-08-04 14:06                                   ` Mathieu Desnoyers
2010-08-04 14:50                                     ` Peter Zijlstra
2010-08-06  1:42                                       ` Mathieu Desnoyers
2010-08-06 10:11                                         ` Peter Zijlstra
2010-08-06 11:14                                           ` Peter Zijlstra
2010-08-06 14:15                                             ` Mathieu Desnoyers
2010-08-06 14:13                                           ` Mathieu Desnoyers
2010-08-11 14:44                                             ` Steven Rostedt
2010-08-11 14:34                                   ` Steven Rostedt
2010-08-15 13:35                                     ` Mathieu Desnoyers
2010-08-15 16:33                                     ` Avi Kivity
2010-08-15 16:44                                       ` Mathieu Desnoyers
2010-08-15 16:51                                         ` Avi Kivity
2010-08-15 18:31                                           ` Mathieu Desnoyers
2010-08-16 10:49                                             ` Avi Kivity
2010-08-16 11:29                                             ` Avi Kivity
2010-08-04  6:46                                 ` Dave Chinner
2010-08-04  7:21                                   ` Ingo Molnar
2010-07-14 23:40                         ` Steven Rostedt
2010-07-14 19:41             ` Linus Torvalds
2010-07-14 19:56               ` Andi Kleen
2010-07-14 20:05                 ` Mathieu Desnoyers
2010-07-14 20:07                   ` Andi Kleen
2010-07-14 20:08                     ` H. Peter Anvin
2010-07-14 23:32                       ` Tejun Heo
2010-07-14 22:31                   ` Frederic Weisbecker
2010-07-14 22:56                     ` Linus Torvalds
2010-07-14 23:09                       ` Andi Kleen
2010-07-14 23:22                         ` Linus Torvalds
2010-07-15 14:11                       ` Frederic Weisbecker
2010-07-15 14:35                         ` Andi Kleen
2010-07-16 11:21                           ` Frederic Weisbecker
2010-07-15 14:46                         ` Steven Rostedt
2010-07-16 10:47                           ` Frederic Weisbecker
2010-07-16 11:43                             ` Steven Rostedt
2010-07-15 14:51                         ` Linus Torvalds
2010-07-15 15:38                           ` Linus Torvalds
2010-07-16 12:00                           ` Frederic Weisbecker
2010-07-16 12:54                             ` Steven Rostedt
2010-07-14 20:39         ` Mathieu Desnoyers
2010-07-14 21:23           ` Linus Torvalds
2010-07-14 21:45             ` Maciej W. Rozycki
2010-07-14 21:52               ` Linus Torvalds
2010-07-14 22:31                 ` Maciej W. Rozycki
2010-07-14 22:21             ` Mathieu Desnoyers
2010-07-14 22:37               ` Linus Torvalds
2010-07-14 22:51                 ` Jeremy Fitzhardinge
2010-07-14 23:02                   ` Linus Torvalds
2010-07-14 23:54                     ` Jeremy Fitzhardinge
2010-07-15  1:23                 ` Linus Torvalds
2010-07-15  1:45                   ` Linus Torvalds
2010-07-15 18:31                     ` Mathieu Desnoyers
2010-07-15 18:43                       ` Linus Torvalds
2010-07-15 18:48                         ` Linus Torvalds
2010-07-15 22:01                           ` Mathieu Desnoyers
2010-07-15 22:16                             ` Linus Torvalds
2010-07-15 22:24                               ` H. Peter Anvin
2010-07-15 22:26                               ` Linus Torvalds
2010-07-15 22:46                                 ` H. Peter Anvin
2010-07-15 22:58                                 ` Andi Kleen
2010-07-15 23:20                                   ` H. Peter Anvin
2010-07-15 23:23                                     ` Linus Torvalds
2010-07-15 23:41                                       ` H. Peter Anvin
2010-07-15 23:44                                         ` Linus Torvalds
2010-07-15 23:46                                           ` H. Peter Anvin
2010-07-15 23:48                                       ` Andi Kleen
2010-07-15 22:30                               ` Mathieu Desnoyers
2010-07-16 19:13                             ` Mathieu Desnoyers
2010-07-15 16:44                   ` Mathieu Desnoyers
2010-07-15 16:49                     ` Linus Torvalds
2010-07-15 17:38                       ` Mathieu Desnoyers
2010-07-15 20:44                         ` H. Peter Anvin
2010-07-18 11:03                   ` Avi Kivity
2010-07-18 17:36                     ` Linus Torvalds
2010-07-18 18:04                       ` Avi Kivity
2010-07-18 18:22                         ` Linus Torvalds
2010-07-19  7:32                           ` Avi Kivity
2010-07-18 18:17                       ` Linus Torvalds
2010-07-18 18:43                         ` Steven Rostedt
2010-07-18 19:26                           ` Linus Torvalds
2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
2010-07-14 16:42   ` Maciej W. Rozycki
2010-07-14 18:12     ` Mathieu Desnoyers
2010-07-14 19:21       ` Maciej W. Rozycki
2010-07-14 19:58         ` Mathieu Desnoyers
2010-07-14 20:36           ` Maciej W. Rozycki
2010-07-16 12:28   ` Avi Kivity
2010-07-16 14:49     ` Mathieu Desnoyers
2010-07-16 15:34       ` Andi Kleen
2010-07-16 15:40         ` Mathieu Desnoyers
2010-07-16 16:47       ` Avi Kivity
2010-07-16 16:58         ` Mathieu Desnoyers
2010-07-16 17:54           ` Avi Kivity
2010-07-16 18:05             ` H. Peter Anvin
2010-07-16 18:15               ` Avi Kivity
2010-07-16 18:17                 ` H. Peter Anvin
2010-07-16 18:28                   ` Avi Kivity
2010-07-16 18:37                     ` Linus Torvalds
2010-07-16 19:26                       ` Avi Kivity
2010-07-16 21:39                         ` Linus Torvalds
2010-07-16 22:07                           ` Andi Kleen
2010-07-16 22:26                             ` Linus Torvalds
2010-07-16 22:41                               ` Andi Kleen
2010-07-17  1:15                                 ` Linus Torvalds
2010-07-16 22:40                             ` Mathieu Desnoyers
2010-07-18  9:23                           ` Avi Kivity
2010-07-16 18:22                 ` Mathieu Desnoyers
2010-07-16 18:32                   ` Avi Kivity
2010-07-16 19:29                     ` H. Peter Anvin
2010-07-16 19:39                       ` Avi Kivity
2010-07-16 19:32                     ` Andi Kleen
2010-07-16 18:25                 ` Linus Torvalds
2010-07-16 19:30                   ` Andi Kleen
2010-07-18  9:26                     ` Avi Kivity
2010-07-16 19:28               ` Andi Kleen
2010-07-16 19:32                 ` Avi Kivity
2010-07-16 19:34                   ` Andi Kleen
2010-08-04  9:46               ` Peter Zijlstra
2010-08-04 20:23                 ` H. Peter Anvin
2010-07-14 17:06 ` [patch 0/2] x86: NMI-safe trap handlers Andi Kleen
2010-07-14 17:08   ` Mathieu Desnoyers
2010-07-14 18:56     ` Andi Kleen
2010-07-14 23:29       ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100804144539.GA4617@Krystal \
    --to=mathieu.desnoyers@efficios.com \
    --cc=acme@infradead.org \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=fche@redhat.com \
    --cc=fweisbec@gmail.com \
    --cc=hch@lst.de \
    --cc=hpa@zytor.com \
    --cc=htejun@gmail.com \
    --cc=jeremy@goop.org \
    --cc=johannes.berg@intel.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=laijs@cn.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizf@cn.fujitsu.com \
    --cc=masami.hiramatsu.pt@hitachi.com \
    --cc=mingo@elte.hu \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=rostedt@rostedt.homelinux.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=tzanussi@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).