public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH 0/2 v2] Unified trace buffer (take two)
@ 2008-09-25 15:58 Steven Rostedt
  2008-09-25 15:58 ` [RFC PATCH 1/2 v2] Unified trace buffer Steven Rostedt
  2008-09-25 15:58 ` [RFC PATCH 2/2 v2] ftrace: make work with new ring buffer Steven Rostedt
  0 siblings, 2 replies; 10+ messages in thread
From: Steven Rostedt @ 2008-09-25 15:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig


Again: this is a proof of concept, just spitting out code for comments.

Here's my second attempt.

Changes since version 1:

 - Ripped away all the debugfs and event registration from ring buffers.

 - Removed the mergesort from the ringbuffer and pushed that up to the
   tracer.

 - Changed the event header to what Linus suggested (we can discuss this
   and try other suggestions for v3, namely Peter Zijlstra's ideas).

 struct {
 	u32 time_delta:27, type:5;
	u32 data;
	u64 array[];
 };

 - Added a timestamp at the beginning of each page and implemented a way
   for all events to recover the full timestamp from the previous one.

 - The changes to ftrace in this release are much smaller than in the
   first one.

Comments?

-- Steve



* [RFC PATCH 1/2 v2] Unified trace buffer
  2008-09-25 15:58 [RFC PATCH 0/2 v2] Unified trace buffer (take two) Steven Rostedt
@ 2008-09-25 15:58 ` Steven Rostedt
  2008-09-25 17:02   ` Linus Torvalds
  2008-09-25 15:58 ` [RFC PATCH 2/2 v2] ftrace: make work with new ring buffer Steven Rostedt
  1 sibling, 1 reply; 10+ messages in thread
From: Steven Rostedt @ 2008-09-25 15:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Steven Rostedt

[-- Attachment #1: ring-buffer.patch --]
[-- Type: text/plain, Size: 37742 bytes --]

This is probably very buggy. I ran it as a back end for ftrace but only
tested the irqsoff and ftrace tracers. The selftests are busted with it.

But this is an attempt to get a unified buffering system that was
talked about at the LPC meeting.

Now that it boots and runs (albeit a bit buggy), I decided to post it.
This is one idea I had for handling it.

I tried to make it as simple as possible.

I'm not going to explain all the stuff I'm doing here, since this code
is under a lot of flux (RFC, POC work), and I don't want to keep updating
this change log. When we finally agree on something, I'll make this
change log worthy.

If you want to know what this patch does, the code below explains it :-p

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |  191 +++++++
 kernel/trace/Kconfig        |    3 
 kernel/trace/Makefile       |    1 
 kernel/trace/ring_buffer.c  | 1172 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1367 insertions(+)

Index: linux-compile.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-compile.git/include/linux/ring_buffer.h	2008-09-25 01:23:42.000000000 -0400
@@ -0,0 +1,191 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use the inline items below.
+ */
+struct ring_buffer_event {
+	u32		time_delta:27, type:5;
+	u32		data;
+	u64		array[];
+} __attribute__((__packed__));
+
+/*
+ * Types recommended by Linus Torvalds. Yeah, he didn't say
+ * this was a requirement, but it sounded good regardless.
+ */
+enum {
+	RB_TYPE_PADDING,	/* Left over page padding
+				 * (data is ignored)
+				 * size is variable depending on
+				 * the left over space on the page.
+				 */
+	RB_TYPE_TIME_EXTENT,	/* Extend the time delta
+				 * data = time delta (28 .. 59)
+				 * size = 8 bytes
+				 */
+	/* FIXME: RB_TYPE_TIME_STAMP not implemented */
+	RB_TYPE_TIME_STAMP,	/* Sync time stamp with external clock
+				 * data = tv_nsec
+				 * array[0] = tv_sec
+				 * size = 16 bytes
+				 */
+	RB_TYPE_SMALL_DATA,	/* Data that can fit in a page
+				 * data is length in bytes
+				 * array[0 .. (len+7)/8] = data
+				 * size = (len+15) & ~7
+				 */
+	/* FIXME: These are not implemented */
+	RB_TYPE_LARGE_DATA,	/* Data pointing to larger data.
+				 * data = 32-bit length of binary data
+				 * array[0] = 64-bit binary pointer to data
+				 * array[1] = 64-bit pointer to free function
+				 * size = 24
+				 */
+	RB_TYPE_STRING,		/* ASCII data
+				 * data = number of arguments
+				 * array[0] = 64-bit pointer to format string
+				 * array[1..args] = argument values
+				 * size = 8*(2+args)
+				 */
+};
+
+#define RB_EVNT_HDR_SIZE (sizeof(struct ring_buffer_event))
+#define RB_ALIGNMENT_SHIFT	3
+#define RB_ALIGNMENT		(1 << RB_ALIGNMENT_SHIFT)
+
+enum {
+	RB_LEN_TIME_EXTENT = 8,
+	RB_LEN_TIME_STAMP = 16,
+	RB_LEN_LARGE_DATA = 24,
+};
+
+/**
+ * ring_buffer_event_length - return the length of the event
+ * @event: the event to get the length of
+ *
+ * Note, if the event is bigger than 256 bytes, the length
+ * can not be held in the shifted 5 bits. The length is then
+ * added as a short (unshifted) in the body.
+ */
+static inline unsigned
+ring_buffer_event_length(struct ring_buffer_event *event)
+{
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		/* undefined */
+		return -1;
+
+	case RB_TYPE_TIME_EXTENT:
+		return RB_LEN_TIME_EXTENT;
+
+	case RB_TYPE_TIME_STAMP:
+		return RB_LEN_TIME_STAMP;
+
+	case RB_TYPE_SMALL_DATA:
+		return (event->data+15) & ~7;
+
+	case RB_TYPE_LARGE_DATA:
+		return RB_LEN_LARGE_DATA;
+
+	case RB_TYPE_STRING:
+		return (2 + event->data) << 3;
+
+	default:
+		BUG();
+	}
+	/* not hit */
+	return 0;
+}
+
+/**
+ * ring_buffer_event_time_delta - return the delta timestamp of the event
+ * @event: the event to get the delta timestamp of
+ *
+ * The delta timestamp is the 27 bit timestamp since the last event.
+ */
+static inline unsigned
+ring_buffer_event_time_delta(struct ring_buffer_event *event)
+{
+	return event->time_delta;
+}
+
+/**
+ * ring_buffer_event_data - return the data of the event
+ * @event: the event to get the data from
+ *
+ * Note, if the length of the event is more than 256 bytes, the
+ * length field is stored in the body. We need to return
+ * after the length field in that case.
+ */
+static inline void *
+ring_buffer_event_data(struct ring_buffer_event *event)
+{
+	BUG_ON(event->type != RB_TYPE_SMALL_DATA);
+	return (void *)&event->array[0];
+}
+
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags);
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags);
+
+/*
+ * size is in bytes for each per CPU buffer.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags);
+void ring_buffer_free(struct ring_buffer *buffer);
+
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
+
+void *ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			       unsigned long length,
+			       unsigned long *flags);
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      void *data, unsigned long flags);
+void *ring_buffer_write(struct ring_buffer *buffer,
+			unsigned long length, void *data);
+
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts);
+
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu);
+void ring_buffer_read_finish(struct ring_buffer_iter *iter);
+
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts);
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter);
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
+
+unsigned long ring_buffer_size(struct ring_buffer *buffer);
+
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_reset(struct ring_buffer *buffer);
+
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu);
+
+int ring_buffer_empty(struct ring_buffer *buffer);
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu);
+
+void ring_buffer_disable(struct ring_buffer *buffer);
+void ring_buffer_enable(struct ring_buffer *buffer);
+
+unsigned long ring_buffer_entries(struct ring_buffer *buffer);
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer);
+
+enum ring_buffer_flags {
+	RB_FL_OVERWRITE		= 1 << 0,
+};
+
+#endif /* _LINUX_RING_BUFFER_H */
Index: linux-compile.git/kernel/trace/ring_buffer.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-compile.git/kernel/trace/ring_buffer.c	2008-09-25 11:47:07.000000000 -0400
@@ -0,0 +1,1172 @@
+/*
+ * Generic ring buffer
+ *
+ * Copyright (C) 2008 Steven Rostedt <srostedt@redhat.com>
+ */
+#include <linux/ring_buffer.h>
+#include <linux/spinlock.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/init.h>
+#include <linux/hash.h>
+#include <linux/list.h>
+#include <linux/fs.h>
+
+#include "trace.h"
+
+#define sdr_print(x, y...) printk("%s:%d " x "\n", __FUNCTION__, __LINE__, y)
+
+/* FIXME!!! */
+unsigned long long
+ring_buffer_time_stamp(int cpu)
+{
+	return sched_clock();
+}
+
+#define TS_SHIFT	27
+#define TS_MASK		((1ULL << TS_SHIFT) - 1)
+#define TS_DELTA_TEST	~TS_MASK
+
+/*
+ * We need to fit the time_stamp delta into 27 bits.
+ * Plus, a time stamp delta of (-1) is a special flag.
+ */
+static inline int
+test_time_stamp(unsigned long long delta)
+{
+	if ((delta + 1) & TS_DELTA_TEST)
+		return 1;
+	return 0;
+}
+
+struct buffer_page {
+	u64		time_stamp;
+	unsigned char	body[];
+};
+
+#define BUF_PAGE_SIZE (PAGE_SIZE - sizeof(u64))
+
+/*
+ * If head_page == tail_page and head == tail, then the buffer is empty.
+ */
+struct ring_buffer_per_cpu {
+	int			cpu;
+	struct ring_buffer	*buffer;
+	raw_spinlock_t		lock;
+	struct lock_class_key	lock_key;
+	struct buffer_page	**pages;
+	unsigned long		head;	/* read from head */
+	unsigned long		tail;	/* write to tail */
+	unsigned long		head_page;
+	unsigned long		tail_page;
+	unsigned long		overrun;
+	unsigned long		entries;
+	u64			last_stamp;
+	u64			read_stamp;
+	atomic_t		record_disabled;
+};
+
+struct ring_buffer {
+	unsigned long		size;
+	unsigned		pages;
+	unsigned		flags;
+	int			cpus;
+	atomic_t		record_disabled;
+
+	spinlock_t		lock;
+	struct mutex		mutex;
+
+	/* FIXME: this should be online CPUS */
+	struct ring_buffer_per_cpu *buffers[NR_CPUS];
+};
+
+struct ring_buffer_iter {
+	struct ring_buffer_per_cpu	*cpu_buffer;
+	unsigned long			head;
+	unsigned long			head_page;
+	u64				read_stamp;
+};
+
+static struct ring_buffer_per_cpu *
+ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int pages = buffer->pages;
+	int i;
+
+	cpu_buffer = kzalloc_node(ALIGN(sizeof(*cpu_buffer), cache_line_size()),
+				  GFP_KERNEL, cpu_to_node(cpu));
+	if (!cpu_buffer)
+		return NULL;
+
+	cpu_buffer->cpu = cpu;
+	cpu_buffer->buffer = buffer;
+	cpu_buffer->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
+
+	cpu_buffer->pages = kzalloc_node(ALIGN(sizeof(void *) * pages,
+					       cache_line_size()), GFP_KERNEL,
+					 cpu_to_node(cpu));
+	if (!cpu_buffer->pages)
+		goto fail_free_buffer;
+
+	for (i = 0; i < pages; i++) {
+		cpu_buffer->pages[i] = (void *)get_zeroed_page(GFP_KERNEL);
+		if (!cpu_buffer->pages[i])
+			goto fail_free_pages;
+	}
+
+	return cpu_buffer;
+
+ fail_free_pages:
+	for (i = 0; i < pages; i++) {
+		if (cpu_buffer->pages[i])
+			free_page((unsigned long)cpu_buffer->pages[i]);
+	}
+	kfree(cpu_buffer->pages);
+
+ fail_free_buffer:
+	kfree(cpu_buffer);
+	return NULL;
+}
+
+static void
+ring_buffer_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	int i;
+
+	for (i = 0; i < cpu_buffer->buffer->pages; i++) {
+		if (cpu_buffer->pages[i])
+			free_page((unsigned long)cpu_buffer->pages[i]);
+	}
+	kfree(cpu_buffer->pages);
+	kfree(cpu_buffer);
+}
+
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags)
+{
+	struct ring_buffer *buffer;
+	int cpu;
+
+	/* keep it in its own cache line */
+	buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()),
+			 GFP_KERNEL);
+	if (!buffer)
+		return NULL;
+
+	buffer->pages = (size + (PAGE_SIZE - 1)) / PAGE_SIZE;
+	buffer->flags = flags;
+
+	/* need at least two pages */
+	if (buffer->pages == 1)
+		buffer->pages++;
+
+	/* FIXME: do for only online CPUS */
+	buffer->cpus = num_possible_cpus();
+	for_each_possible_cpu(cpu) {
+		if (cpu >= buffer->cpus)
+			continue;
+		buffer->buffers[cpu] =
+			ring_buffer_allocate_cpu_buffer(buffer, cpu);
+		if (!buffer->buffers[cpu])
+			goto fail_free_buffers;
+	}
+
+	spin_lock_init(&buffer->lock);
+	mutex_init(&buffer->mutex);
+
+	return buffer;
+
+ fail_free_buffers:
+	for_each_possible_cpu(cpu) {
+		if (cpu >= buffer->cpus)
+			continue;
+		if (buffer->buffers[cpu])
+			ring_buffer_free_cpu_buffer(buffer->buffers[cpu]);
+	}
+
+	kfree(buffer);
+	return NULL;
+}
+
+/**
+ * ring_buffer_free - free a ring buffer.
+ * @buffer: the buffer to free.
+ */
+void
+ring_buffer_free(struct ring_buffer *buffer)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++)
+		ring_buffer_free_cpu_buffer(buffer->buffers[cpu]);
+
+	kfree(buffer);
+}
+
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
+{
+	/* FIXME: */
+	return -1;
+}
+
+static inline int
+ring_buffer_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->head_page == cpu_buffer->tail_page &&
+		cpu_buffer->head == cpu_buffer->tail;
+}
+
+static inline int
+ring_buffer_null_event(struct ring_buffer_event *event)
+{
+	return event->type == RB_TYPE_PADDING;
+}
+
+static inline void *
+rb_page_body(struct ring_buffer_per_cpu *cpu_buffer,
+		      unsigned long page, unsigned index)
+{
+	return cpu_buffer->pages[page]->body + index;
+}
+
+static inline struct ring_buffer_event *
+ring_buffer_head_event(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return rb_page_body(cpu_buffer, cpu_buffer->head_page,
+			    cpu_buffer->head);
+}
+
+static inline struct ring_buffer_event *
+ring_buffer_iter_head_event(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	return rb_page_body(cpu_buffer, iter->head_page,
+			    iter->head);
+}
+
+static void
+ring_buffer_update_overflow(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned long head;
+
+	for (head = 0; head < BUF_PAGE_SIZE;
+	     head += ring_buffer_event_length(event)) {
+		event = rb_page_body(cpu_buffer, cpu_buffer->head_page, head);
+		if (ring_buffer_null_event(event))
+			break;
+		cpu_buffer->overrun++;
+		cpu_buffer->entries--;
+	}
+}
+
+static inline void
+ring_buffer_inc_page(struct ring_buffer *buffer,
+		     unsigned long *page)
+{
+	(*page)++;
+	if (*page >= buffer->pages)
+		*page = 0;
+}
+
+static inline void
+rb_add_stamp(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts)
+{
+	struct buffer_page *bpage;
+
+	bpage = cpu_buffer->pages[cpu_buffer->tail_page];
+	bpage->time_stamp = *ts;
+}
+
+static void
+rb_reset_read_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct buffer_page *bpage;
+
+	cpu_buffer->head = 0;
+	bpage = cpu_buffer->pages[cpu_buffer->head_page];
+	cpu_buffer->read_stamp = bpage->time_stamp;
+}
+
+static void
+rb_reset_iter_read_page(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+	struct buffer_page *bpage;
+
+	iter->head = 0;
+	bpage = cpu_buffer->pages[iter->head_page];
+	iter->read_stamp = bpage->time_stamp;
+}
+
+/**
+ * ring_buffer_update_event - update event type and data
+ * @event: the event to update
+ * @type: the type of event
+ * @length: the size of the event field in the ring buffer
+ *
+ * Update the type and data fields of the event. The length
+ * is the actual size that is written to the ring buffer,
+ * and with this, we can determine what to place into the
+ * data field.
+ */
+static inline void
+ring_buffer_update_event(struct ring_buffer_event *event,
+			 unsigned type, unsigned length)
+{
+	event->type = type;
+
+	switch (type) {
+		/* ignore fixed size types */
+	case RB_TYPE_PADDING:
+	case RB_TYPE_TIME_EXTENT:
+	case RB_TYPE_TIME_STAMP:
+	case RB_TYPE_LARGE_DATA:
+		break;
+
+	case RB_TYPE_SMALL_DATA:
+		event->data = length - 16;
+		break;
+
+	case RB_TYPE_STRING:
+		event->data = (length >> 3) - 2;
+		break;
+	}
+}
+
+static struct ring_buffer_event *
+__ring_buffer_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
+			   unsigned type, unsigned long length, u64 *ts)
+{
+	unsigned long head_page, tail_page, tail;
+	struct ring_buffer *buffer = cpu_buffer->buffer;
+	struct ring_buffer_event *event;
+
+	tail_page = cpu_buffer->tail_page;
+	head_page = cpu_buffer->head_page;
+	tail = cpu_buffer->tail;
+
+	BUG_ON(tail_page >= buffer->pages);
+	BUG_ON(head_page >= buffer->pages);
+
+	if (tail + length > BUF_PAGE_SIZE) {
+		unsigned long next_page = tail_page;
+
+		ring_buffer_inc_page(buffer, &next_page);
+
+		if (next_page == head_page) {
+			if (!(buffer->flags & RB_FL_OVERWRITE))
+				return NULL;
+
+			/* count overflows */
+			ring_buffer_update_overflow(cpu_buffer);
+
+			ring_buffer_inc_page(buffer, &head_page);
+			cpu_buffer->head_page = head_page;
+			rb_reset_read_page(cpu_buffer);
+		}
+
+		if (tail != BUF_PAGE_SIZE) {
+			event = rb_page_body(cpu_buffer, tail_page, tail);
+			/* page padding */
+			event->type = RB_TYPE_PADDING;
+		}
+
+		tail = 0;
+		tail_page = next_page;
+		cpu_buffer->tail_page = tail_page;
+		cpu_buffer->tail = tail;
+		rb_add_stamp(cpu_buffer, ts);
+	}
+
+	BUG_ON(tail_page >= buffer->pages);
+	BUG_ON(tail + length > BUF_PAGE_SIZE);
+
+	event = rb_page_body(cpu_buffer, tail_page, tail);
+	ring_buffer_update_event(event, type, length);
+	cpu_buffer->entries++;
+
+	return event;
+}
+
+static struct ring_buffer_event *
+ring_buffer_reserve_next_event(struct ring_buffer_per_cpu *cpu_buffer,
+			       unsigned type, unsigned long length)
+{
+	unsigned long long ts, delta;
+	struct ring_buffer_event *event;
+
+	ts = ring_buffer_time_stamp(cpu_buffer->cpu);
+
+	if (cpu_buffer->tail) {
+		delta = ts - cpu_buffer->last_stamp;
+
+		if (test_time_stamp(delta)) {
+			/*
+			 * The delta is too big, we need to add
+			 * a new timestamp.
+			 */
+			event = __ring_buffer_reserve_next(cpu_buffer,
+							   RB_TYPE_TIME_EXTENT,
+							   RB_LEN_TIME_EXTENT,
+							   &ts);
+			if (!event)
+				return NULL;
+
+			/* check to see if we went to the next page */
+			if (!cpu_buffer->tail) {
+				/*
+				 * new page, don't commit this and add the
+				 * time stamp to the page instead.
+				 */
+				rb_add_stamp(cpu_buffer, &ts);
+			} else {
+				event->time_delta = delta & TS_MASK;
+				event->data = delta >> TS_SHIFT;
+			}
+
+			cpu_buffer->last_stamp = ts;
+			delta = 0;
+		}
+	} else {
+		rb_add_stamp(cpu_buffer, &ts);
+		delta = 0;
+	}
+
+	event = __ring_buffer_reserve_next(cpu_buffer, type, length, &ts);
+	if (!event)
+		return NULL;
+
+	event->time_delta = delta;
+	cpu_buffer->last_stamp = ts;
+
+	return event;
+}
+
+/**
+ * ring_buffer_lock_reserve - reserve a part of the buffer
+ * @buffer: the ring buffer to reserve from
+ * @length: the length of the data to reserve (excluding event header)
+ * @flags: a pointer to save the interrupt flags
+ *
+ * Returns a location on the ring buffer to copy directly to.
+ * The length is the length of the data needed, not the event length
+ * which also includes the event header.
+ *
+ * Must be paired with ring_buffer_unlock_commit, unless NULL is returned.
+ * If NULL is returned, then nothing has been allocated or locked.
+ */
+void *ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			       unsigned long length,
+			       unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return NULL;
+
+	raw_local_irq_save(*flags);
+	cpu = raw_smp_processor_id();
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto no_record;
+
+	length += RB_EVNT_HDR_SIZE;
+	length = ALIGN(length, 8);
+	if (length > BUF_PAGE_SIZE)
+		goto no_record;
+
+	event = ring_buffer_reserve_next_event(cpu_buffer,
+					       RB_TYPE_SMALL_DATA, length);
+	if (!event)
+		goto no_record;
+
+	return ring_buffer_event_data(event);
+
+ no_record:
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(*flags);
+	return NULL;
+}
+
+/**
+ * ring_buffer_unlock_commit - commit a reserved event
+ * @buffer: The buffer to commit to
+ * @data: The data pointer to commit.
+ * @flags: the interrupt flags received from ring_buffer_lock_reserve.
+ *
+ * This commits the data to the ring buffer, and releases any locks held.
+ *
+ * Must be paired with ring_buffer_lock_reserve.
+ */
+int ring_buffer_unlock_commit(struct ring_buffer *buffer, void *data, unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	int cpu = raw_smp_processor_id();
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	event = container_of(data, struct ring_buffer_event, array);
+	cpu_buffer->tail += ring_buffer_event_length(event);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+
+	return 0;
+}
+
+/**
+ * ring_buffer_write - write data to the buffer without reserving
+ * @buffer: The ring buffer to write to.
+ * @length: The length of the data being written (excluding the event header)
+ * @data: The data to write to the buffer.
+ *
+ * This is like ring_buffer_lock_reserve and ring_buffer_unlock_commit as
+ * one function. If you already have the data to write to the buffer, it
+ * may be easier to simply call this function.
+ *
+ * Note, like ring_buffer_lock_reserve, the length is the length of the data
+ * and not the length of the event which would hold the header.
+ */
+void *ring_buffer_write(struct ring_buffer *buffer,
+			unsigned long length,
+			void *data)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned long event_length, flags;
+	void *ret = NULL;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return NULL;
+
+	local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto out;
+
+	event_length = ALIGN(length + RB_EVNT_HDR_SIZE, 8);
+	event = ring_buffer_reserve_next_event(cpu_buffer,
+					       RB_TYPE_SMALL_DATA, event_length);
+	if (!event)
+		goto out;
+
+	ret = ring_buffer_event_data(event);
+
+	memcpy(ret, data, length);
+	cpu_buffer->tail += event_length;
+
+ out:
+	__raw_spin_unlock(&cpu_buffer->lock);
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+/**
+ * ring_buffer_lock - lock the ring buffer
+ * @buffer: The ring buffer to lock
+ * @flags: The place to store the interrupt flags
+ *
+ * This locks all the per CPU buffers.
+ *
+ * Must be unlocked by ring_buffer_unlock.
+ */
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	local_irq_save(*flags);
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_lock(&cpu_buffer->lock);
+	}
+}
+
+/**
+ * ring_buffer_unlock - unlock a locked buffer
+ * @buffer: The locked buffer to unlock
+ * @flags: The interrupt flags received by ring_buffer_lock
+ */
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	for (cpu = buffer->cpus - 1; cpu >= 0; cpu--) {
+
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_unlock(&cpu_buffer->lock);
+	}
+
+	local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_record_disable - stop all writes into the buffer
+ * @buffer: The ring buffer to stop writes to.
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ */
+void ring_buffer_record_disable(struct ring_buffer *buffer)
+{
+	atomic_inc(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable(struct ring_buffer *buffer)
+{
+	atomic_dec(&buffer->record_disabled);
+}
+
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_inc(&cpu_buffer->record_disabled);
+}
+
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_entries_cpu - get the number of entries in a cpu buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the entries from.
+ */
+unsigned long ring_buffer_entries_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->entries;
+}
+
+/**
+ * ring_buffer_overrun_cpu - get the number of overruns in a cpu_buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the number of overruns from
+ */
+unsigned long ring_buffer_overrun_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->overrun;
+}
+
+/**
+ * ring_buffer_entries - get the number of entries in a buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of entries in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_entries(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long entries = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		entries += cpu_buffer->entries;
+	}
+
+	return entries;
+}
+
+/**
+ * ring_buffer_overruns - get the number of overruns in the buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of overruns in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long overruns = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		overruns += cpu_buffer->overrun;
+	}
+
+	return overruns;
+}
+
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter)
+{
+	iter->head_page = 0;
+	iter->head = 0;
+}
+
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = iter->cpu_buffer;
+
+	return iter->head_page == cpu_buffer->tail_page &&
+		iter->head == cpu_buffer->tail;
+}
+
+static void
+ring_buffer_advance_head(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer *buffer = cpu_buffer->buffer;
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	event = ring_buffer_head_event(cpu_buffer);
+	/*
+	 * Check if we are at the end of the buffer.
+	 * For fixed length, we need to check if we can fit
+	 *  another entry on the page.
+	 * Otherwise we need to see if the end is a null
+	 *  pointer.
+	 */
+	if (ring_buffer_null_event(event)) {
+		BUG_ON(cpu_buffer->head_page == cpu_buffer->tail_page);
+		ring_buffer_inc_page(buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		return;
+	}
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the header if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((cpu_buffer->head_page == cpu_buffer->tail_page) &&
+	       (cpu_buffer->head + length > cpu_buffer->tail));
+
+	cpu_buffer->head += length;
+
+	/* check for end of page padding */
+	event = ring_buffer_head_event(cpu_buffer);
+	if (ring_buffer_null_event(event) &&
+	    (cpu_buffer->head_page != cpu_buffer->tail_page))
+		ring_buffer_advance_head(cpu_buffer);
+}
+
+static void
+ring_buffer_advance_iter(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+	event = ring_buffer_iter_head_event(iter);
+
+	/*
+	 * Check if we are at the end of the buffer.
+	 * For fixed length, we need to check if we can fit
+	 *  another entry on the page.
+	 * Otherwise we need to see if the end is a null
+	 *  pointer.
+	 */
+	if (ring_buffer_null_event(event)) {
+		BUG_ON(iter->head_page == cpu_buffer->tail_page);
+		ring_buffer_inc_page(buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		return;
+	}
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the header if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((iter->head_page == cpu_buffer->tail_page) &&
+	       (iter->head + length > cpu_buffer->tail));
+
+	iter->head += length;
+
+	/* check for end of page padding */
+	event = ring_buffer_iter_head_event(iter);
+	if (ring_buffer_null_event(event) &&
+	    (iter->head_page != cpu_buffer->tail_page))
+		ring_buffer_advance_iter(iter);
+}
+
+/**
+ * ring_buffer_peek - peek at the next event to be read
+ * @buffer: The ring buffer to read from
+ * @cpu: The per CPU buffer to peek at
+ * @ts: Updated with the timestamp of the returned event
+ *
+ * This will return the event that will be read next, but does not consume it.
+ */
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	u64 delta;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+ again:
+	if (ring_buffer_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = ring_buffer_head_event(cpu_buffer);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		ring_buffer_inc_page(buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->data;
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		cpu_buffer->read_stamp += delta;
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		goto again;
+
+	case RB_TYPE_SMALL_DATA:
+	case RB_TYPE_LARGE_DATA:
+	case RB_TYPE_STRING:
+		if (ts)
+			*ts = cpu_buffer->read_stamp + event->time_delta;
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_iter_peek - peek at the next event the iterator will read
+ * @iter: The ring buffer iterator
+ * @ts: Updated with the timestamp of the returned event
+ *
+ * This will return the event that will be read next, but does
+ * not increment the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	u64 delta;
+
+	if (ring_buffer_iter_empty(iter))
+		return NULL;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+ again:
+	if (ring_buffer_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = ring_buffer_iter_head_event(iter);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		ring_buffer_inc_page(buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->data;
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		iter->read_stamp += delta;
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		goto again;
+
+	case RB_TYPE_SMALL_DATA:
+	case RB_TYPE_LARGE_DATA:
+	case RB_TYPE_STRING:
+		if (ts)
+			*ts = iter->read_stamp + event->time_delta;
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_consume - return an event and consume it
+ * @buffer: The ring buffer to get the next event from
+ *
+ * Returns the next event in the ring buffer, and consumes it. That is,
+ * sequential calls will keep returning different events, eventually
+ * emptying the ring buffer if the producer is slower than the reader.
+ */
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_peek(buffer, cpu, ts);
+	if (!event)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+	ring_buffer_advance_head(cpu_buffer);
+
+	return event;
+}
+
+/**
+ * ring_buffer_read_start - start a non consuming read of the buffer
+ * @buffer: The ring buffer to read from
+ * @cpu: The CPU buffer to iterate over
+ *
+ * This starts up an iteration through the buffer. It also disables
+ * the recording to the buffer until the reading is finished.
+ * This prevents the reading from being corrupted. This is not
+ * a consuming read, so a producer is not expected.
+ *
+ * Must be paired with ring_buffer_read_finish.
+ */
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_iter *iter;
+
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	iter->cpu_buffer = cpu_buffer;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+
+	__raw_spin_lock(&cpu_buffer->lock);
+	iter->head = cpu_buffer->head;
+	iter->head_page = cpu_buffer->head_page;
+	rb_reset_iter_read_page(iter);
+	__raw_spin_unlock(&cpu_buffer->lock);
+
+	return iter;
+}
+
+/**
+ * ring_buffer_read_finish - finish reading the iterator of the buffer
+ * @iter: The iterator retrieved by ring_buffer_read_start
+ *
+ * This re-enables the recording to the buffer, and frees the
+ * iterator.
+ */
+void
+ring_buffer_read_finish(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	atomic_dec(&cpu_buffer->record_disabled);
+	kfree(iter);
+}
+
+/**
+ * ring_buffer_read - read the next item in the ring buffer by the iterator
+ * @iter: The ring buffer iterator
+ * @ts: Filled in with the timestamp of the event, if non NULL
+ *
+ * This reads the next event in the ring buffer and increments the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_iter_peek(iter, ts);
+	if (!event)
+		return NULL;
+
+	ring_buffer_advance_iter(iter);
+
+	return event;
+}
+
+/**
+ * ring_buffer_size - return the size of the ring buffer (in bytes)
+ * @buffer: The ring buffer.
+ */
+unsigned long ring_buffer_size(struct ring_buffer *buffer)
+{
+	return PAGE_SIZE * buffer->pages;
+}
+
+static void
+__ring_buffer_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	cpu_buffer->head_page = cpu_buffer->tail_page = 0;
+	cpu_buffer->head = cpu_buffer->tail = 0;
+	cpu_buffer->overrun = 0;
+	cpu_buffer->entries = 0;
+}
+
+/**
+ * ring_buffer_reset_cpu - reset a ring buffer per CPU buffer
+ * @buffer: The ring buffer to reset a per cpu buffer of
+ * @cpu: The CPU buffer to be reset
+ */
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
+	unsigned long flags;
+
+	raw_local_irq_save(flags);
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	__ring_buffer_reset_cpu(cpu_buffer);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_reset - reset all CPU buffers of a ring buffer
+ * @buffer: The ring buffer to reset
+ */
+void ring_buffer_reset(struct ring_buffer *buffer)
+{
+	unsigned long flags;
+	int cpu;
+
+	ring_buffer_lock(buffer, &flags);
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++)
+		__ring_buffer_reset_cpu(buffer->buffers[cpu]);
+
+	ring_buffer_unlock(buffer, flags);
+}
+
+/**
+ * ring_buffer_empty - is the ring buffer empty?
+ * @buffer: The ring buffer to test
+ */
+int ring_buffer_empty(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	/* yes this is racy, but if you don't like the race, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		if (!ring_buffer_per_cpu_empty(cpu_buffer))
+			return 0;
+	}
+	return 1;
+}
+
+/**
+ * ring_buffer_empty_cpu - is a cpu buffer of a ring buffer empty?
+ * @buffer: The ring buffer
+ * @cpu: The CPU buffer to test
+ */
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	/* yes this is racy, but if you don't like the race, lock the buffer */
+	cpu_buffer = buffer->buffers[cpu];
+	return ring_buffer_per_cpu_empty(cpu_buffer);
+}
+
+/**
+ * ring_buffer_swap_cpu - swap a CPU buffer between two ring buffers
+ * @buffer_a: One buffer to swap with
+ * @buffer_b: The other buffer to swap with
+ *
+ * This function is useful for tracers that want to take a "snapshot"
+ * of a CPU buffer and have another backup buffer lying around.
+ * It is expected that the tracer handles the cpu buffer not being
+ * used at the moment.
+ */
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer_a;
+	struct ring_buffer_per_cpu *cpu_buffer_b;
+
+	/* At least make sure the two buffers are somewhat the same */
+	if (buffer_a->size != buffer_b->size ||
+	    buffer_a->pages != buffer_b->pages)
+		return -EINVAL;
+
+	cpu_buffer_a = buffer_a->buffers[cpu];
+	cpu_buffer_b = buffer_b->buffers[cpu];
+
+	atomic_inc(&cpu_buffer_a->record_disabled);
+	atomic_inc(&cpu_buffer_b->record_disabled);
+
+	buffer_a->buffers[cpu] = cpu_buffer_b;
+	buffer_b->buffers[cpu] = cpu_buffer_a;
+
+	cpu_buffer_b->buffer = buffer_a;
+	cpu_buffer_a->buffer = buffer_b;
+
+	atomic_dec(&cpu_buffer_a->record_disabled);
+	atomic_dec(&cpu_buffer_b->record_disabled);
+
+	return 0;
+}
+
Index: linux-compile.git/kernel/trace/Kconfig
===================================================================
--- linux-compile.git.orig/kernel/trace/Kconfig	2008-09-24 13:21:18.000000000 -0400
+++ linux-compile.git/kernel/trace/Kconfig	2008-09-24 19:31:01.000000000 -0400
@@ -15,6 +15,9 @@ config TRACING
 	select DEBUG_FS
 	select STACKTRACE
 
+config RING_BUFFER
+	bool "ring buffer"
+
 config FTRACE
 	bool "Kernel Function Tracer"
 	depends on HAVE_FTRACE
Index: linux-compile.git/kernel/trace/Makefile
===================================================================
--- linux-compile.git.orig/kernel/trace/Makefile	2008-09-24 13:21:18.000000000 -0400
+++ linux-compile.git/kernel/trace/Makefile	2008-09-24 19:31:01.000000000 -0400
@@ -11,6 +11,7 @@ obj-y += trace_selftest_dynamic.o
 endif
 
 obj-$(CONFIG_FTRACE) += libftrace.o
+obj-$(CONFIG_RING_BUFFER) += ring_buffer.o
 
 obj-$(CONFIG_TRACING) += trace.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o

-- 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 2/2 v2] ftrace: make work with new ring buffer
  2008-09-25 15:58 [RFC PATCH 0/2 v2] Unified trace buffer (take two) Steven Rostedt
  2008-09-25 15:58 ` [RFC PATCH 1/2 v2] Unified trace buffer Steven Rostedt
@ 2008-09-25 15:58 ` Steven Rostedt
  1 sibling, 0 replies; 10+ messages in thread
From: Steven Rostedt @ 2008-09-25 15:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Steven Rostedt

[-- Attachment #1: ftrace-ring-buffer-take-two.patch --]
[-- Type: text/plain, Size: 40851 bytes --]

Note: This patch is a proof of concept, and breaks a lot of
 functionality of ftrace.

This patch simply makes ftrace work with the developmental ring buffer.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/trace/trace.c              |  776 ++++++++------------------------------
 kernel/trace/trace.h              |   22 -
 kernel/trace/trace_functions.c    |    2 
 kernel/trace/trace_irqsoff.c      |    6 
 kernel/trace/trace_mmiotrace.c    |   10 
 kernel/trace/trace_sched_switch.c |    2 
 kernel/trace/trace_sched_wakeup.c |    2 
 7 files changed, 195 insertions(+), 625 deletions(-)

Index: linux-compile.git/kernel/trace/trace.c
===================================================================
--- linux-compile.git.orig/kernel/trace/trace.c	2008-09-24 19:48:11.000000000 -0400
+++ linux-compile.git/kernel/trace/trace.c	2008-09-25 09:55:04.000000000 -0400
@@ -31,25 +31,24 @@
 #include <linux/writeback.h>
 
 #include <linux/stacktrace.h>
+#include <linux/ring_buffer.h>
 
 #include "trace.h"
 
+#define sdr_print(x, y...) printk("%s:%d " x "\n", __FUNCTION__, __LINE__, y)
+
+#define TRACE_BUFFER_FLAGS	(RB_FL_OVERWRITE)
+
 unsigned long __read_mostly	tracing_max_latency = (cycle_t)ULONG_MAX;
 unsigned long __read_mostly	tracing_thresh;
 
-static unsigned long __read_mostly	tracing_nr_buffers;
 static cpumask_t __read_mostly		tracing_buffer_mask;
 
 #define for_each_tracing_cpu(cpu)	\
 	for_each_cpu_mask(cpu, tracing_buffer_mask)
 
-static int trace_alloc_page(void);
-static int trace_free_page(void);
-
 static int tracing_disabled = 1;
 
-static unsigned long tracing_pages_allocated;
-
 long
 ns2usecs(cycle_t nsec)
 {
@@ -100,11 +99,11 @@ static int			tracer_enabled = 1;
 int				ftrace_function_enabled;
 
 /*
- * trace_nr_entries is the number of entries that is allocated
- * for a buffer. Note, the number of entries is always rounded
- * to ENTRIES_PER_PAGE.
+ * trace_buf_size is the size in bytes that is allocated
+ * for a buffer. Note, the number of bytes is always rounded
+ * to page size.
  */
-static unsigned long		trace_nr_entries = 65536UL;
+static unsigned long		trace_buf_size = 65536UL;
 
 /* trace_types holds a link list of available tracers. */
 static struct tracer		*trace_types __read_mostly;
@@ -139,8 +138,8 @@ static notrace void no_trace_init(struct
 
 	ftrace_function_enabled = 0;
 	if(tr->ctrl)
-		for_each_online_cpu(cpu)
-			tracing_reset(tr->data[cpu]);
+		for_each_tracing_cpu(cpu)
+			tracing_reset(tr, cpu);
 	tracer_enabled = 0;
 }
 
@@ -167,23 +166,21 @@ void trace_wake_up(void)
 		wake_up(&trace_wait);
 }
 
-#define ENTRIES_PER_PAGE (PAGE_SIZE / sizeof(struct trace_entry))
-
-static int __init set_nr_entries(char *str)
+static int __init set_buf_size(char *str)
 {
-	unsigned long nr_entries;
+	unsigned long buf_size;
 	int ret;
 
 	if (!str)
 		return 0;
-	ret = strict_strtoul(str, 0, &nr_entries);
+	ret = strict_strtoul(str, 0, &buf_size);
 	/* nr_entries can not be zero */
-	if (ret < 0 || nr_entries == 0)
+	if (ret < 0 || buf_size == 0)
 		return 0;
-	trace_nr_entries = nr_entries;
+	trace_buf_size = buf_size;
 	return 1;
 }
-__setup("trace_entries=", set_nr_entries);
+__setup("trace_buf_size=", set_buf_size);
 
 unsigned long nsecs_to_usecs(unsigned long nsecs)
 {
@@ -266,54 +263,6 @@ __update_max_tr(struct trace_array *tr, 
 	tracing_record_cmdline(current);
 }
 
-#define CHECK_COND(cond)			\
-	if (unlikely(cond)) {			\
-		tracing_disabled = 1;		\
-		WARN_ON(1);			\
-		return -1;			\
-	}
-
-/**
- * check_pages - integrity check of trace buffers
- *
- * As a safty measure we check to make sure the data pages have not
- * been corrupted.
- */
-int check_pages(struct trace_array_cpu *data)
-{
-	struct page *page, *tmp;
-
-	CHECK_COND(data->trace_pages.next->prev != &data->trace_pages);
-	CHECK_COND(data->trace_pages.prev->next != &data->trace_pages);
-
-	list_for_each_entry_safe(page, tmp, &data->trace_pages, lru) {
-		CHECK_COND(page->lru.next->prev != &page->lru);
-		CHECK_COND(page->lru.prev->next != &page->lru);
-	}
-
-	return 0;
-}
-
-/**
- * head_page - page address of the first page in per_cpu buffer.
- *
- * head_page returns the page address of the first page in
- * a per_cpu buffer. This also preforms various consistency
- * checks to make sure the buffer has not been corrupted.
- */
-void *head_page(struct trace_array_cpu *data)
-{
-	struct page *page;
-
-	if (list_empty(&data->trace_pages))
-		return NULL;
-
-	page = list_entry(data->trace_pages.next, struct page, lru);
-	BUG_ON(&page->lru == &data->trace_pages);
-
-	return page_address(page);
-}
-
 /**
  * trace_seq_printf - sequence printing of trace information
  * @s: trace sequence descriptor
@@ -460,34 +409,6 @@ trace_print_seq(struct seq_file *m, stru
 	trace_seq_reset(s);
 }
 
-/*
- * flip the trace buffers between two trace descriptors.
- * This usually is the buffers between the global_trace and
- * the max_tr to record a snapshot of a current trace.
- *
- * The ftrace_max_lock must be held.
- */
-static void
-flip_trace(struct trace_array_cpu *tr1, struct trace_array_cpu *tr2)
-{
-	struct list_head flip_pages;
-
-	INIT_LIST_HEAD(&flip_pages);
-
-	memcpy(&tr1->trace_head_idx, &tr2->trace_head_idx,
-		sizeof(struct trace_array_cpu) -
-		offsetof(struct trace_array_cpu, trace_head_idx));
-
-	check_pages(tr1);
-	check_pages(tr2);
-	list_splice_init(&tr1->trace_pages, &flip_pages);
-	list_splice_init(&tr2->trace_pages, &tr1->trace_pages);
-	list_splice_init(&flip_pages, &tr2->trace_pages);
-	BUG_ON(!list_empty(&flip_pages));
-	check_pages(tr1);
-	check_pages(tr2);
-}
-
 /**
  * update_max_tr - snapshot all trace buffers from global_trace to max_tr
  * @tr: tracer
@@ -500,17 +421,15 @@ flip_trace(struct trace_array_cpu *tr1, 
 void
 update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu)
 {
-	struct trace_array_cpu *data;
-	int i;
+	struct ring_buffer *buf = tr->buffer;
 
 	WARN_ON_ONCE(!irqs_disabled());
 	__raw_spin_lock(&ftrace_max_lock);
-	/* clear out all the previous traces */
-	for_each_tracing_cpu(i) {
-		data = tr->data[i];
-		flip_trace(max_tr.data[i], data);
-		tracing_reset(data);
-	}
+
+	tr->buffer = max_tr.buffer;
+	max_tr.buffer = buf;
+
+	ring_buffer_reset(tr->buffer);
 
 	__update_max_tr(tr, tsk, cpu);
 	__raw_spin_unlock(&ftrace_max_lock);
@@ -527,16 +446,15 @@ update_max_tr(struct trace_array *tr, st
 void
 update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu)
 {
-	struct trace_array_cpu *data = tr->data[cpu];
-	int i;
+	int ret;
 
 	WARN_ON_ONCE(!irqs_disabled());
 	__raw_spin_lock(&ftrace_max_lock);
-	for_each_tracing_cpu(i)
-		tracing_reset(max_tr.data[i]);
 
-	flip_trace(max_tr.data[cpu], data);
-	tracing_reset(data);
+	ring_buffer_reset(max_tr.buffer);
+	ret = ring_buffer_swap_cpu(max_tr.buffer, tr->buffer, cpu);
+
+	WARN_ON_ONCE(ret);
 
 	__update_max_tr(tr, tsk, cpu);
 	__raw_spin_unlock(&ftrace_max_lock);
@@ -573,7 +491,6 @@ int register_tracer(struct tracer *type)
 #ifdef CONFIG_FTRACE_STARTUP_TEST
 	if (type->selftest) {
 		struct tracer *saved_tracer = current_trace;
-		struct trace_array_cpu *data;
 		struct trace_array *tr = &global_trace;
 		int saved_ctrl = tr->ctrl;
 		int i;
@@ -585,10 +502,7 @@ int register_tracer(struct tracer *type)
 		 * If we fail, we do not register this tracer.
 		 */
 		for_each_tracing_cpu(i) {
-			data = tr->data[i];
-			if (!head_page(data))
-				continue;
-			tracing_reset(data);
+			tracing_reset(tr, i);
 		}
 		current_trace = type;
 		tr->ctrl = 0;
@@ -604,10 +518,7 @@ int register_tracer(struct tracer *type)
 		}
 		/* Only reset on passing, to avoid touching corrupted buffers */
 		for_each_tracing_cpu(i) {
-			data = tr->data[i];
-			if (!head_page(data))
-				continue;
-			tracing_reset(data);
+			tracing_reset(tr, i);
 		}
 		printk(KERN_CONT "PASSED\n");
 	}
@@ -653,13 +564,9 @@ void unregister_tracer(struct tracer *ty
 	mutex_unlock(&trace_types_lock);
 }
 
-void tracing_reset(struct trace_array_cpu *data)
+void tracing_reset(struct trace_array *tr, int cpu)
 {
-	data->trace_idx = 0;
-	data->overrun = 0;
-	data->trace_head = data->trace_tail = head_page(data);
-	data->trace_head_idx = 0;
-	data->trace_tail_idx = 0;
+	ring_buffer_reset_cpu(tr->buffer, cpu);
 }
 
 #define SAVED_CMDLINES 128
@@ -745,70 +652,6 @@ void tracing_record_cmdline(struct task_
 	trace_save_cmdline(tsk);
 }
 
-static inline struct list_head *
-trace_next_list(struct trace_array_cpu *data, struct list_head *next)
-{
-	/*
-	 * Roundrobin - but skip the head (which is not a real page):
-	 */
-	next = next->next;
-	if (unlikely(next == &data->trace_pages))
-		next = next->next;
-	BUG_ON(next == &data->trace_pages);
-
-	return next;
-}
-
-static inline void *
-trace_next_page(struct trace_array_cpu *data, void *addr)
-{
-	struct list_head *next;
-	struct page *page;
-
-	page = virt_to_page(addr);
-
-	next = trace_next_list(data, &page->lru);
-	page = list_entry(next, struct page, lru);
-
-	return page_address(page);
-}
-
-static inline struct trace_entry *
-tracing_get_trace_entry(struct trace_array *tr, struct trace_array_cpu *data)
-{
-	unsigned long idx, idx_next;
-	struct trace_entry *entry;
-
-	data->trace_idx++;
-	idx = data->trace_head_idx;
-	idx_next = idx + 1;
-
-	BUG_ON(idx * TRACE_ENTRY_SIZE >= PAGE_SIZE);
-
-	entry = data->trace_head + idx * TRACE_ENTRY_SIZE;
-
-	if (unlikely(idx_next >= ENTRIES_PER_PAGE)) {
-		data->trace_head = trace_next_page(data, data->trace_head);
-		idx_next = 0;
-	}
-
-	if (data->trace_head == data->trace_tail &&
-	    idx_next == data->trace_tail_idx) {
-		/* overrun */
-		data->overrun++;
-		data->trace_tail_idx++;
-		if (data->trace_tail_idx >= ENTRIES_PER_PAGE) {
-			data->trace_tail =
-				trace_next_page(data, data->trace_tail);
-			data->trace_tail_idx = 0;
-		}
-	}
-
-	data->trace_head_idx = idx_next;
-
-	return entry;
-}
-
 static inline void
 tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags)
 {
@@ -819,7 +662,6 @@ tracing_generic_entry_update(struct trac
 
 	entry->preempt_count	= pc & 0xff;
 	entry->pid		= (tsk) ? tsk->pid : 0;
-	entry->t		= ftrace_now(raw_smp_processor_id());
 	entry->flags = (irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) |
 		((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |
 		((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ : 0) |
@@ -833,15 +675,14 @@ trace_function(struct trace_array *tr, s
 	struct trace_entry *entry;
 	unsigned long irq_flags;
 
-	raw_local_irq_save(irq_flags);
-	__raw_spin_lock(&data->lock);
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, flags);
 	entry->type		= TRACE_FN;
 	entry->fn.ip		= ip;
 	entry->fn.parent_ip	= parent_ip;
-	__raw_spin_unlock(&data->lock);
-	raw_local_irq_restore(irq_flags);
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 }
 
 void
@@ -859,16 +700,13 @@ void __trace_mmiotrace_rw(struct trace_a
 	struct trace_entry *entry;
 	unsigned long irq_flags;
 
-	raw_local_irq_save(irq_flags);
-	__raw_spin_lock(&data->lock);
-
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, 0);
 	entry->type		= TRACE_MMIO_RW;
 	entry->mmiorw		= *rw;
-
-	__raw_spin_unlock(&data->lock);
-	raw_local_irq_restore(irq_flags);
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 
 	trace_wake_up();
 }
@@ -879,16 +717,13 @@ void __trace_mmiotrace_map(struct trace_
 	struct trace_entry *entry;
 	unsigned long irq_flags;
 
-	raw_local_irq_save(irq_flags);
-	__raw_spin_lock(&data->lock);
-
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, 0);
 	entry->type		= TRACE_MMIO_MAP;
 	entry->mmiomap		= *map;
-
-	__raw_spin_unlock(&data->lock);
-	raw_local_irq_restore(irq_flags);
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 
 	trace_wake_up();
 }
@@ -901,11 +736,14 @@ void __trace_stack(struct trace_array *t
 {
 	struct trace_entry *entry;
 	struct stack_trace trace;
+	unsigned long irq_flags;
 
 	if (!(trace_flags & TRACE_ITER_STACKTRACE))
 		return;
 
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, flags);
 	entry->type		= TRACE_STACK;
 
@@ -917,6 +755,7 @@ void __trace_stack(struct trace_array *t
 	trace.entries		= entry->stack.caller;
 
 	save_stack_trace(&trace);
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 }
 
 void
@@ -928,17 +767,16 @@ __trace_special(void *__tr, void *__data
 	struct trace_entry *entry;
 	unsigned long irq_flags;
 
-	raw_local_irq_save(irq_flags);
-	__raw_spin_lock(&data->lock);
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, 0);
 	entry->type		= TRACE_SPECIAL;
 	entry->special.arg1	= arg1;
 	entry->special.arg2	= arg2;
 	entry->special.arg3	= arg3;
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 	__trace_stack(tr, data, irq_flags, 4);
-	__raw_spin_unlock(&data->lock);
-	raw_local_irq_restore(irq_flags);
 
 	trace_wake_up();
 }
@@ -953,9 +791,9 @@ tracing_sched_switch_trace(struct trace_
 	struct trace_entry *entry;
 	unsigned long irq_flags;
 
-	raw_local_irq_save(irq_flags);
-	__raw_spin_lock(&data->lock);
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, flags);
 	entry->type		= TRACE_CTX;
 	entry->ctx.prev_pid	= prev->pid;
@@ -964,9 +802,8 @@ tracing_sched_switch_trace(struct trace_
 	entry->ctx.next_pid	= next->pid;
 	entry->ctx.next_prio	= next->prio;
 	entry->ctx.next_state	= next->state;
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 	__trace_stack(tr, data, flags, 5);
-	__raw_spin_unlock(&data->lock);
-	raw_local_irq_restore(irq_flags);
 }
 
 void
@@ -979,9 +816,9 @@ tracing_sched_wakeup_trace(struct trace_
 	struct trace_entry *entry;
 	unsigned long irq_flags;
 
-	raw_local_irq_save(irq_flags);
-	__raw_spin_lock(&data->lock);
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, flags);
 	entry->type		= TRACE_WAKE;
 	entry->ctx.prev_pid	= curr->pid;
@@ -990,9 +827,8 @@ tracing_sched_wakeup_trace(struct trace_
 	entry->ctx.next_pid	= wakee->pid;
 	entry->ctx.next_prio	= wakee->prio;
 	entry->ctx.next_state	= wakee->state;
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 	__trace_stack(tr, data, flags, 6);
-	__raw_spin_unlock(&data->lock);
-	raw_local_irq_restore(irq_flags);
 
 	trace_wake_up();
 }
@@ -1074,105 +910,66 @@ enum trace_file_type {
 };
 
 static struct trace_entry *
-trace_entry_idx(struct trace_array *tr, struct trace_array_cpu *data,
-		struct trace_iterator *iter, int cpu)
-{
-	struct page *page;
-	struct trace_entry *array;
-
-	if (iter->next_idx[cpu] >= tr->entries ||
-	    iter->next_idx[cpu] >= data->trace_idx ||
-	    (data->trace_head == data->trace_tail &&
-	     data->trace_head_idx == data->trace_tail_idx))
-		return NULL;
-
-	if (!iter->next_page[cpu]) {
-		/* Initialize the iterator for this cpu trace buffer */
-		WARN_ON(!data->trace_tail);
-		page = virt_to_page(data->trace_tail);
-		iter->next_page[cpu] = &page->lru;
-		iter->next_page_idx[cpu] = data->trace_tail_idx;
-	}
-
-	page = list_entry(iter->next_page[cpu], struct page, lru);
-	BUG_ON(&data->trace_pages == &page->lru);
-
-	array = page_address(page);
-
-	WARN_ON(iter->next_page_idx[cpu] >= ENTRIES_PER_PAGE);
-	return &array[iter->next_page_idx[cpu]];
-}
-
-static struct trace_entry *
-find_next_entry(struct trace_iterator *iter, int *ent_cpu)
+find_next_entry(struct trace_iterator *iter, int *ent_cpu, u64 *ent_ts)
 {
-	struct trace_array *tr = iter->tr;
+	struct ring_buffer *buffer = iter->tr->buffer;
+	struct ring_buffer_event *event;
 	struct trace_entry *ent, *next = NULL;
+	u64 next_ts = 0, ts;
 	int next_cpu = -1;
 	int cpu;
 
 	for_each_tracing_cpu(cpu) {
-		if (!head_page(tr->data[cpu]))
+		struct ring_buffer_iter *buf_iter;
+
+		if (ring_buffer_empty_cpu(buffer, cpu))
 			continue;
-		ent = trace_entry_idx(tr, tr->data[cpu], iter, cpu);
+
+		buf_iter = iter->buffer_iter[cpu];
+		event = ring_buffer_iter_peek(buf_iter, &ts);
+		ent = event ? ring_buffer_event_data(event) : NULL;
+
 		/*
 		 * Pick the entry with the smallest timestamp:
 		 */
-		if (ent && (!next || ent->t < next->t)) {
+		if (ent && (!next || ts < next_ts)) {
 			next = ent;
 			next_cpu = cpu;
+			next_ts = ts;
 		}
 	}
 
 	if (ent_cpu)
 		*ent_cpu = next_cpu;
 
+	if (ent_ts)
+		*ent_ts = next_ts;
+
 	return next;
 }
 
 static void trace_iterator_increment(struct trace_iterator *iter)
 {
 	iter->idx++;
-	iter->next_idx[iter->cpu]++;
-	iter->next_page_idx[iter->cpu]++;
-
-	if (iter->next_page_idx[iter->cpu] >= ENTRIES_PER_PAGE) {
-		struct trace_array_cpu *data = iter->tr->data[iter->cpu];
-
-		iter->next_page_idx[iter->cpu] = 0;
-		iter->next_page[iter->cpu] =
-			trace_next_list(data, iter->next_page[iter->cpu]);
-	}
+	ring_buffer_read(iter->buffer_iter[iter->cpu], NULL);
 }
 
 static void trace_consume(struct trace_iterator *iter)
 {
-	struct trace_array_cpu *data = iter->tr->data[iter->cpu];
-
-	data->trace_tail_idx++;
-	if (data->trace_tail_idx >= ENTRIES_PER_PAGE) {
-		data->trace_tail = trace_next_page(data, data->trace_tail);
-		data->trace_tail_idx = 0;
-	}
-
-	/* Check if we empty it, then reset the index */
-	if (data->trace_head == data->trace_tail &&
-	    data->trace_head_idx == data->trace_tail_idx)
-		data->trace_idx = 0;
+	ring_buffer_consume(iter->tr->buffer, iter->cpu, &iter->ts);
 }
 
 static void *find_next_entry_inc(struct trace_iterator *iter)
 {
 	struct trace_entry *next;
 	int next_cpu = -1;
+	u64 ts;
 
-	next = find_next_entry(iter, &next_cpu);
-
-	iter->prev_ent = iter->ent;
-	iter->prev_cpu = iter->cpu;
+	next = find_next_entry(iter, &next_cpu, &ts);
 
 	iter->ent = next;
 	iter->cpu = next_cpu;
+	iter->ts = ts;
 
 	if (next)
 		trace_iterator_increment(iter);
@@ -1210,7 +1007,7 @@ static void *s_start(struct seq_file *m,
 	struct trace_iterator *iter = m->private;
 	void *p = NULL;
 	loff_t l = 0;
-	int i;
+	int cpu;
 
 	mutex_lock(&trace_types_lock);
 
@@ -1229,12 +1026,9 @@ static void *s_start(struct seq_file *m,
 		iter->ent = NULL;
 		iter->cpu = 0;
 		iter->idx = -1;
-		iter->prev_ent = NULL;
-		iter->prev_cpu = -1;
 
-		for_each_tracing_cpu(i) {
-			iter->next_idx[i] = 0;
-			iter->next_page[i] = NULL;
+		for_each_tracing_cpu(cpu) {
+			ring_buffer_iter_reset(iter->buffer_iter[cpu]);
 		}
 
 		for (p = iter; p && l < *pos; p = s_next(m, p, &l))
@@ -1357,21 +1151,12 @@ print_trace_header(struct seq_file *m, s
 	struct tracer *type = current_trace;
 	unsigned long total   = 0;
 	unsigned long entries = 0;
-	int cpu;
 	const char *name = "preemption";
 
 	if (type)
 		name = type->name;
 
-	for_each_tracing_cpu(cpu) {
-		if (head_page(tr->data[cpu])) {
-			total += tr->data[cpu]->trace_idx;
-			if (tr->data[cpu]->trace_idx > tr->entries)
-				entries += tr->entries;
-			else
-				entries += tr->data[cpu]->trace_idx;
-		}
-	}
+	entries = ring_buffer_entries(iter->tr->buffer);
 
 	seq_printf(m, "%s latency trace v1.1.5 on %s\n",
 		   name, UTS_RELEASE);
@@ -1457,7 +1242,7 @@ lat_print_generic(struct trace_seq *s, s
 unsigned long preempt_mark_thresh = 100;
 
 static void
-lat_print_timestamp(struct trace_seq *s, unsigned long long abs_usecs,
+lat_print_timestamp(struct trace_seq *s, u64 abs_usecs,
 		    unsigned long rel_usecs)
 {
 	trace_seq_printf(s, " %4lldus", abs_usecs);
@@ -1476,20 +1261,22 @@ print_lat_fmt(struct trace_iterator *ite
 {
 	struct trace_seq *s = &iter->seq;
 	unsigned long sym_flags = (trace_flags & TRACE_ITER_SYM_MASK);
-	struct trace_entry *next_entry = find_next_entry(iter, NULL);
+	struct trace_entry *next_entry;
 	unsigned long verbose = (trace_flags & TRACE_ITER_VERBOSE);
 	struct trace_entry *entry = iter->ent;
 	unsigned long abs_usecs;
 	unsigned long rel_usecs;
+	u64 next_ts;
 	char *comm;
 	int S, T;
 	int i;
 	unsigned state;
 
+	next_entry = find_next_entry(iter, NULL, &next_ts);
 	if (!next_entry)
-		next_entry = entry;
-	rel_usecs = ns2usecs(next_entry->t - entry->t);
-	abs_usecs = ns2usecs(entry->t - iter->tr->time_start);
+		next_ts = iter->ts;
+	rel_usecs = ns2usecs(next_ts - iter->ts);
+	abs_usecs = ns2usecs(iter->ts - iter->tr->time_start);
 
 	if (verbose) {
 		comm = trace_find_cmdline(entry->pid);
@@ -1498,7 +1285,7 @@ print_lat_fmt(struct trace_iterator *ite
 				 comm,
 				 entry->pid, cpu, entry->flags,
 				 entry->preempt_count, trace_idx,
-				 ns2usecs(entry->t),
+				 ns2usecs(iter->ts),
 				 abs_usecs/1000,
 				 abs_usecs % 1000, rel_usecs/1000,
 				 rel_usecs % 1000);
@@ -1569,7 +1356,7 @@ static int print_trace_fmt(struct trace_
 
 	comm = trace_find_cmdline(iter->ent->pid);
 
-	t = ns2usecs(entry->t);
+	t = ns2usecs(iter->ts);
 	usec_rem = do_div(t, 1000000ULL);
 	secs = (unsigned long)t;
 
@@ -1660,7 +1447,7 @@ static int print_raw_fmt(struct trace_it
 	entry = iter->ent;
 
 	ret = trace_seq_printf(s, "%d %d %llu ",
-		entry->pid, iter->cpu, entry->t);
+		entry->pid, iter->cpu, iter->ts);
 	if (!ret)
 		return 0;
 
@@ -1725,7 +1512,7 @@ static int print_hex_fmt(struct trace_it
 
 	SEQ_PUT_HEX_FIELD_RET(s, entry->pid);
 	SEQ_PUT_HEX_FIELD_RET(s, iter->cpu);
-	SEQ_PUT_HEX_FIELD_RET(s, entry->t);
+	SEQ_PUT_HEX_FIELD_RET(s, iter->ts);
 
 	switch (entry->type) {
 	case TRACE_FN:
@@ -1769,7 +1556,7 @@ static int print_bin_fmt(struct trace_it
 
 	SEQ_PUT_FIELD_RET(s, entry->pid);
 	SEQ_PUT_FIELD_RET(s, entry->cpu);
-	SEQ_PUT_FIELD_RET(s, entry->t);
+	SEQ_PUT_FIELD_RET(s, iter->ts);
 
 	switch (entry->type) {
 	case TRACE_FN:
@@ -1796,16 +1583,11 @@ static int print_bin_fmt(struct trace_it
 
 static int trace_empty(struct trace_iterator *iter)
 {
-	struct trace_array_cpu *data;
 	int cpu;
 
 	for_each_tracing_cpu(cpu) {
-		data = iter->tr->data[cpu];
-
-		if (head_page(data) && data->trace_idx &&
-		    (data->trace_tail != data->trace_head ||
-		     data->trace_tail_idx != data->trace_head_idx))
-			return 0;
+		if (!ring_buffer_iter_empty(iter->buffer_iter[cpu]))
+			return 0;
 	}
 	return 1;
 }
@@ -1869,6 +1650,8 @@ static struct trace_iterator *
 __tracing_open(struct inode *inode, struct file *file, int *ret)
 {
 	struct trace_iterator *iter;
+	struct seq_file *m;
+	int cpu;
 
 	if (tracing_disabled) {
 		*ret = -ENODEV;
@@ -1889,28 +1672,43 @@ __tracing_open(struct inode *inode, stru
 	iter->trace = current_trace;
 	iter->pos = -1;
 
+	for_each_tracing_cpu(cpu) {
+		iter->buffer_iter[cpu] =
+			ring_buffer_read_start(iter->tr->buffer, cpu);
+		if (!iter->buffer_iter[cpu])
+			goto fail_buffer;
+	}
+
 	/* TODO stop tracer */
 	*ret = seq_open(file, &tracer_seq_ops);
-	if (!*ret) {
-		struct seq_file *m = file->private_data;
-		m->private = iter;
+	if (*ret)
+		goto fail_buffer;
 
-		/* stop the trace while dumping */
-		if (iter->tr->ctrl) {
-			tracer_enabled = 0;
-			ftrace_function_enabled = 0;
-		}
+	m = file->private_data;
+	m->private = iter;
 
-		if (iter->trace && iter->trace->open)
-			iter->trace->open(iter);
-	} else {
-		kfree(iter);
-		iter = NULL;
+	/* stop the trace while dumping */
+	if (iter->tr->ctrl) {
+		tracer_enabled = 0;
+		ftrace_function_enabled = 0;
 	}
+
+	if (iter->trace && iter->trace->open)
+		iter->trace->open(iter);
+
 	mutex_unlock(&trace_types_lock);
 
  out:
 	return iter;
+
+ fail_buffer:
+	for_each_tracing_cpu(cpu) {
+		if (iter->buffer_iter[cpu])
+			ring_buffer_read_finish(iter->buffer_iter[cpu]);
+	}
+	mutex_unlock(&trace_types_lock);
+
+	return ERR_PTR(-ENOMEM);
 }
 
 int tracing_open_generic(struct inode *inode, struct file *filp)
@@ -1926,8 +1724,14 @@ int tracing_release(struct inode *inode,
 {
 	struct seq_file *m = (struct seq_file *)file->private_data;
 	struct trace_iterator *iter = m->private;
+	int cpu;
 
 	mutex_lock(&trace_types_lock);
+	for_each_tracing_cpu(cpu) {
+		if (iter->buffer_iter[cpu])
+			ring_buffer_read_finish(iter->buffer_iter[cpu]);
+	}
+
 	if (iter->trace && iter->trace->close)
 		iter->trace->close(iter);
 
@@ -2500,13 +2304,10 @@ tracing_read_pipe(struct file *filp, cha
 		  size_t cnt, loff_t *ppos)
 {
 	struct trace_iterator *iter = filp->private_data;
-	struct trace_array_cpu *data;
-	static cpumask_t mask;
 	unsigned long flags;
 #ifdef CONFIG_FTRACE
 	int ftrace_save;
 #endif
-	int cpu;
 	ssize_t sret;
 
 	/* return any leftover data */
@@ -2595,32 +2396,13 @@ tracing_read_pipe(struct file *filp, cha
 	 * and then release the locks again.
 	 */
 
-	cpus_clear(mask);
-	local_irq_save(flags);
+	local_irq_disable();
 #ifdef CONFIG_FTRACE
 	ftrace_save = ftrace_enabled;
 	ftrace_enabled = 0;
 #endif
 	smp_wmb();
-	for_each_tracing_cpu(cpu) {
-		data = iter->tr->data[cpu];
-
-		if (!head_page(data) || !data->trace_idx)
-			continue;
-
-		atomic_inc(&data->disabled);
-		cpu_set(cpu, mask);
-	}
-
-	for_each_cpu_mask(cpu, mask) {
-		data = iter->tr->data[cpu];
-		__raw_spin_lock(&data->lock);
-
-		if (data->overrun > iter->last_overrun[cpu])
-			iter->overrun[cpu] +=
-				data->overrun - iter->last_overrun[cpu];
-		iter->last_overrun[cpu] = data->overrun;
-	}
+	ring_buffer_lock(iter->tr->buffer, &flags);
 
 	while (find_next_entry_inc(iter) != NULL) {
 		int ret;
@@ -2639,19 +2421,11 @@ tracing_read_pipe(struct file *filp, cha
 			break;
 	}
 
-	for_each_cpu_mask(cpu, mask) {
-		data = iter->tr->data[cpu];
-		__raw_spin_unlock(&data->lock);
-	}
-
-	for_each_cpu_mask(cpu, mask) {
-		data = iter->tr->data[cpu];
-		atomic_dec(&data->disabled);
-	}
+	ring_buffer_unlock(iter->tr->buffer, flags);
 #ifdef CONFIG_FTRACE
 	ftrace_enabled = ftrace_save;
 #endif
-	local_irq_restore(flags);
+	local_irq_enable();
 
 	/* Now copy what we have to the user */
 	sret = trace_seq_to_user(&iter->seq, ubuf, cnt);
@@ -2684,7 +2458,7 @@ tracing_entries_write(struct file *filp,
 {
 	unsigned long val;
 	char buf[64];
-	int i, ret;
+	int ret;
 
 	if (cnt >= sizeof(buf))
 		return -EINVAL;
@@ -2711,52 +2485,31 @@ tracing_entries_write(struct file *filp,
 		goto out;
 	}
 
-	if (val > global_trace.entries) {
-		long pages_requested;
-		unsigned long freeable_pages;
-
-		/* make sure we have enough memory before mapping */
-		pages_requested =
-			(val + (ENTRIES_PER_PAGE-1)) / ENTRIES_PER_PAGE;
-
-		/* account for each buffer (and max_tr) */
-		pages_requested *= tracing_nr_buffers * 2;
-
-		/* Check for overflow */
-		if (pages_requested < 0) {
-			cnt = -ENOMEM;
+	if (val != global_trace.entries) {
+		ret = ring_buffer_resize(global_trace.buffer, val);
+		if (ret < 0) {
+			cnt = ret;
 			goto out;
 		}
 
-		freeable_pages = determine_dirtyable_memory();
-
-		/* we only allow to request 1/4 of useable memory */
-		if (pages_requested >
-		    ((freeable_pages + tracing_pages_allocated) / 4)) {
-			cnt = -ENOMEM;
-			goto out;
-		}
-
-		while (global_trace.entries < val) {
-			if (trace_alloc_page()) {
-				cnt = -ENOMEM;
-				goto out;
+		ret = ring_buffer_resize(max_tr.buffer, val);
+		if (ret < 0) {
+			int r;
+			cnt = ret;
+			r = ring_buffer_resize(global_trace.buffer,
+					       global_trace.entries);
+			if (r < 0) {
+				/* AARGH! We are left with different
+				 * size max buffer!!!! */
+				WARN_ON(1);
+				tracing_disabled = 1;
 			}
-			/* double check that we don't go over the known pages */
-			if (tracing_pages_allocated > pages_requested)
-				break;
+			goto out;
 		}
 
-	} else {
-		/* include the number of entries in val (inc of page entries) */
-		while (global_trace.entries > val + (ENTRIES_PER_PAGE - 1))
-			trace_free_page();
+		global_trace.entries = val;
 	}
 
-	/* check integrity */
-	for_each_tracing_cpu(i)
-		check_pages(global_trace.data[i]);
-
 	filp->f_pos += cnt;
 
 	/* If check pages failed, return ENOMEM */
@@ -2930,190 +2683,41 @@ static __init void tracer_init_debugfs(v
 #endif
 }
 
-static int trace_alloc_page(void)
+__init static int tracer_alloc_buffers(void)
 {
 	struct trace_array_cpu *data;
-	struct page *page, *tmp;
-	LIST_HEAD(pages);
-	void *array;
-	unsigned pages_allocated = 0;
 	int i;
 
-	/* first allocate a page for each CPU */
-	for_each_tracing_cpu(i) {
-		array = (void *)__get_free_page(GFP_KERNEL);
-		if (array == NULL) {
-			printk(KERN_ERR "tracer: failed to allocate page"
-			       "for trace buffer!\n");
-			goto free_pages;
-		}
-
-		pages_allocated++;
-		page = virt_to_page(array);
-		list_add(&page->lru, &pages);
+	/* TODO: make the number of buffers hot pluggable with CPUS */
+	tracing_buffer_mask = cpu_possible_map;
 
-/* Only allocate if we are actually using the max trace */
-#ifdef CONFIG_TRACER_MAX_TRACE
-		array = (void *)__get_free_page(GFP_KERNEL);
-		if (array == NULL) {
-			printk(KERN_ERR "tracer: failed to allocate page"
-			       "for trace buffer!\n");
-			goto free_pages;
-		}
-		pages_allocated++;
-		page = virt_to_page(array);
-		list_add(&page->lru, &pages);
-#endif
+	global_trace.buffer = ring_buffer_alloc(trace_buf_size,
+						   TRACE_BUFFER_FLAGS);
+	if (!global_trace.buffer) {
+		printk(KERN_ERR "tracer: failed to allocate ring buffer!\n");
+		WARN_ON(1);
+		return 0;
 	}
-
-	/* Now that we successfully allocate a page per CPU, add them */
-	for_each_tracing_cpu(i) {
-		data = global_trace.data[i];
-		page = list_entry(pages.next, struct page, lru);
-		list_del_init(&page->lru);
-		list_add_tail(&page->lru, &data->trace_pages);
-		ClearPageLRU(page);
+	global_trace.entries = ring_buffer_size(global_trace.buffer);
 
 #ifdef CONFIG_TRACER_MAX_TRACE
-		data = max_tr.data[i];
-		page = list_entry(pages.next, struct page, lru);
-		list_del_init(&page->lru);
-		list_add_tail(&page->lru, &data->trace_pages);
-		SetPageLRU(page);
-#endif
-	}
-	tracing_pages_allocated += pages_allocated;
-	global_trace.entries += ENTRIES_PER_PAGE;
-
-	return 0;
-
- free_pages:
-	list_for_each_entry_safe(page, tmp, &pages, lru) {
-		list_del_init(&page->lru);
-		__free_page(page);
+	max_tr.buffer = ring_buffer_alloc(trace_buf_size,
+					     TRACE_BUFFER_FLAGS);
+	if (!max_tr.buffer) {
+		printk(KERN_ERR "tracer: failed to allocate max ring buffer!\n");
+		WARN_ON(1);
+		ring_buffer_free(global_trace.buffer);
+		return 0;
 	}
-	return -ENOMEM;
-}
-
-static int trace_free_page(void)
-{
-	struct trace_array_cpu *data;
-	struct page *page;
-	struct list_head *p;
-	int i;
-	int ret = 0;
-
-	/* free one page from each buffer */
-	for_each_tracing_cpu(i) {
-		data = global_trace.data[i];
-		p = data->trace_pages.next;
-		if (p == &data->trace_pages) {
-			/* should never happen */
-			WARN_ON(1);
-			tracing_disabled = 1;
-			ret = -1;
-			break;
-		}
-		page = list_entry(p, struct page, lru);
-		ClearPageLRU(page);
-		list_del(&page->lru);
-		tracing_pages_allocated--;
-		tracing_pages_allocated--;
-		__free_page(page);
-
-		tracing_reset(data);
-
-#ifdef CONFIG_TRACER_MAX_TRACE
-		data = max_tr.data[i];
-		p = data->trace_pages.next;
-		if (p == &data->trace_pages) {
-			/* should never happen */
-			WARN_ON(1);
-			tracing_disabled = 1;
-			ret = -1;
-			break;
-		}
-		page = list_entry(p, struct page, lru);
-		ClearPageLRU(page);
-		list_del(&page->lru);
-		__free_page(page);
-
-		tracing_reset(data);
+	max_tr.entries = ring_buffer_size(max_tr.buffer);
+	WARN_ON(max_tr.entries != global_trace.entries);
 #endif
-	}
-	global_trace.entries -= ENTRIES_PER_PAGE;
-
-	return ret;
-}
-
-__init static int tracer_alloc_buffers(void)
-{
-	struct trace_array_cpu *data;
-	void *array;
-	struct page *page;
-	int pages = 0;
-	int ret = -ENOMEM;
-	int i;
-
-	/* TODO: make the number of buffers hot pluggable with CPUS */
-	tracing_nr_buffers = num_possible_cpus();
-	tracing_buffer_mask = cpu_possible_map;
 
 	/* Allocate the first page for all buffers */
 	for_each_tracing_cpu(i) {
 		data = global_trace.data[i] = &per_cpu(global_trace_cpu, i);
 		max_tr.data[i] = &per_cpu(max_data, i);
-
-		array = (void *)__get_free_page(GFP_KERNEL);
-		if (array == NULL) {
-			printk(KERN_ERR "tracer: failed to allocate page"
-			       "for trace buffer!\n");
-			goto free_buffers;
-		}
-
-		/* set the array to the list */
-		INIT_LIST_HEAD(&data->trace_pages);
-		page = virt_to_page(array);
-		list_add(&page->lru, &data->trace_pages);
-		/* use the LRU flag to differentiate the two buffers */
-		ClearPageLRU(page);
-
-		data->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
-		max_tr.data[i]->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
-
-/* Only allocate if we are actually using the max trace */
-#ifdef CONFIG_TRACER_MAX_TRACE
-		array = (void *)__get_free_page(GFP_KERNEL);
-		if (array == NULL) {
-			printk(KERN_ERR "tracer: failed to allocate page"
-			       "for trace buffer!\n");
-			goto free_buffers;
-		}
-
-		INIT_LIST_HEAD(&max_tr.data[i]->trace_pages);
-		page = virt_to_page(array);
-		list_add(&page->lru, &max_tr.data[i]->trace_pages);
-		SetPageLRU(page);
-#endif
-	}
-
-	/*
-	 * Since we allocate by orders of pages, we may be able to
-	 * round up a bit.
-	 */
-	global_trace.entries = ENTRIES_PER_PAGE;
-	pages++;
-
-	while (global_trace.entries < trace_nr_entries) {
-		if (trace_alloc_page())
-			break;
-		pages++;
 	}
-	max_tr.entries = global_trace.entries;
-
-	pr_info("tracer: %d pages allocated for %ld entries of %ld bytes\n",
-		pages, trace_nr_entries, (long)TRACE_ENTRY_SIZE);
-	pr_info("   actual entries %ld\n", global_trace.entries);
 
 	tracer_init_debugfs();
 
@@ -3127,31 +2731,5 @@ __init static int tracer_alloc_buffers(v
 	tracing_disabled = 0;
 
 	return 0;
-
- free_buffers:
-	for (i-- ; i >= 0; i--) {
-		struct page *page, *tmp;
-		struct trace_array_cpu *data = global_trace.data[i];
-
-		if (data) {
-			list_for_each_entry_safe(page, tmp,
-						 &data->trace_pages, lru) {
-				list_del_init(&page->lru);
-				__free_page(page);
-			}
-		}
-
-#ifdef CONFIG_TRACER_MAX_TRACE
-		data = max_tr.data[i];
-		if (data) {
-			list_for_each_entry_safe(page, tmp,
-						 &data->trace_pages, lru) {
-				list_del_init(&page->lru);
-				__free_page(page);
-			}
-		}
-#endif
-	}
-	return ret;
 }
 fs_initcall(tracer_alloc_buffers);
Index: linux-compile.git/kernel/trace/trace.h
===================================================================
--- linux-compile.git.orig/kernel/trace/trace.h	2008-09-24 19:47:46.000000000 -0400
+++ linux-compile.git/kernel/trace/trace.h	2008-09-24 22:19:13.000000000 -0400
@@ -6,6 +6,7 @@
 #include <linux/sched.h>
 #include <linux/clocksource.h>
 #include <linux/mmiotrace.h>
+#include <linux/ring_buffer.h>
 
 enum trace_type {
 	__TRACE_FIRST_TYPE = 0,
@@ -72,7 +73,6 @@ struct trace_entry {
 	char			flags;
 	char			preempt_count;
 	int			pid;
-	cycle_t			t;
 	union {
 		struct ftrace_entry		fn;
 		struct ctx_switch_entry		ctx;
@@ -91,16 +91,9 @@ struct trace_entry {
  * the trace, etc.)
  */
 struct trace_array_cpu {
-	struct list_head	trace_pages;
 	atomic_t		disabled;
-	raw_spinlock_t		lock;
-	struct lock_class_key	lock_key;
 
 	/* these fields get copied into max-trace: */
-	unsigned		trace_head_idx;
-	unsigned		trace_tail_idx;
-	void			*trace_head; /* producer */
-	void			*trace_tail; /* consumer */
 	unsigned long		trace_idx;
 	unsigned long		overrun;
 	unsigned long		saved_latency;
@@ -124,6 +117,7 @@ struct trace_iterator;
  * They have on/off state as well:
  */
 struct trace_array {
+	struct ring_buffer	*buffer;
 	unsigned long		entries;
 	long			ctrl;
 	int			cpu;
@@ -171,26 +165,20 @@ struct trace_iterator {
 	struct trace_array	*tr;
 	struct tracer		*trace;
 	void			*private;
-	long			last_overrun[NR_CPUS];
-	long			overrun[NR_CPUS];
+	struct ring_buffer_iter	*buffer_iter[NR_CPUS];
 
 	/* The below is zeroed out in pipe_read */
 	struct trace_seq	seq;
 	struct trace_entry	*ent;
 	int			cpu;
-
-	struct trace_entry	*prev_ent;
-	int			prev_cpu;
+	u64			ts;
 
 	unsigned long		iter_flags;
 	loff_t			pos;
-	unsigned long		next_idx[NR_CPUS];
-	struct list_head	*next_page[NR_CPUS];
-	unsigned		next_page_idx[NR_CPUS];
 	long			idx;
 };
 
-void tracing_reset(struct trace_array_cpu *data);
+void tracing_reset(struct trace_array *tr, int cpu);
 int tracing_open_generic(struct inode *inode, struct file *filp);
 struct dentry *tracing_init_dentry(void);
 void init_tracer_sysprof_debugfs(struct dentry *d_tracer);
Index: linux-compile.git/kernel/trace/trace_functions.c
===================================================================
--- linux-compile.git.orig/kernel/trace/trace_functions.c	2008-09-24 19:47:46.000000000 -0400
+++ linux-compile.git/kernel/trace/trace_functions.c	2008-09-25 00:27:12.000000000 -0400
@@ -23,7 +23,7 @@ static void function_reset(struct trace_
 	tr->time_start = ftrace_now(tr->cpu);
 
 	for_each_online_cpu(cpu)
-		tracing_reset(tr->data[cpu]);
+		tracing_reset(tr, cpu);
 }
 
 static void start_function_trace(struct trace_array *tr)
Index: linux-compile.git/kernel/trace/trace_irqsoff.c
===================================================================
--- linux-compile.git.orig/kernel/trace/trace_irqsoff.c	2008-09-24 19:47:46.000000000 -0400
+++ linux-compile.git/kernel/trace/trace_irqsoff.c	2008-09-25 00:27:12.000000000 -0400
@@ -173,7 +173,7 @@ out_unlock:
 out:
 	data->critical_sequence = max_sequence;
 	data->preempt_timestamp = ftrace_now(cpu);
-	tracing_reset(data);
+	tracing_reset(tr, cpu);
 	trace_function(tr, data, CALLER_ADDR0, parent_ip, flags);
 }
 
@@ -203,7 +203,7 @@ start_critical_timing(unsigned long ip, 
 	data->critical_sequence = max_sequence;
 	data->preempt_timestamp = ftrace_now(cpu);
 	data->critical_start = parent_ip ? : ip;
-	tracing_reset(data);
+	tracing_reset(tr, cpu);
 
 	local_save_flags(flags);
 
@@ -234,7 +234,7 @@ stop_critical_timing(unsigned long ip, u
 
 	data = tr->data[cpu];
 
-	if (unlikely(!data) || unlikely(!head_page(data)) ||
+	if (unlikely(!data) ||
 	    !data->critical_start || atomic_read(&data->disabled))
 		return;
 
Index: linux-compile.git/kernel/trace/trace_mmiotrace.c
===================================================================
--- linux-compile.git.orig/kernel/trace/trace_mmiotrace.c	2008-09-24 19:47:46.000000000 -0400
+++ linux-compile.git/kernel/trace/trace_mmiotrace.c	2008-09-25 00:27:12.000000000 -0400
@@ -27,7 +27,7 @@ static void mmio_reset_data(struct trace
 	tr->time_start = ftrace_now(tr->cpu);
 
 	for_each_online_cpu(cpu)
-		tracing_reset(tr->data[cpu]);
+		tracing_reset(tr, cpu);
 }
 
 static void mmio_trace_init(struct trace_array *tr)
@@ -130,10 +130,14 @@ static unsigned long count_overruns(stru
 {
 	int cpu;
 	unsigned long cnt = 0;
+/* FIXME: */
+#if 0
 	for_each_online_cpu(cpu) {
 		cnt += iter->overrun[cpu];
 		iter->overrun[cpu] = 0;
 	}
+#endif
+	(void)cpu;
 	return cnt;
 }
 
@@ -176,7 +180,7 @@ static int mmio_print_rw(struct trace_it
 	struct trace_entry *entry = iter->ent;
 	struct mmiotrace_rw *rw	= &entry->mmiorw;
 	struct trace_seq *s	= &iter->seq;
-	unsigned long long t	= ns2usecs(entry->t);
+	unsigned long long t	= ns2usecs(iter->ts);
 	unsigned long usec_rem	= do_div(t, 1000000ULL);
 	unsigned secs		= (unsigned long)t;
 	int ret = 1;
@@ -218,7 +222,7 @@ static int mmio_print_map(struct trace_i
 	struct trace_entry *entry = iter->ent;
 	struct mmiotrace_map *m	= &entry->mmiomap;
 	struct trace_seq *s	= &iter->seq;
-	unsigned long long t	= ns2usecs(entry->t);
+	unsigned long long t	= ns2usecs(iter->ts);
 	unsigned long usec_rem	= do_div(t, 1000000ULL);
 	unsigned secs		= (unsigned long)t;
 	int ret = 1;
Index: linux-compile.git/kernel/trace/trace_sched_switch.c
===================================================================
--- linux-compile.git.orig/kernel/trace/trace_sched_switch.c	2008-09-24 19:47:46.000000000 -0400
+++ linux-compile.git/kernel/trace/trace_sched_switch.c	2008-09-25 00:27:12.000000000 -0400
@@ -133,7 +133,7 @@ static void sched_switch_reset(struct tr
 	tr->time_start = ftrace_now(tr->cpu);
 
 	for_each_online_cpu(cpu)
-		tracing_reset(tr->data[cpu]);
+		tracing_reset(tr, cpu);
 }
 
 static int tracing_sched_register(void)
Index: linux-compile.git/kernel/trace/trace_sched_wakeup.c
===================================================================
--- linux-compile.git.orig/kernel/trace/trace_sched_wakeup.c	2008-09-24 19:47:46.000000000 -0400
+++ linux-compile.git/kernel/trace/trace_sched_wakeup.c	2008-09-25 00:27:12.000000000 -0400
@@ -216,7 +216,7 @@ static void __wakeup_reset(struct trace_
 
 	for_each_possible_cpu(cpu) {
 		data = tr->data[cpu];
-		tracing_reset(data);
+		tracing_reset(tr, cpu);
 	}
 
 	wakeup_cpu = -1;

-- 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/2 v2] Unified trace buffer
  2008-09-25 15:58 ` [RFC PATCH 1/2 v2] Unified trace buffer Steven Rostedt
@ 2008-09-25 17:02   ` Linus Torvalds
  2008-09-25 17:16     ` Steven Rostedt
  2008-09-25 17:35     ` Mathieu Desnoyers
  0 siblings, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2008-09-25 17:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Steven Rostedt



On Thu, 25 Sep 2008, Steven Rostedt wrote:
> +
> +/**
> + * ring_buffer_event_length - return the length of the event
> + * @event: the event to get the length of
> + *
> + * Note, if the event is bigger than 256 bytes, the length
> + * can not be held in the shifted 5 bits. The length is then
> + * added as a short (unshifted) in the body.

The comment seems stale ;)

> +
> +/**
> + * ring_buffer_peek - peek at the next event to be read
> + * @iter: The ring buffer iterator
> + * @iter_next_cpu: The CPU that the next event belongs on
> + *
> + * This will return the event that will be read next, but does
> + * not increment the iterator.
> + */
> +struct ring_buffer_event *
> +ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
> +{
> +	struct ring_buffer_per_cpu *cpu_buffer;
> +	struct ring_buffer_event *event;
> +	u64 delta;
> +
> +	cpu_buffer = buffer->buffers[cpu];
> +
> + again:
> +	if (ring_buffer_per_cpu_empty(cpu_buffer))
> +		return NULL;
> +
> +	event = ring_buffer_head_event(cpu_buffer);
> +
> +	switch (event->type) {
> +	case RB_TYPE_PADDING:
> +		ring_buffer_inc_page(buffer, &cpu_buffer->head_page);
> +		rb_reset_read_page(cpu_buffer);
> +		goto again;
> +
> +	case RB_TYPE_TIME_EXTENT:
> +		delta = event->data;
> +		delta <<= TS_SHIFT;
> +		delta += event->time_delta;
> +		cpu_buffer->read_stamp += delta;
> +		goto again;
> +
> +	case RB_TYPE_TIME_STAMP:
> +		/* FIXME: not implemented */
> +		goto again;
> +
> +	case RB_TYPE_SMALL_DATA:
> +	case RB_TYPE_LARGE_DATA:
> +	case RB_TYPE_STRING:
> +		if (ts)
> +			*ts = cpu_buffer->read_stamp + event->time_delta;
> +		return event;

Your timestamp handling seems odd. You do it per-event, but I think it 
should happen for all events, ie just do

	*ts += event->time_delta;

_outside_ the case statement, and then in RB_TYPE_TIME_EXTENT you'd do 
either

 - relative:
	*ts += event->data << TS_SHIFT;

 - absolute timestamp events:
	*ts = (event->data << TS_SHIFT) + event->time_delta;

but the bigger issue is that I think the timestamp should be relative to 
the _previous_ event, not relative to the page start. IOW, you really 
should accumulate them. 

IOW, the base timestamp cannot be in the cpu_buffer, it needs to be in the 
iterator data structure, since it updates as you walk over it.

Otherwise the extended TSC format will be _horrible_. You don't want to 
add it in front of every event in the page just because you had a pause at 
the beginning of the page. You want to have a running update, so that you 
only need to add it after there was a pause.
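The running-update scheme Linus describes can be sketched in user space like this (a minimal sketch, not the kernel code: the struct layout follows the 27/5-bit header from the cover letter, and the names `rb_event`, `rb_advance_ts`, and the `RB_*` type values are illustrative stand-ins for whatever the final API looks like):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event header: 27-bit delta, 5-bit type, as in the
 * cover letter.  Type values are made up for this sketch. */
enum { RB_DATA = 0, RB_TIME_EXTENT = 1, RB_TIME_STAMP = 2 };
#define TS_SHIFT 27

struct rb_event {
	uint32_t time_delta:27, type:5;
	uint32_t data;
};

/* Keep the running timestamp in the iterator: every ordinary event
 * advances *ts by its delta; a relative TIME_EXTENT adds a large
 * gap; an absolute TIME_STAMP resynchronizes *ts completely. */
static void rb_advance_ts(const struct rb_event *event, uint64_t *ts)
{
	switch (event->type) {
	case RB_TIME_EXTENT:	/* relative: pause since previous event */
		*ts += ((uint64_t)event->data << TS_SHIFT) + event->time_delta;
		break;
	case RB_TIME_STAMP:	/* absolute: full timestamp resync */
		*ts = ((uint64_t)event->data << TS_SHIFT) + event->time_delta;
		break;
	default:		/* ordinary data event */
		*ts += event->time_delta;
		break;
	}
}
```

With this, an ordinary event costs only its 27-bit delta, and an extended or absolute timestamp record needs to be emitted only after a pause, rather than in front of every event on the page.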

		Linus


* Re: [RFC PATCH 1/2 v2] Unified trace buffer
  2008-09-25 17:02   ` Linus Torvalds
@ 2008-09-25 17:16     ` Steven Rostedt
  2008-09-25 17:25       ` Linus Torvalds
  2008-09-25 17:35     ` Mathieu Desnoyers
  1 sibling, 1 reply; 10+ messages in thread
From: Steven Rostedt @ 2008-09-25 17:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Steven Rostedt


On Thu, 25 Sep 2008, Linus Torvalds wrote:
> On Thu, 25 Sep 2008, Steven Rostedt wrote:
> > +
> > +/**
> > + * ring_buffer_event_length - return the length of the event
> > + * @event: the event to get the length of
> > + *
> > + * Note, if the event is bigger than 256 bytes, the length
> > + * can not be held in the shifted 5 bits. The length is then
> > + * added as a short (unshifted) in the body.
> 
> The comment seems stale ;)

hehe, this code is in so much flux that I gave up on keeping the
comments up to date.

> 
> > +
> > +/**
> > + * ring_buffer_peek - peek at the next event to be read
> > + * @iter: The ring buffer iterator
> > + * @iter_next_cpu: The CPU that the next event belongs on
> > + *
> > + * This will return the event that will be read next, but does
> > + * not increment the iterator.
> > + */
> > +struct ring_buffer_event *
> > +ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
> > +{
> > +	struct ring_buffer_per_cpu *cpu_buffer;
> > +	struct ring_buffer_event *event;
> > +	u64 delta;
> > +
> > +	cpu_buffer = buffer->buffers[cpu];
> > +
> > + again:
> > +	if (ring_buffer_per_cpu_empty(cpu_buffer))
> > +		return NULL;
> > +
> > +	event = ring_buffer_head_event(cpu_buffer);
> > +
> > +	switch (event->type) {
> > +	case RB_TYPE_PADDING:
> > +		ring_buffer_inc_page(buffer, &cpu_buffer->head_page);
> > +		rb_reset_read_page(cpu_buffer);
> > +		goto again;
> > +
> > +	case RB_TYPE_TIME_EXTENT:
> > +		delta = event->data;
> > +		delta <<= TS_SHIFT;
> > +		delta += event->time_delta;
> > +		cpu_buffer->read_stamp += delta;
> > +		goto again;
> > +
> > +	case RB_TYPE_TIME_STAMP:
> > +		/* FIXME: not implemented */
> > +		goto again;
> > +
> > +	case RB_TYPE_SMALL_DATA:
> > +	case RB_TYPE_LARGE_DATA:
> > +	case RB_TYPE_STRING:
> > +		if (ts)
> > +			*ts = cpu_buffer->read_stamp + event->time_delta;
> > +		return event;
> 
> Your timestamp handling seems odd. You do it per-event, but I think it 
> should happen for all events, ie just do
> 
> 	*ts += event->time_delta;
> 
> _outside_ the case statement, and then in RB_TYPE_TIME_EXTENT you'd do 
> either
> 
>  - relative:
> 	*ts += event->data << TS_SHIFT;
> 
>  - absolute timestamp events:
> 	*ts = (event->data << TS_SHIFT) + event->time_delta;
> 
> but the bigger issue is that I think the timestamp should be relative to 
> the _previous_ event, not relative to the page start. IOW, you really 
> should accumulate them. 
> 
> IOW, the base timestamp cannot be in the cpu_buffer, it needs to be in the 
> iterator data structure, since it updates as you walk over it.
> 
> Otherwise the extended TSC format will be _horrible_. You don't want to 
> add it in front of every event in the page just because you had a pause at 
> the beginning of the page. You want to have a running update, so that you 
> only need to add it after there was a pause.

The problem with this is overwrite mode, which is the only mode ftrace 
currently offers. What happens when your writer starts overwriting the 
ring buffer and there is no reader?

What happens is that the start value is gone. You do not have a way to use 
all the deltas to catch up to the remaining events.

-- Steve


* Re: [RFC PATCH 1/2 v2] Unified trace buffer
  2008-09-25 17:16     ` Steven Rostedt
@ 2008-09-25 17:25       ` Linus Torvalds
  2008-09-25 17:46         ` Steven Rostedt
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2008-09-25 17:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Steven Rostedt



On Thu, 25 Sep 2008, Steven Rostedt wrote:
> 
> The problem with this is overwrite mode, which is the only mode ftrace 
> currently offers. What happens when your writer starts overwriting the 
> ring buffer and there is no reader?

Overwrite things one page at a time. Don't you already do that? (I didn't 
check that closely, I just assumed you would do the _much_ simpler "move 
the head to the next page" thing rather than trying to mix head and tail 
on the same page.)

> What happens is that the start value is gone. You do not have a way to use 
> all the deltas to catch up to the remaining events.

Use the page start date for the first event in a page. But within pages, 
make everything depend on previous event.

		Linus


* Re: [RFC PATCH 1/2 v2] Unified trace buffer
  2008-09-25 17:02   ` Linus Torvalds
  2008-09-25 17:16     ` Steven Rostedt
@ 2008-09-25 17:35     ` Mathieu Desnoyers
  2008-09-25 17:48       ` Steven Rostedt
  1 sibling, 1 reply; 10+ messages in thread
From: Mathieu Desnoyers @ 2008-09-25 17:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, prasad, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Steven Rostedt

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
[...]
> but the bigger issue is that I think the timestamp should be relative to 
> the _previous_ event, not relative to the page start. IOW, you really 
> should accumulate them. 
> 

How about keeping the timestamps absolute? (but just keep the 27 LSBs)

It would help resynchronize the timestamps if an event is lost, and
would not accumulate error over and over. It would even help detect
bugs in the tracer by checking whether timestamps go backward.

Also, it would remove inter-dependency between consecutive events; we
would not have to know "for sure" what the previous timestamp was when
we write the current event. Just knowing if we need to write the full
TSC is enough (which implies knowing an upper bound), which is a much
more relaxed constraint than having to know the _exact_ previous
timestamp.

Is there a reason to use a delta between events, rather than simply
writing the 27 LSBs, that I may have missed?

Mathieu

> IOW, the base timestamp cannot be in the cpu_buffer, it needs to be in the 
> iterator data structure, since it updates as you walk over it.
> 
> Otherwise the extended TSC format will be _horrible_. You don't want to 
> add it in front of every event in the page just because you had a pause at 
> the beginning of the page. You want to have a running update, so that you 
> only need to add it after there was a pause.
> 
> 		Linus
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH 1/2 v2] Unified trace buffer
  2008-09-25 17:25       ` Linus Torvalds
@ 2008-09-25 17:46         ` Steven Rostedt
  0 siblings, 0 replies; 10+ messages in thread
From: Steven Rostedt @ 2008-09-25 17:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Steven Rostedt



On Thu, 25 Sep 2008, Linus Torvalds wrote:
> On Thu, 25 Sep 2008, Steven Rostedt wrote:
> > 
> > The problem with this is overwrite mode, which is the only mode ftrace 
> > currently offers. What happens when your writer starts overwriting the 
> > ring buffer and there is no reader?
> 
> Overwrite things one page at a time. Don't you already do that? (I didn't 
> check that closely, I just assumed you would do the _much_ simpler "move 
> the head to the next page" thing rather than trying to mix head and tail 
> on the same page.)

Yeah, I decided to blast the page instead of just blasting the entry.
This keeps the complexity a hell of a lot lower.

> 
> > What happens is that the start value is gone. You do not have a way to use 
> > all the deltas to catch up to the remaining events.
> 
> Use the page start date for the first event in a page. But within pages, 
> make everything depend on previous event.

Hmm, I'm confused; I already do this. I put the page start time on each
page, and the events on the page are based on that page start time. The
iterator, or writer, keeps track of the last event. When it reaches a new
page, it reads the new time stamp of the page and starts incrementing
against that.

-- Steve



* Re: [RFC PATCH 1/2 v2] Unified trace buffer
  2008-09-25 17:35     ` Mathieu Desnoyers
@ 2008-09-25 17:48       ` Steven Rostedt
  2008-09-25 18:25         ` Mathieu Desnoyers
  0 siblings, 1 reply; 10+ messages in thread
From: Steven Rostedt @ 2008-09-25 17:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, prasad, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Steven Rostedt


On Thu, 25 Sep 2008, Mathieu Desnoyers wrote:
> 
> Is there a reason to use delta between events rather than simply write
> the 27 LSBs that I would have missed ?

One answer is that it extends your counter-wrap window.

That is, you have 27 bits of time between each event before you need to
worry about wraps. But if you measure against the page itself, the last
events on that page are more likely to suffer.
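To put a rough number on that window (assuming, hypothetically, a 1 GHz clock at 1 tick per nanosecond; the real resolution depends on the clock source):

```c
#include <assert.h>
#include <stdint.h>

/* Largest gap representable in a 27-bit delta field, in nanoseconds,
 * for a clock advancing ticks_per_ns counts per nanosecond.  With a
 * hypothetical 1 tick/ns clock this is 2^27 - 1 ns, roughly 134 ms
 * between events before an extended timestamp record is needed. */
static uint64_t delta_window_ns(uint64_t ticks_per_ns)
{
	return ((1ULL << 27) - 1) / ticks_per_ns;
}
```

Measuring each delta against the previous event means only a genuinely quiet gap needs the extension record; measuring against the page header would push every late event on a long-lived page over the field's limit.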

-- Steve


* Re: [RFC PATCH 1/2 v2] Unified trace buffer
  2008-09-25 17:48       ` Steven Rostedt
@ 2008-09-25 18:25         ` Mathieu Desnoyers
  0 siblings, 0 replies; 10+ messages in thread
From: Mathieu Desnoyers @ 2008-09-25 18:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, prasad, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Steven Rostedt

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Thu, 25 Sep 2008, Mathieu Desnoyers wrote:
> > 
> > Is there a reason to use delta between events rather than simply write
> > the 27 LSBs that I would have missed ?
> 
> One answer is that your counter wrap problem is extended.
> 
> That is, you have 27 bits of time between each event to not worry about
> wraps. But if you go against the page itself, the last event on that page
> is more likely to suffer.
> 

You can do the exact same thing and manage to keep the absolute time.
You just have to adapt the reader like this (this would be for a
per-event cycle count in the 32 LSBs; a slight bitmask adaptation is
needed for 27 bits):

Keep a 64-bit TSC value in a per-buffer variable. The previous value is
always re-used for the next read.

Let's call it:
tf->buffer.tsc  (u64)

read_event() would look like:

u32 timestamp = read_event_timestamp();

if (timestamp < (0xFFFFFFFFULL & tf->buffer.tsc)) {
    /* overflow */
    tf->buffer.tsc = ((tf->buffer.tsc & 0xFFFFFFFF00000000ULL)
                     + 0x100000000ULL) | (u64)timestamp;
} else {
    /* no overflow */
    tf->buffer.tsc = (tf->buffer.tsc & 0xFFFFFFFF00000000ULL)
                     | (u64)timestamp;
}

This will detect a 32-bit overflow and keep tf->buffer.tsc in sync
with the TSC representation on the traced machine, as long as events
are less than 2^32 cycles apart. A "full tsc" header can also be easily
managed with this by updating the tf->buffer.tsc value completely when
such an event is met.
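The pseudocode above compiles almost as-is; here is a self-contained sketch of that reader-side extension, with tf->buffer.tsc folded into an explicit argument (the function name is made up for this example):

```c
#include <stdint.h>

/* Extend a 32-bit per-event cycle count into a full 64-bit TSC,
 * given the 64-bit value reconstructed for the previous event.
 * If the 32 LSBs went backward, the hardware counter must have
 * wrapped, so carry into the upper 32 bits. */
static uint64_t extend_tsc(uint64_t prev_tsc, uint32_t timestamp)
{
	uint64_t upper = prev_tsc & 0xFFFFFFFF00000000ULL;

	if (timestamp < (uint32_t)prev_tsc)	/* 32-bit counter wrapped */
		upper += 0x100000000ULL;

	return upper | timestamp;
}
```

Because only the reader does this bookkeeping, the writer never needs to know the exact previous timestamp, which is the relaxed constraint argued for above.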

Mathieu

> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
