public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC 0/2] perf: add AUX space to ring_buffer
@ 2014-05-15 15:08 Alexander Shishkin
  2014-05-15 15:08 ` [RFC 1/2] perf: add data_{offset,size} to user_page Alexander Shishkin
  2014-05-15 15:08 ` [RFC 2/2] perf: add AUX area to ring buffer for raw data streams Alexander Shishkin
  0 siblings, 2 replies; 11+ messages in thread
From: Alexander Shishkin @ 2014-05-15 15:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Alexander Shishkin

Hi Peter,

I tried to bend my head around the AUX space and this is where I am at
the moment. Not posting the whole itrace/PT bulk yet, to make sure
this part is ok and to save bandwidth.

Alexander Shishkin (1):
  perf: add data_{offset,size} to user_page

Peter Zijlstra (1):
  perf: add AUX area to ring buffer for raw data streams

 include/linux/perf_event.h      |  13 ++++
 include/uapi/linux/perf_event.h |  33 ++++++++++
 kernel/events/core.c            | 132 ++++++++++++++++++++++++++++++++--------
 kernel/events/internal.h        |  16 +++++
 kernel/events/ring_buffer.c     |  53 ++++++++++++++--
 5 files changed, 217 insertions(+), 30 deletions(-)

-- 
2.0.0.rc2


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 1/2] perf: add data_{offset,size} to user_page
  2014-05-15 15:08 [RFC 0/2] perf: add AUX space to ring_buffer Alexander Shishkin
@ 2014-05-15 15:08 ` Alexander Shishkin
  2014-05-15 18:14   ` Robert Richter
  2014-05-15 15:08 ` [RFC 2/2] perf: add AUX area to ring buffer for raw data streams Alexander Shishkin
  1 sibling, 1 reply; 11+ messages in thread
From: Alexander Shishkin @ 2014-05-15 15:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Alexander Shishkin

Currently, the actual perf ring buffer is one page into the mmap area,
following the user page and the userspace follows this convention. This
patch adds data_{offset,size} fields to user_page that can be used by
userspace instead for locating perf data in the mmap area.

Right now, it is made to follow the existing convention that

	data_offset == PAGE_SIZE and
	data_offset + data_size == mmap_size.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/uapi/linux/perf_event.h | 5 +++++
 kernel/events/core.c            | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 853bc1c..ef95af4 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -488,9 +488,14 @@ struct perf_event_mmap_page {
 	 * In this case the kernel will not over-write unread data.
 	 *
 	 * See perf_output_put_handle() for the data ordering.
+	 *
+	 * data_{offset,size} indicate the location and size of the perf record
+	 * buffer within the mmapped area.
 	 */
 	__u64   data_head;		/* head in the data section */
 	__u64	data_tail;		/* user-space written tail */
+	__u64	data_offset;		/* where the buffer starts */
+	__u64	data_size;		/* data buffer size */
 };
 
 #define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0b9309b..bbe8b48 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3768,6 +3768,8 @@ static void perf_event_init_userpage(struct perf_event *event)
 	/* Allow new userspace to detect that bit 0 is deprecated */
 	userpg->cap_bit0_is_deprecated = 1;
 	userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
+	userpg->data_offset = PAGE_SIZE;
+	userpg->data_size = perf_data_size(rb);
 
 unlock:
 	rcu_read_unlock();
-- 
2.0.0.rc2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC 2/2] perf: add AUX area to ring buffer for raw data streams
  2014-05-15 15:08 [RFC 0/2] perf: add AUX space to ring_buffer Alexander Shishkin
  2014-05-15 15:08 ` [RFC 1/2] perf: add data_{offset,size} to user_page Alexander Shishkin
@ 2014-05-15 15:08 ` Alexander Shishkin
  2014-05-15 18:02   ` Robert Richter
  2014-05-19  8:58   ` Peter Zijlstra
  1 sibling, 2 replies; 11+ messages in thread
From: Alexander Shishkin @ 2014-05-15 15:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Peter Zijlstra,
	Alexander Shishkin

From: Peter Zijlstra <peterz@infradead.org>

This patch introduces "AUX space" in the perf mmap buffer, intended for
exporting high bandwidth data streams to userspace, such as instruction
flow traces.

AUX space is a ring buffer, defined by aux_{offset,size} fields in the
user_page structure, and read/write pointers aux_{head,tail}, which abide
by the same rules as data_* counterparts of the main perf buffer.

In order to allocate/mmap AUX, userspace needs to set up aux_offset to
such an offset that will be greater than data_offset+data_size and
aux_size to be the desired buffer size. Both need to be page aligned.
The latter is not forced to be a power of 2 number of pages, so that PMU
drivers have to take care of this if necessary in their implementations
of ->alloc_aux(). Then, same aux_offset and aux_size should be passed to
mmap() call and if everything adds up, you should have an AUX buffer as
a result.

Pages that are mapped into this buffer also come out of user's mlock
rlimit plus perf_event_mlock_kb allowance.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      |  13 ++++
 include/uapi/linux/perf_event.h |  28 +++++++++
 kernel/events/core.c            | 130 ++++++++++++++++++++++++++++++++--------
 kernel/events/internal.h        |  16 +++++
 kernel/events/ring_buffer.c     |  53 ++++++++++++++--
 5 files changed, 210 insertions(+), 30 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index af6dcf1..258c5f0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -252,6 +252,19 @@ struct pmu {
 	 * flush branch stack on context-switches (needed in cpu-wide mode)
 	 */
 	void (*flush_branch_stack)	(void);
+
+	/*
+	 * Allocate AUX space buffer: return an array of @nr_pages pages to be
+	 * mapped to userspace that will also be passed to ->free_aux.
+	 */
+	void *(*alloc_aux)		(int cpu, int nr_pages, bool overwrite,
+					 struct perf_event_mmap_page *user_page);
+					/* optional */
+
+	/*
+	 * Free AUX buffer
+	 */
+	void (*free_aux)		(void *aux); /* optional */
 };
 
 /**
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ef95af4..2509c93 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -496,6 +496,22 @@ struct perf_event_mmap_page {
 	__u64	data_tail;		/* user-space written tail */
 	__u64	data_offset;		/* where the buffer starts */
 	__u64	data_size;		/* data buffer size */
+
+	/*
+	 * AUX area is defined by aux_{offset,size} fields that should be set
+	 * by the userspace, so that
+	 *
+	 *   aux_offset >= data_offset + data_size
+	 *
+	 * prior to mmap()ing it. Size of the mmap()ed area should be aux_size.
+	 *
+	 * Ring buffer pointers aux_{head,tail} have the same semantics as
+	 * data_{head,tail} and same ordering rules apply.
+	 */
+	__u64	aux_head;
+	__u64	aux_tail;
+	__u64	aux_offset;
+	__u64	aux_size;
 };
 
 #define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
@@ -710,6 +726,18 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_MMAP2			= 10,
 
+	/*
+	 * Records that new data landed in the AUX buffer part.
+	 *
+	 * struct {
+	 * 	struct perf_event_header	header;
+	 *
+	 * 	u64				aux_offset;
+	 * 	u64				aux_size;
+	 * };
+	 */
+	PERF_RECORD_AUX				= 11,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bbe8b48..2df4781 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3969,6 +3969,20 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	int mmap_locked = rb->mmap_locked;
 	unsigned long size = perf_data_size(rb);
 
+	/*
+	 * rb->aux_mmap_count will always drop before rb->mmap_count and
+	 * event->mmap_count, so it is ok to use event->mmap_mutex to
+	 * serialize with perf_mmap here.
+	 */
+	if (rb_has_aux(rb) && vma->vm_pgoff == rb->aux_pgoff &&
+	    atomic_dec_and_mutex_lock(&rb->aux_mmap_count, &event->mmap_mutex)) {
+		atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm);
+		vma->vm_mm->pinned_vm -= rb->aux_mmap_locked;
+
+		rb_free_aux(rb);
+		mutex_unlock(&event->mmap_mutex);
+	}
+
 	atomic_dec(&rb->mmap_count);
 
 	if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
@@ -4047,7 +4061,7 @@ again:
 
 static const struct vm_operations_struct perf_mmap_vmops = {
 	.open		= perf_mmap_open,
-	.close		= perf_mmap_close,
+	.close		= perf_mmap_close, /* non mergable */
 	.fault		= perf_mmap_fault,
 	.page_mkwrite	= perf_mmap_fault,
 };
@@ -4058,10 +4072,10 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	unsigned long user_locked, user_lock_limit;
 	struct user_struct *user = current_user();
 	unsigned long locked, lock_limit;
-	struct ring_buffer *rb;
+	struct ring_buffer *rb = NULL;
 	unsigned long vma_size;
 	unsigned long nr_pages;
-	long user_extra, extra;
+	long user_extra = 0, extra = 0;
 	int ret = 0, flags = 0;
 
 	/*
@@ -4076,7 +4090,63 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 		return -EINVAL;
 
 	vma_size = vma->vm_end - vma->vm_start;
-	nr_pages = (vma_size / PAGE_SIZE) - 1;
+
+	if (vma->vm_pgoff == 0) {
+		nr_pages = (vma_size / PAGE_SIZE) - 1;
+	} else {
+		/*
+		 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
+		 * mapped, all subsequent mappings should have the same size
+		 * and offset. Must be above the normal perf buffer.
+		 */
+		u64 aux_offset, aux_size;
+
+		if (!event->rb)
+			return -EINVAL;
+
+		nr_pages = vma_size / PAGE_SIZE;
+
+		mutex_lock(&event->mmap_mutex);
+		ret = -EINVAL;
+
+		rb = event->rb;
+		if (!rb)
+			goto aux_unlock;
+
+		aux_offset = ACCESS_ONCE(rb->user_page->aux_offset);
+		aux_size = ACCESS_ONCE(rb->user_page->aux_size);
+
+		if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
+			goto aux_unlock;
+
+		if (aux_offset != vma->vm_pgoff << PAGE_SHIFT)
+			goto aux_unlock;
+
+		/* already mapped with a different offset */
+		if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
+			goto aux_unlock;
+
+		if (aux_size != vma_size || aux_size != nr_pages * PAGE_SIZE)
+			goto aux_unlock;
+
+		/* already mapped with a different size */
+		if (rb_has_aux(rb) && rb->aux_nr_pages != nr_pages)
+			goto aux_unlock;
+
+		if (!atomic_inc_not_zero(&rb->mmap_count))
+			goto aux_unlock;
+
+		if (rb_has_aux(rb)) {
+			atomic_inc(&rb->aux_mmap_count);
+			ret = 0;
+			goto unlock;
+		}
+
+		atomic_set(&rb->aux_mmap_count, 1);
+		user_extra = nr_pages;
+
+		goto accounting;
+	}
 
 	/*
 	 * If we have rb pages ensure they're a power-of-two number, so we
@@ -4088,9 +4158,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	if (vma_size != PAGE_SIZE * (1 + nr_pages))
 		return -EINVAL;
 
-	if (vma->vm_pgoff != 0)
-		return -EINVAL;
-
 	WARN_ON_ONCE(event->ctx->parent_ctx);
 again:
 	mutex_lock(&event->mmap_mutex);
@@ -4114,6 +4181,8 @@ again:
 	}
 
 	user_extra = nr_pages + 1;
+
+accounting:
 	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
 
 	/*
@@ -4123,7 +4192,6 @@ again:
 
 	user_locked = atomic_long_read(&user->locked_vm) + user_extra;
 
-	extra = 0;
 	if (user_locked > user_lock_limit)
 		extra = user_locked - user_lock_limit;
 
@@ -4137,36 +4205,46 @@ again:
 		goto unlock;
 	}
 
-	WARN_ON(event->rb);
+	WARN_ON(!rb && event->rb);
 
 	if (vma->vm_flags & VM_WRITE)
 		flags |= RING_BUFFER_WRITABLE;
 
-	rb = rb_alloc(nr_pages, 
-		event->attr.watermark ? event->attr.wakeup_watermark : 0,
-		event->cpu, flags);
-
 	if (!rb) {
-		ret = -ENOMEM;
-		goto unlock;
-	}
+		rb = rb_alloc(nr_pages,
+			      event->attr.watermark ? event->attr.wakeup_watermark : 0,
+			      event->cpu, flags);
 
-	atomic_set(&rb->mmap_count, 1);
-	rb->mmap_locked = extra;
-	rb->mmap_user = get_current_user();
+		if (!rb) {
+			ret = -ENOMEM;
+			goto unlock;
+		}
 
-	atomic_long_add(user_extra, &user->locked_vm);
-	vma->vm_mm->pinned_vm += extra;
+		atomic_set(&rb->mmap_count, 1);
+		rb->mmap_user = get_current_user();
+		rb->mmap_locked = extra;
 
-	ring_buffer_attach(event, rb);
-	rcu_assign_pointer(event->rb, rb);
+		ring_buffer_attach(event, rb);
+		rcu_assign_pointer(event->rb, rb);
 
-	perf_event_init_userpage(event);
-	perf_event_update_userpage(event);
+		perf_event_init_userpage(event);
+		perf_event_update_userpage(event);
+	} else {
+		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages, flags);
+		if (ret)
+			atomic_dec(&rb->mmap_count);
+		else
+			rb->aux_mmap_locked = extra;
+	}
 
 unlock:
-	if (!ret)
+	if (!ret) {
+		atomic_long_add(user_extra, &user->locked_vm);
+		vma->vm_mm->pinned_vm += extra;
+
 		atomic_inc(&event->mmap_count);
+	}
+aux_unlock:
 	mutex_unlock(&event->mmap_mutex);
 
 	/*
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b2187..29cecc7 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -35,6 +35,14 @@ struct ring_buffer {
 	unsigned long			mmap_locked;
 	struct user_struct		*mmap_user;
 
+	/* AUX area */
+	unsigned long			aux_pgoff;
+	int				aux_nr_pages;
+	atomic_t			aux_mmap_count;
+	unsigned long			aux_mmap_locked;
+	void				**aux_pages;
+	void				(*free_aux)(void *aux);
+
 	struct perf_event_mmap_page	*user_page;
 	void				*data_pages[0];
 };
@@ -43,6 +51,14 @@ extern void rb_free(struct ring_buffer *rb);
 extern struct ring_buffer *
 rb_alloc(int nr_pages, long watermark, int cpu, int flags);
 extern void perf_event_wakeup(struct perf_event *event);
+extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+			pgoff_t pgoff, int nr_pages, int flags);
+extern void rb_free_aux(struct ring_buffer *rb);
+
+static inline bool rb_has_aux(struct ring_buffer *rb)
+{
+	return !!rb->aux_nr_pages;
+}
 
 extern void
 perf_event_header__init_id(struct perf_event_header *header,
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 146a579..8df2a77 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,14 +242,43 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 	spin_lock_init(&rb->event_lock);
 }
 
+int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+		 pgoff_t pgoff, int nr_pages, int flags)
+{
+	bool overwrite = !!(flags & RING_BUFFER_WRITABLE);
+
+	if (!event->pmu->alloc_aux)
+		return -ENOTSUPP;
+
+	rb->aux_pages = event->pmu->alloc_aux(event->cpu, nr_pages, overwrite,
+					      rb->user_page);
+	if (!rb->aux_pages)
+		return -ENOMEM;
+
+	rb->free_aux = event->pmu->free_aux;
+	rb->aux_pgoff = pgoff;
+	rb->aux_nr_pages = nr_pages;
+
+	return 0;
+}
+
+void rb_free_aux(struct ring_buffer *rb)
+{
+	if (WARN_ON_ONCE(!rb->free_aux))
+		return;
+
+	rb->free_aux(rb->aux_pages);
+	rb->aux_nr_pages = 0;
+}
+
 #ifndef CONFIG_PERF_USE_VMALLOC
 
 /*
  * Back perf_mmap() with regular GFP_KERNEL-0 pages.
  */
 
-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
 {
 	if (pgoff > rb->nr_pages)
 		return NULL;
@@ -339,8 +368,8 @@ static int data_page_nr(struct ring_buffer *rb)
 	return rb->nr_pages << page_order(rb);
 }
 
-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
 {
 	/* The '>' counts in the user page. */
 	if (pgoff > data_page_nr(rb))
@@ -415,3 +444,19 @@ fail:
 }
 
 #endif
+
+struct page *
+perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+{
+	if (rb->aux_nr_pages) {
+		/* above AUX space */
+		if (pgoff > rb->aux_pgoff + rb->aux_nr_pages)
+			return NULL;
+
+		/* AUX space */
+		if (pgoff >= rb->aux_pgoff)
+			return virt_to_page(rb->aux_pages[pgoff - rb->aux_pgoff]);
+	}
+
+	return __perf_mmap_to_page(rb, pgoff);
+}
-- 
2.0.0.rc2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] perf: add AUX area to ring buffer for raw data streams
  2014-05-15 15:08 ` [RFC 2/2] perf: add AUX area to ring buffer for raw data streams Alexander Shishkin
@ 2014-05-15 18:02   ` Robert Richter
  2014-05-16  7:07     ` Alexander Shishkin
  2014-05-19  8:58   ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Robert Richter @ 2014-05-15 18:02 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	Peter Zijlstra

On 15.05.14 18:08:30, Alexander Shishkin wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> This patch introduces "AUX space" in the perf mmap buffer, intended for
> exporting high bandwidth data streams to userspace, such as instruction
> flow traces.
> 
> AUX space is a ring buffer, defined by aux_{offset,size} fields in the
> user_page structure, and read/write pointers aux_{head,tail}, which abide
> by the same rules as data_* counterparts of the main perf buffer.
> 
> In order to allocate/mmap AUX, userspace needs to set up aux_offset to
> such an offset that will be greater than data_offset+data_size and
> aux_size to be the desired buffer size. Both need to be page aligned.
> The latter is not forced to be a power of 2 number of pages, so that PMU
> drivers have to take care of this if necessary in their implementations
> of ->alloc_aux(). Then, same aux_offset and aux_size should be passed to
> mmap() call and if everything adds up, you should have an AUX buffer as
> a result.

Raw data sample (PERF_SAMPLE_RAW) already allow to pass data to
userspace. What is the reason for a separate buffer? I see the
following:

 * the buffer accessed from userspace should have a customized format
   unrelated to perf,

 * zero copying (not sure if this is really possible),

 * data size and alignment requirements that can not be handled by raw
   data samples.

If possible, raw data samples should be a first option.

On the other side there might the need of an aux buffer (I would
better call it raw buffer analog to raw data). So here some comments
on the implementation and the user i/f.

Instead of extending the current ringbuffer I would better add a 2nd
ring buffer. The mmap and ringbuffer handling and esp. the access from
userspace looks much more complex now and it was already complex
before. So with a 2nd buffer there would be less changes in the
implementation and handling. Access from userland would remain as it
currently is, there is no need to modify the header page layout. To
setup a 2nd buffer, you just need to modify the mmap call in a way
(maybe by prot, flags or offset argument) to let the kernel know an
aux buffer should be attached to the event.

Then, instead of introducing a new PERF_RECORD_AUX, just reuse
PERF_RECORD_SAMPLE. If an aux/raw buffer is connected to the event,
then the data is not part of the sample but passed to the raw buffer.

Tools, interfaces and helper functions would work already with such
raw samples, just the raw buffer handling need to be added.

How about this?

-Robert

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/2] perf: add data_{offset,size} to user_page
  2014-05-15 15:08 ` [RFC 1/2] perf: add data_{offset,size} to user_page Alexander Shishkin
@ 2014-05-15 18:14   ` Robert Richter
  0 siblings, 0 replies; 11+ messages in thread
From: Robert Richter @ 2014-05-15 18:14 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	Borislav Petkov

On 15.05.14 18:08:29, Alexander Shishkin wrote:
> @@ -488,9 +488,14 @@ struct perf_event_mmap_page {
>  	 * In this case the kernel will not over-write unread data.
>  	 *
>  	 * See perf_output_put_handle() for the data ordering.
> +	 *
> +	 * data_{offset,size} indicate the location and size of the perf record
> +	 * buffer within the mmapped area.
>  	 */
>  	__u64   data_head;		/* head in the data section */
>  	__u64	data_tail;		/* user-space written tail */
> +	__u64	data_offset;		/* where the buffer starts */
> +	__u64	data_size;		/* data buffer size */
>  };

Yeah, we need something like this also for other reasons. For
persistent events we don't know the buffer size when mmaping
existing/shared buffers. To determine the buffer size we could map the
first page only, read the size and then mremap() to the actual size.

Please keep this use case in mind too, also for the aux buffer. The
best would be if data_size is the mmap size.

Thanks,

-Robert

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] perf: add AUX area to ring buffer for raw data streams
  2014-05-15 18:02   ` Robert Richter
@ 2014-05-16  7:07     ` Alexander Shishkin
  0 siblings, 0 replies; 11+ messages in thread
From: Alexander Shishkin @ 2014-05-16  7:07 UTC (permalink / raw)
  To: Robert Richter
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	Peter Zijlstra

Robert Richter <rric@kernel.org> writes:

> On 15.05.14 18:08:30, Alexander Shishkin wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>> 
>> This patch introduces "AUX space" in the perf mmap buffer, intended for
>> exporting high bandwidth data streams to userspace, such as instruction
>> flow traces.
>> 
>> AUX space is a ring buffer, defined by aux_{offset,size} fields in the
>> user_page structure, and read/write pointers aux_{head,tail}, which abide
>> by the same rules as data_* counterparts of the main perf buffer.
>> 
>> In order to allocate/mmap AUX, userspace needs to set up aux_offset to
>> such an offset that will be greater than data_offset+data_size and
>> aux_size to be the desired buffer size. Both need to be page aligned.
>> The latter is not forced to be a power of 2 number of pages, so that PMU
>> drivers have to take care of this if necessary in their implementations
>> of ->alloc_aux(). Then, same aux_offset and aux_size should be passed to
>> mmap() call and if everything adds up, you should have an AUX buffer as
>> a result.
>
> Raw data sample (PERF_SAMPLE_RAW) already allow to pass data to
> userspace. What is the reason for a separate buffer? I see the
> following:
>
>  * the buffer accessed from userspace should have a customized format
>    unrelated to perf,
>
>  * zero copying (not sure if this is really possible),
>
>  * data size and alignment requirements that can not be handled by raw
>    data samples.

It is all three. What we're dealing with at the moment is a unit that
generates instruction traces at rates like hhundreds of megabytes per
second. So the only sensible thing to do is zero copy the data.

> If possible, raw data samples should be a first option.
>
> On the other side there might the need of an aux buffer (I would
> better call it raw buffer analog to raw data). So here some comments
> on the implementation and the user i/f.
>
> Instead of extending the current ringbuffer I would better add a 2nd
> ring buffer. The mmap and ringbuffer handling and esp. the access from
> userspace looks much more complex now and it was already complex
> before. So with a 2nd buffer there would be less changes in the
> implementation and handling. Access from userland would remain as it
> currently is, there is no need to modify the header page layout. To
> setup a 2nd buffer, you just need to modify the mmap call in a way
> (maybe by prot, flags or offset argument) to let the kernel know an
> aux buffer should be attached to the event.

Tried that, didn't fly. See the full story here [1]. Though splicing
arbitrary pages into perf buffer is less than trivial, especially so
with hardware that doesn't do scatter-gather. So all in all, AUX area
seems to be a sensible compromise.

[1] http://marc.info/?l=linux-kernel&m=139264764530809&w=2

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] perf: add AUX area to ring buffer for raw data streams
  2014-05-15 15:08 ` [RFC 2/2] perf: add AUX area to ring buffer for raw data streams Alexander Shishkin
  2014-05-15 18:02   ` Robert Richter
@ 2014-05-19  8:58   ` Peter Zijlstra
  2014-05-19 12:57     ` Alexander Shishkin
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2014-05-19  8:58 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 3564 bytes --]

On Thu, May 15, 2014 at 06:08:30PM +0300, Alexander Shishkin wrote:
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -252,6 +252,19 @@ struct pmu {
>  	 * flush branch stack on context-switches (needed in cpu-wide mode)
>  	 */
>  	void (*flush_branch_stack)	(void);
> +
> +	/*
> +	 * Allocate AUX space buffer: return an array of @nr_pages pages to be
> +	 * mapped to userspace that will also be passed to ->free_aux.
> +	 */
> +	void *(*alloc_aux)		(int cpu, int nr_pages, bool overwrite,
> +					 struct perf_event_mmap_page *user_page);
> +					/* optional */
> +
> +	/*
> +	 * Free AUX buffer
> +	 */
> +	void (*free_aux)		(void *aux); /* optional */
>  };

I'm not entirely thrilled to expose it to the PMU like this.. I realize
you want this in order to get physically contiguous pages.

Are you aware of allocation constraints for other architectures?

>  #define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
> @@ -710,6 +726,18 @@ enum perf_event_type {
>  	 */
>  	PERF_RECORD_MMAP2			= 10,
>  
> +	/*
> +	 * Records that new data landed in the AUX buffer part.
> +	 *
> +	 * struct {
> +	 * 	struct perf_event_header	header;
> +	 *
> +	 * 	u64				aux_offset;
> +	 * 	u64				aux_size;
> +	 * };
> +	 */
> +	PERF_RECORD_AUX				= 11,
> +
>  	PERF_RECORD_MAX,			/* non-ABI */
>  };

Ideally the patch introducing this would also introduce code to generate
these records.

> @@ -4076,7 +4090,63 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  		return -EINVAL;
>  
>  	vma_size = vma->vm_end - vma->vm_start;
> -	nr_pages = (vma_size / PAGE_SIZE) - 1;
> +
> +	if (vma->vm_pgoff == 0) {
> +		nr_pages = (vma_size / PAGE_SIZE) - 1;
> +	} else {
> +		/*
> +		 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
> +		 * mapped, all subsequent mappings should have the same size
> +		 * and offset. Must be above the normal perf buffer.
> +		 */
> +		u64 aux_offset, aux_size;
> +
> +		if (!event->rb)
> +			return -EINVAL;
> +
> +		nr_pages = vma_size / PAGE_SIZE;
> +
> +		mutex_lock(&event->mmap_mutex);
> +		ret = -EINVAL;
> +
> +		rb = event->rb;
> +		if (!rb)
> +			goto aux_unlock;
> +
> +		aux_offset = ACCESS_ONCE(rb->user_page->aux_offset);
> +		aux_size = ACCESS_ONCE(rb->user_page->aux_size);
> +
> +		if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
> +			goto aux_unlock;
> +
> +		if (aux_offset != vma->vm_pgoff << PAGE_SHIFT)
> +			goto aux_unlock;
> +
> +		/* already mapped with a different offset */
> +		if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
> +			goto aux_unlock;
> +
> +		if (aux_size != vma_size || aux_size != nr_pages * PAGE_SIZE)
> +			goto aux_unlock;
> +
> +		/* already mapped with a different size */
> +		if (rb_has_aux(rb) && rb->aux_nr_pages != nr_pages)
> +			goto aux_unlock;
> +
> +		if (!atomic_inc_not_zero(&rb->mmap_count))
> +			goto aux_unlock;
> +
> +		if (rb_has_aux(rb)) {
> +			atomic_inc(&rb->aux_mmap_count);
> +			ret = 0;
> +			goto unlock;
> +		}
> +
> +		atomic_set(&rb->aux_mmap_count, 1);
> +		user_extra = nr_pages;
> +
> +		goto accounting;
> +	}

That appears to be missing a is_power_of_2(aux_size) check.

The problem with not having that is that since
perf_event_mmap_page::aux_{head,tail} are of Z mod 2^64 but your actual
{head,tail} are of Z mod aux_size, you need aux_size to be a full
divider of 2^64 or otherwise you get wrapping issues at the overflow.

Having it them all 2^n makes the divider trivial.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] perf: add AUX area to ring buffer for raw data streams
  2014-05-19  8:58   ` Peter Zijlstra
@ 2014-05-19 12:57     ` Alexander Shishkin
  2014-05-20  9:51       ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Alexander Shishkin @ 2014-05-19 12:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen

Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, May 15, 2014 at 06:08:30PM +0300, Alexander Shishkin wrote:
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -252,6 +252,19 @@ struct pmu {
>>  	 * flush branch stack on context-switches (needed in cpu-wide mode)
>>  	 */
>>  	void (*flush_branch_stack)	(void);
>> +
>> +	/*
>> +	 * Allocate AUX space buffer: return an array of @nr_pages pages to be
>> +	 * mapped to userspace that will also be passed to ->free_aux.
>> +	 */
>> +	void *(*alloc_aux)		(int cpu, int nr_pages, bool overwrite,
>> +					 struct perf_event_mmap_page *user_page);
>> +					/* optional */
>> +
>> +	/*
>> +	 * Free AUX buffer
>> +	 */
>> +	void (*free_aux)		(void *aux); /* optional */
>>  };
>
> I'm not entirely thrilled to expose it to the PMU like this.. I realize
> you want this in order to get physically contiguous pages.

Hmm, I guess we can have code in perf core to carry out the allocation
according to, say, contstraint flags and pass the page array down to the
PMU if that sounds like a cleaner thing to do?

> Are you aware of allocation constraints for other architectures?

Somewhat. ARM's trace memory controller supports both scatter-gather and
a plain contiguous buffer, I haven't found evidence of one being
available while the other one isn't, so I'm inclined to assume that if
it can write to system memory, it supports SG.

>>  #define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
>> @@ -710,6 +726,18 @@ enum perf_event_type {
>>  	 */
>>  	PERF_RECORD_MMAP2			= 10,
>>  
>> +	/*
>> +	 * Records that new data landed in the AUX buffer part.
>> +	 *
>> +	 * struct {
>> +	 * 	struct perf_event_header	header;
>> +	 *
>> +	 * 	u64				aux_offset;
>> +	 * 	u64				aux_size;
>> +	 * };
>> +	 */
>> +	PERF_RECORD_AUX				= 11,
>> +
>>  	PERF_RECORD_MAX,			/* non-ABI */
>>  };
>
> Ideally the patch introducing this would also introduce code to generate
> these records.

Fair enough.

>> @@ -4076,7 +4090,63 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>>  		return -EINVAL;
>>  
>>  	vma_size = vma->vm_end - vma->vm_start;
>> -	nr_pages = (vma_size / PAGE_SIZE) - 1;
>> +
>> +	if (vma->vm_pgoff == 0) {
>> +		nr_pages = (vma_size / PAGE_SIZE) - 1;
>> +	} else {
>> +		/*
>> +		 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
>> +		 * mapped, all subsequent mappings should have the same size
>> +		 * and offset. Must be above the normal perf buffer.
>> +		 */
>> +		u64 aux_offset, aux_size;
>> +
>> +		if (!event->rb)
>> +			return -EINVAL;
>> +
>> +		nr_pages = vma_size / PAGE_SIZE;
>> +
>> +		mutex_lock(&event->mmap_mutex);
>> +		ret = -EINVAL;
>> +
>> +		rb = event->rb;
>> +		if (!rb)
>> +			goto aux_unlock;
>> +
>> +		aux_offset = ACCESS_ONCE(rb->user_page->aux_offset);
>> +		aux_size = ACCESS_ONCE(rb->user_page->aux_size);
>> +
>> +		if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
>> +			goto aux_unlock;
>> +
>> +		if (aux_offset != vma->vm_pgoff << PAGE_SHIFT)
>> +			goto aux_unlock;
>> +
>> +		/* already mapped with a different offset */
>> +		if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
>> +			goto aux_unlock;
>> +
>> +		if (aux_size != vma_size || aux_size != nr_pages * PAGE_SIZE)
>> +			goto aux_unlock;
>> +
>> +		/* already mapped with a different size */
>> +		if (rb_has_aux(rb) && rb->aux_nr_pages != nr_pages)
>> +			goto aux_unlock;
>> +
>> +		if (!atomic_inc_not_zero(&rb->mmap_count))
>> +			goto aux_unlock;
>> +
>> +		if (rb_has_aux(rb)) {
>> +			atomic_inc(&rb->aux_mmap_count);
>> +			ret = 0;
>> +			goto unlock;
>> +		}
>> +
>> +		atomic_set(&rb->aux_mmap_count, 1);
>> +		user_extra = nr_pages;
>> +
>> +		goto accounting;
>> +	}
>
> That appears to be missing a is_power_of_2(aux_size) check.
>
> The problem with not having that is that since
> perf_event_mmap_page::aux_{head,tail} are of Z mod 2^64 but your actual
> {head,tail} are of Z mod aux_size, you need aux_size to be a full
> divider of 2^64 or otherwise you get wrapping issues at the overflow.
>
> Having it them all 2^n makes the divider trivial.

I left it out so that the PMU callback could decide if it wants to do
the math or not. Maybe it can also be a constraint flag or is it not
worth it at all?

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] perf: add AUX area to ring buffer for raw data streams
  2014-05-19 12:57     ` Alexander Shishkin
@ 2014-05-20  9:51       ` Peter Zijlstra
  2014-05-21 14:02         ` Alexander Shishkin
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2014-05-20  9:51 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen

On Mon, May 19, 2014 at 03:57:37PM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:

> > I'm not entirely thrilled to expose it to the PMU like this.. I realize
> > you want this in order to get physically contiguous pages.
> 
> Hmm, I guess we can have code in perf core to carry out the allocation
> according to, say, contstraint flags and pass the page array down to the
> PMU if that sounds like a cleaner thing to do?
> 
> > Are you aware of allocation constraints for other architectures?
> 
> Somewhat. ARM's trace memory controller supports both scatter-gather and
> a plain contiguous buffer, I haven't found evidence of one being
> available while the other one isn't, so I'm inclined to assume that if
> it can write to system memory, it supports SG.

I've just added a patch from Vince Weaver:

  http://lkml.kernel.org/r/alpine.DEB.2.10.1405161708060.11099@vincent-weaver-1.umelst.maine.edu

That adds pmu::capabilities, I suppose we could start with something
like:

  PERF_PMU_CAP_AUX_BROKEN_SG

which would make the allocator attempt to fill the AUX buffer with as
big a chunks of contiguous memory as is available.

> > That appears to be missing a is_power_of_2(aux_size) check.
> >
> > The problem with not having that is that since
> > perf_event_mmap_page::aux_{head,tail} are of Z mod 2^64 but your actual
> > {head,tail} are of Z mod aux_size, you need aux_size to be a full
> > divider of 2^64 or otherwise you get wrapping issues at the overflow.
> >
> > Having it them all 2^n makes the divider trivial.
> 
> I left it out so that the PMU callback could decide if it wants to do
> the math or not. Maybe it can also be a constraint flag or is it not
> worth it at all?

I'd start with the most constrained model -- that is add the power of
two test -- and worry about relaxing it if it turns out its really
needed.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] perf: add AUX area to ring buffer for raw data streams
  2014-05-20  9:51       ` Peter Zijlstra
@ 2014-05-21 14:02         ` Alexander Shishkin
  2014-06-05 11:58           ` Ingo Molnar
  0 siblings, 1 reply; 11+ messages in thread
From: Alexander Shishkin @ 2014-05-21 14:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, May 19, 2014 at 03:57:37PM +0300, Alexander Shishkin wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>
>> > I'm not entirely thrilled to expose it to the PMU like this.. I realize
>> > you want this in order to get physically contiguous pages.
>> 
>> Hmm, I guess we can have code in perf core to carry out the allocation
>> according to, say, contstraint flags and pass the page array down to the
>> PMU if that sounds like a cleaner thing to do?
>> 
>> > Are you aware of allocation constraints for other architectures?
>> 
>> Somewhat. ARM's trace memory controller supports both scatter-gather and
>> a plain contiguous buffer, I haven't found evidence of one being
>> available while the other one isn't, so I'm inclined to assume that if
>> it can write to system memory, it supports SG.
>
> I've just added a patch from Vince Weaver:
>
>   http://lkml.kernel.org/r/alpine.DEB.2.10.1405161708060.11099@vincent-weaver-1.umelst.maine.edu
>
> That adds pmu::capabilities, I suppose we could start with something
> like:
>
>   PERF_PMU_CAP_AUX_BROKEN_SG
>
> which would make the allocator attempt to fill the AUX buffer with as
> big a chunks of contiguous memory as is available.

Ok, how about this (on top of the previous patch):

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9643450..e2a6b6b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -278,15 +278,15 @@ struct pmu {
 	void (*flush_branch_stack)	(void);
 
 	/*
-	 * Allocate AUX space buffer: return an array of @nr_pages pages to be
-	 * mapped to userspace that will also be passed to ->free_aux.
+	 * Set up pmu-private data structures for an AUX area
 	 */
-	void *(*alloc_aux)		(int cpu, int nr_pages, bool overwrite,
+	void *(*setup_aux)		(int cpu, void **pages,
+					 int nr_pages, bool overwrite,
 					 struct perf_event_mmap_page *user_page);
 					/* optional */
 
 	/*
-	 * Free AUX buffer
+	 * Free pmu-private AUX data structures
 	 */
 	void (*free_aux)		(void *aux); /* optional */
 
@@ -300,6 +300,7 @@ struct pmu {
  * struct pmu::capabilities flags
  */
 #define PERF_PMU_CAP_NO_INTERRUPT	1
+#define PERF_PMU_CAP_AUX_BROKEN_SG	2
 
 /**
  * enum perf_event_active_state - the states of a event
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index ea51cfb..a06d7fe 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -41,6 +41,7 @@ struct ring_buffer {
 	atomic_t			aux_mmap_count;
 	unsigned long			aux_mmap_locked;
 	void				**aux_pages;
+	void				*aux_priv;
 	void				(*free_aux)(void *aux);
 
 	struct perf_event_mmap_page	*user_page;
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5935cb2..7f166f2 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -244,32 +244,96 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 	spin_lock_init(&rb->event_lock);
 }
 
+#define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
+
+static struct page *rb_alloc_aux_page(int node, int order)
+{
+	struct page *page;
+
+	if (order > MAX_ORDER)
+		order = MAX_ORDER;
+
+	do {
+		page = alloc_pages_node(node, PERF_AUX_GFP, order);
+	} while (!page && order--);
+
+	if (page && order) {
+		/*
+		 * Communicate the allocation size to the driver
+		 */
+		split_page(page, order);
+		SetPagePrivate(page);
+		set_page_private(page, order);
+	}
+
+	return page;
+}
+
+static void rb_free_aux_page(struct ring_buffer *rb, int idx)
+{
+	struct page *page = virt_to_page(rb->aux_pages[idx]);
+
+	ClearPagePrivate(page);
+	page->mapping = NULL;
+	__free_page(page);
+}
+
 int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 		 pgoff_t pgoff, int nr_pages, int flags)
 {
 	bool overwrite = !!(flags & RING_BUFFER_WRITABLE);
+	int pg, node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
+	int order = 0;
 
-	if (!event->pmu->alloc_aux)
+	if (!event->pmu->setup_aux)
 		return -ENOTSUPP;
 
-	rb->aux_pages = event->pmu->alloc_aux(event->cpu, nr_pages, overwrite,
-					      rb->user_page);
+	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_BROKEN_SG)
+		order = get_order(nr_pages * PAGE_SIZE);
+
+	rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
 	if (!rb->aux_pages)
 		return -ENOMEM;
 
+	for (pg = 0; pg < nr_pages;) {
+		struct page *page;
+		int last;
+
+		page = rb_alloc_aux_page(node, order);
+		if (!page)
+			goto err;
+
+		for (last = pg + (1 << page_private(page)); pg < last; pg++)
+			rb->aux_pages[pg] = page_address(page++);
+	}
+
+	rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages,
+					     overwrite, rb->user_page);
+	if (!rb->aux_priv) {
+		rb_free_aux(rb);
+		return -EINVAL;
+	}
+
 	rb->free_aux = event->pmu->free_aux;
 	rb->aux_pgoff = pgoff;
 	rb->aux_nr_pages = nr_pages;
 
 	return 0;
+err:
+	for (; pg >= 0; pg--)
+		rb_free_aux_page(rb, pg);
+
+	return -ENOMEM;
 }
 
 void rb_free_aux(struct ring_buffer *rb)
 {
-	if (WARN_ON_ONCE(!rb->free_aux))
-		return;
+	int pg;
+
+	for (pg = 0; pg < rb->aux_nr_pages; pg++)
+		rb_free_aux_page(rb, pg);
 
-	rb->free_aux(rb->aux_pages);
+	rb->free_aux(rb->aux_priv);
 	rb->aux_nr_pages = 0;
 }
 
>> > That appears to be missing a is_power_of_2(aux_size) check.
>> >
>> > The problem with not having that is that since
>> > perf_event_mmap_page::aux_{head,tail} are of Z mod 2^64 but your actual
>> > {head,tail} are of Z mod aux_size, you need aux_size to be a full
>> > divider of 2^64 or otherwise you get wrapping issues at the overflow.
>> >
>> > Having it them all 2^n makes the divider trivial.
>> 
>> I left it out so that the PMU callback could decide if it wants to do
>> the math or not. Maybe it can also be a constraint flag or is it not
>> worth it at all?
>
> I'd start with the most constrained model -- that is add the power of
> two test -- and worry about relaxing it if it turns out its really
> needed.

Makes sense, I'll put it back.

Regards,
--
Alex

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] perf: add AUX area to ring buffer for raw data streams
  2014-05-21 14:02         ` Alexander Shishkin
@ 2014-06-05 11:58           ` Ingo Molnar
  0 siblings, 0 replies; 11+ messages in thread
From: Ingo Molnar @ 2014-06-05 11:58 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen


* Alexander Shishkin <alexander.shishkin@linux.intel.com> wrote:

> @@ -300,6 +300,7 @@ struct pmu {
>   * struct pmu::capabilities flags
>   */
>  #define PERF_PMU_CAP_NO_INTERRUPT	1
> +#define PERF_PMU_CAP_AUX_BROKEN_SG	2

Btw., I'd name it PERF_PMU_AUX_PHYSICAL_SG or so, and add a comment 
before it that explains that such hardware interfaces are restrictive, 
or so.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-06-05 11:58 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-05-15 15:08 [RFC 0/2] perf: add AUX space to ring_buffer Alexander Shishkin
2014-05-15 15:08 ` [RFC 1/2] perf: add data_{offset,size} to user_page Alexander Shishkin
2014-05-15 18:14   ` Robert Richter
2014-05-15 15:08 ` [RFC 2/2] perf: add AUX area to ring buffer for raw data streams Alexander Shishkin
2014-05-15 18:02   ` Robert Richter
2014-05-16  7:07     ` Alexander Shishkin
2014-05-19  8:58   ` Peter Zijlstra
2014-05-19 12:57     ` Alexander Shishkin
2014-05-20  9:51       ` Peter Zijlstra
2014-05-21 14:02         ` Alexander Shishkin
2014-06-05 11:58           ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox