linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] perf session: Add option to copy events when queueing
@ 2013-09-06 19:37 David Ahern
  2013-09-14 16:16 ` Frederic Weisbecker
  2013-10-02 12:18 ` Jiri Olsa
  0 siblings, 2 replies; 14+ messages in thread
From: David Ahern @ 2013-09-06 19:37 UTC (permalink / raw)
  To: acme, linux-kernel
  Cc: David Ahern, Frederic Weisbecker, Ingo Molnar, Jiri Olsa,
	Mike Galbraith, Namhyung Kim, Peter Zijlstra, Stephane Eranian

When processing events the session code has an ordered samples queue which is
used to time-sort events coming in across multiple mmaps. At a later point in
time samples on the queue are flushed up to some timestamp at which point the
event is actually processed.

When analyzing events live (ie., record/analysis path in the same command)
there is a race that leads to corrupted events and parse errors which cause
perf to terminate. The problem is that when the event is placed in the ordered
samples queue it is only a reference to the event which is really sitting in
the mmap buffer. Even though the event is queued for later processing the mmap
tail pointer is updated which indicates to the kernel that the event has been
processed. The race is flushing the event from the queue before it gets
overwritten by some other event. For commands trying to process events live
(versus just writing to a file) and processing a high rate of events this leads
to parse failures and perf terminates.

Examples hitting this problem are 'perf kvm stat live', especially with nested
VMs which generate 100,000+ traces per second, and a command processing
scheduling events with a high rate of context switching -- e.g., running
'perf bench sched pipe'.

This patch offers live commands an option to copy the event when it is placed in
the ordered samples queue.

Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/session.c |   17 ++++++++++++++---
 tools/perf/util/session.h |    1 +
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 1b185ca..71f16db 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -483,6 +483,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session)
 
 		sq = list_entry(os->to_free.next, struct sample_queue, list);
 		list_del(&sq->list);
+		if (session->copy_on_queue)
+			free(sq->event);
 		free(sq);
 	}
 }
@@ -513,11 +515,15 @@ static int flush_sample_queue(struct perf_session *s,
 			break;
 
 		ret = perf_evlist__parse_sample(s->evlist, iter->event, &sample);
-		if (ret)
+		if (ret) {
 			pr_err("Can't parse sample, err = %d\n", ret);
-		else {
+			if (s->copy_on_queue)
+				free(iter->event);
+		} else {
 			ret = perf_session_deliver_event(s, iter->event, &sample, tool,
 							 iter->file_offset);
+			if (s->copy_on_queue)
+				free(iter->event);
 			if (ret)
 				return ret;
 		}
@@ -676,7 +682,12 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,
 
 	new->timestamp = timestamp;
 	new->file_offset = file_offset;
-	new->event = event;
+
+	if (s->copy_on_queue) {
+		new->event = malloc(event->header.size);
+		memcpy(new->event, event, event->header.size);
+	} else
+		new->event = event;
 
 	__queue_event(new, s);
 
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 3aa75fb..4adfcbb 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -38,6 +38,7 @@ struct perf_session {
 	bool			fd_pipe;
 	bool			repipe;
 	struct ordered_samples	ordered_samples;
+	bool			copy_on_queue;
 	char			filename[1];
 };
 
-- 
1.7.10.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf session: Add option to copy events when queueing
  2013-09-06 19:37 [PATCH] perf session: Add option to copy events when queueing David Ahern
@ 2013-09-14 16:16 ` Frederic Weisbecker
  2013-09-14 17:25   ` David Ahern
  2013-10-24  9:30   ` Frederic Weisbecker
  2013-10-02 12:18 ` Jiri Olsa
  1 sibling, 2 replies; 14+ messages in thread
From: Frederic Weisbecker @ 2013-09-14 16:16 UTC (permalink / raw)
  To: David Ahern
  Cc: acme, linux-kernel, Ingo Molnar, Jiri Olsa, Mike Galbraith,
	Namhyung Kim, Peter Zijlstra, Stephane Eranian

On Fri, Sep 06, 2013 at 01:37:01PM -0600, David Ahern wrote:
> When processing events the session code has an ordered samples queue which is
> used to time-sort events coming in across multiple mmaps. At a later point in
> time samples on the queue are flushed up to some timestamp at which point the
> event is actually processed.
> 
> When analyzing events live (ie., record/analysis path in the same command)
> there is a race that leads to corrupted events and parse errors which cause
> perf to terminate. The problem is that when the event is placed in the ordered
> samples queue it is only a reference to the event which is really sitting in
> the mmap buffer. Even though the event is queued for later processing the mmap
> tail pointer is updated which indicates to the kernel that the event has been
> processed. The race is flushing the event from the queue before it gets
> overwritten by some other event. For commands trying to process events live
> (versus just writing to a file) and processing a high rate of events this leads
> to parse failures and perf terminates.
> 
> Examples hitting this problem are 'perf kvm stat live', especially with nested
> VMs which generate 100,000+ traces per second, and a command processing
> scheduling events with a high rate of context switching -- e.g., running
> 'perf bench sched pipe'.
> 
> This patch offers live commands an option to copy the event when it is placed in
> the ordered samples queue.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Stephane Eranian <eranian@google.com>
> ---
>  tools/perf/util/session.c |   17 ++++++++++++++---
>  tools/perf/util/session.h |    1 +
>  2 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 1b185ca..71f16db 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -483,6 +483,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session)
>  
>  		sq = list_entry(os->to_free.next, struct sample_queue, list);
>  		list_del(&sq->list);
> +		if (session->copy_on_queue)
> +			free(sq->event);
>  		free(sq);
>  	}
>  }
> @@ -513,11 +515,15 @@ static int flush_sample_queue(struct perf_session *s,
>  			break;
>  
>  		ret = perf_evlist__parse_sample(s->evlist, iter->event, &sample);
> -		if (ret)
> +		if (ret) {
>  			pr_err("Can't parse sample, err = %d\n", ret);
> -		else {
> +			if (s->copy_on_queue)
> +				free(iter->event);
> +		} else {
>  			ret = perf_session_deliver_event(s, iter->event, &sample, tool,
>  							 iter->file_offset);
> +			if (s->copy_on_queue)
> +				free(iter->event);
>  			if (ret)
>  				return ret;
>  		}
> @@ -676,7 +682,12 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,
>  
>  	new->timestamp = timestamp;
>  	new->file_offset = file_offset;
> -	new->event = event;
> +
> +	if (s->copy_on_queue) {
> +		new->event = malloc(event->header.size);
> +		memcpy(new->event, event, event->header.size);
> +	} else
> +		new->event = event;
>  
>  	__queue_event(new, s);
>  
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index 3aa75fb..4adfcbb 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -38,6 +38,7 @@ struct perf_session {
>  	bool			fd_pipe;
>  	bool			repipe;
>  	struct ordered_samples	ordered_samples;
> +	bool			copy_on_queue;

So do you think it should stay optional? This looks like a global problem, I mean
the event can be unmapped anytime for any builtin tool mapping it, right?

Also we already allocate the sample list node (struct sample_queue) from os->sample
buffer. ie: we have our own allocator there.

Probably we should reuse that and include the copied event space in "struct sample_queue"?

Also looking at it now, it seems we have a bug on the existing code:


        if (!list_empty(sc)) {
                new = list_entry(sc->next, struct sample_queue, list);
                list_del(&new->list);
        } else if (os->sample_buffer) {
                new = os->sample_buffer + os->sample_buffer_idx;
                if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
                        os->sample_buffer = NULL;
        } else {
               os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
               if (!os->sample_buffer)
                        return -ENOMEM;
               list_add(&os->sample_buffer->list, &os->to_free);
               os->sample_buffer_idx = 2;
               new = os->sample_buffer + 1;
        }

If we actually run out of buffer rooms, we should realloc right after and not
wait for the next entry, otherwise we loose an event:

        if (!list_empty(sc)) {
                new = list_entry(sc->next, struct sample_queue, list);
                list_del(&new->list);
        } else {
                if (os->sample_buffer) {
                        new = os->sample_buffer + os->sample_buffer_idx;
                        if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
                                os->sample_buffer = NULL;
                }

                if (!os->sample_buffer) {
	                os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
                        if (!os->sample_buffer)
                                return -ENOMEM;
                        list_add(&os->sample_buffer->list, &os->to_free);
                        os->sample_buffer_idx = 2;
                        new = os->sample_buffer + 1;
        }


Although the mirrored os->sample_buffer condition check is a bit ugly and should move to
a function. But the idea is there.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf session: Add option to copy events when queueing
  2013-09-14 16:16 ` Frederic Weisbecker
@ 2013-09-14 17:25   ` David Ahern
  2013-09-16 16:40     ` Frederic Weisbecker
  2013-10-24  9:30   ` Frederic Weisbecker
  1 sibling, 1 reply; 14+ messages in thread
From: David Ahern @ 2013-09-14 17:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: acme, linux-kernel, Ingo Molnar, Jiri Olsa, Mike Galbraith,
	Namhyung Kim, Peter Zijlstra, Stephane Eranian

On 9/14/13 10:16 AM, Frederic Weisbecker wrote:
>> @@ -676,7 +682,12 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,
>>
>>   	new->timestamp = timestamp;
>>   	new->file_offset = file_offset;
>> -	new->event = event;
>> +
>> +	if (s->copy_on_queue) {
>> +		new->event = malloc(event->header.size);
>> +		memcpy(new->event, event, event->header.size);
>> +	} else
>> +		new->event = event;

---8<---

> So do you think it should stay optional? This looks like a global problem, I mean
> the event can be unmapped anytime for any builtin tool mapping it, right?

Yes. I could make it the default behavior; just overhead in doing that 
(malloc/copy for each event).

>
> Also we already allocate the sample list node (struct sample_queue) from os->sample
> buffer. ie: we have our own allocator there.
>
> Probably we should reuse that and include the copied event space in "struct sample_queue"?


Right, that's where I put the malloc and copy - I kept the relevant 
change above. I take it you are thinking of something different but I am 
not following you. You definitely do NOT want to change struct 
sample_queue to include an event - like this:

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 51f5edf..866944a 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -491,7 +491,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
  struct sample_queue {
     u64         timestamp;
     u64         file_offset;
-   union perf_event    *event;
+   union perf_event    event;
     struct list_head    list;
  };

size of event is determined by mmap_event (mmap2_event in latest code) 
which is > 4096 because of the filename argument. Including the event 
directly in sample_queue would balloon memory usage (learned this the 
hard way!).

>
> Also looking at it now, it seems we have a bug on the existing code:
>
>
>          if (!list_empty(sc)) {
>                  new = list_entry(sc->next, struct sample_queue, list);
>                  list_del(&new->list);
>          } else if (os->sample_buffer) {
>                  new = os->sample_buffer + os->sample_buffer_idx;
>                  if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
>                          os->sample_buffer = NULL;
>          } else {
>                 os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
>                 if (!os->sample_buffer)
>                          return -ENOMEM;
>                 list_add(&os->sample_buffer->list, &os->to_free);
>                 os->sample_buffer_idx = 2;
>                 new = os->sample_buffer + 1;
>          }
>
> If we actually run out of buffer rooms, we should realloc right after and not
> wait for the next entry, otherwise we loose an event:
>
>          if (!list_empty(sc)) {
>                  new = list_entry(sc->next, struct sample_queue, list);
>                  list_del(&new->list);
>          } else {
>                  if (os->sample_buffer) {
>                          new = os->sample_buffer + os->sample_buffer_idx;
>                          if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
>                                  os->sample_buffer = NULL;
>                  }
>
>                  if (!os->sample_buffer) {
> 	                os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
>                          if (!os->sample_buffer)
>                                  return -ENOMEM;
>                          list_add(&os->sample_buffer->list, &os->to_free);
>                          os->sample_buffer_idx = 2;
>                          new = os->sample_buffer + 1;
>          }
>
>
> Although the mirrored os->sample_buffer condition check is a bit ugly and should move to
> a function. But the idea is there.

Ok. That should be a separate patch. Are you going to submit that one?

David


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf session: Add option to copy events when queueing
  2013-09-14 17:25   ` David Ahern
@ 2013-09-16 16:40     ` Frederic Weisbecker
  2013-09-17  4:14       ` David Ahern
  0 siblings, 1 reply; 14+ messages in thread
From: Frederic Weisbecker @ 2013-09-16 16:40 UTC (permalink / raw)
  To: David Ahern
  Cc: acme, linux-kernel, Ingo Molnar, Jiri Olsa, Mike Galbraith,
	Namhyung Kim, Peter Zijlstra, Stephane Eranian

On Sat, Sep 14, 2013 at 11:25:40AM -0600, David Ahern wrote:
> On 9/14/13 10:16 AM, Frederic Weisbecker wrote:
> >>@@ -676,7 +682,12 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,
> >>
> >>  	new->timestamp = timestamp;
> >>  	new->file_offset = file_offset;
> >>-	new->event = event;
> >>+
> >>+	if (s->copy_on_queue) {
> >>+		new->event = malloc(event->header.size);
> >>+		memcpy(new->event, event, event->header.size);
> >>+	} else
> >>+		new->event = event;
> 
> ---8<---
> 
> >So do you think it should stay optional? This looks like a global problem, I mean
> >the event can be unmapped anytime for any builtin tool mapping it, right?
> 
> Yes. I could make it the default behavior; just overhead in doing
> that (malloc/copy for each event).

Are there any tool that don't suffer from this bug somehow? If not then it must
be applied unconditionally.

> 
> >
> >Also we already allocate the sample list node (struct sample_queue) from os->sample
> >buffer. ie: we have our own allocator there.
> >
> >Probably we should reuse that and include the copied event space in "struct sample_queue"?
> 
> 
> Right, that's where I put the malloc and copy - I kept the relevant
> change above. I take it you are thinking of something different but
> I am not following you. You definitely do NOT want to change struct
> sample_queue to include an event - like this:
> 
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 51f5edf..866944a 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -491,7 +491,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
>  struct sample_queue {
>     u64         timestamp;
>     u64         file_offset;
> -   union perf_event    *event;
> +   union perf_event    event;

Right that's roughly what I thought.

>     struct list_head    list;
>  };
> 
> size of event is determined by mmap_event (mmap2_event in latest
> code) which is > 4096 because of the filename argument. Including
> the event directly in sample_queue would balloon memory usage
> (learned this the hard way!).

Ah then perhaps we can allocate with the dynamic size of the event?

> 
> >
> >Also looking at it now, it seems we have a bug on the existing code:
> >
> >
> >         if (!list_empty(sc)) {
> >                 new = list_entry(sc->next, struct sample_queue, list);
> >                 list_del(&new->list);
> >         } else if (os->sample_buffer) {
> >                 new = os->sample_buffer + os->sample_buffer_idx;
> >                 if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
> >                         os->sample_buffer = NULL;
> >         } else {
> >                os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
> >                if (!os->sample_buffer)
> >                         return -ENOMEM;
> >                list_add(&os->sample_buffer->list, &os->to_free);
> >                os->sample_buffer_idx = 2;
> >                new = os->sample_buffer + 1;
> >         }
> >
> >If we actually run out of buffer rooms, we should realloc right after and not
> >wait for the next entry, otherwise we loose an event:
> >
> >         if (!list_empty(sc)) {
> >                 new = list_entry(sc->next, struct sample_queue, list);
> >                 list_del(&new->list);
> >         } else {
> >                 if (os->sample_buffer) {
> >                         new = os->sample_buffer + os->sample_buffer_idx;
> >                         if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
> >                                 os->sample_buffer = NULL;
> >                 }
> >
> >                 if (!os->sample_buffer) {
> >	                os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
> >                         if (!os->sample_buffer)
> >                                 return -ENOMEM;
> >                         list_add(&os->sample_buffer->list, &os->to_free);
> >                         os->sample_buffer_idx = 2;
> >                         new = os->sample_buffer + 1;
> >         }
> >
> >
> >Although the mirrored os->sample_buffer condition check is a bit ugly and should move to
> >a function. But the idea is there.
> 
> Ok. That should be a separate patch. Are you going to submit that one?

Yeah, unless you beat me at it :)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf session: Add option to copy events when queueing
  2013-09-16 16:40     ` Frederic Weisbecker
@ 2013-09-17  4:14       ` David Ahern
  0 siblings, 0 replies; 14+ messages in thread
From: David Ahern @ 2013-09-17  4:14 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: acme, linux-kernel, Ingo Molnar, Jiri Olsa, Mike Galbraith,
	Namhyung Kim, Peter Zijlstra, Stephane Eranian

On 9/16/13 10:40 AM, Frederic Weisbecker wrote:
>> Yes. I could make it the default behavior; just overhead in doing
>> that (malloc/copy for each event).
>
> Are there any tool that don't suffer from this bug somehow? If not then it must
> be applied unconditionally.

I believe only 'live' commands would be affected: top, kvm stat live, my 
local scheduling daemon. perf-trace bypasses session and time ordering 
(though I question that decision).

---8<---

>> size of event is determined by mmap_event (mmap2_event in latest
>> code) which is > 4096 because of the filename argument. Including
>> the event directly in sample_queue would balloon memory usage
>> (learned this the hard way!).
>
> Ah then perhaps we can allocate with the dynamic size of the event?

Yes, that's how I have it my patch -- allocate memory based on header 
size and copy in the queue_event function. I think we agree on this part 
now.

---8<---

>>> Although the mirrored os->sample_buffer condition check is a bit ugly and should move to
>>> a function. But the idea is there.
>>
>> Ok. That should be a separate patch. Are you going to submit that one?
>
> Yeah, unless you beat me at it :)

That's not going to happen on my end due to a recent time constraint 
(jury duty - minimum 1 week trial). At best I can do a patch in early 
October.

David


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf session: Add option to copy events when queueing
  2013-09-06 19:37 [PATCH] perf session: Add option to copy events when queueing David Ahern
  2013-09-14 16:16 ` Frederic Weisbecker
@ 2013-10-02 12:18 ` Jiri Olsa
  2013-10-02 12:38   ` Jiri Olsa
  1 sibling, 1 reply; 14+ messages in thread
From: Jiri Olsa @ 2013-10-02 12:18 UTC (permalink / raw)
  To: David Ahern
  Cc: acme, linux-kernel, Frederic Weisbecker, Ingo Molnar,
	Mike Galbraith, Namhyung Kim, Peter Zijlstra, Stephane Eranian

On Fri, Sep 06, 2013 at 01:37:01PM -0600, David Ahern wrote:
> When processing events the session code has an ordered samples queue which is
> used to time-sort events coming in across multiple mmaps. At a later point in
> time samples on the queue are flushed up to some timestamp at which point the
> event is actually processed.
> 
> When analyzing events live (ie., record/analysis path in the same command)
> there is a race that leads to corrupted events and parse errors which cause
> perf to terminate. The problem is that when the event is placed in the ordered
> samples queue it is only a reference to the event which is really sitting in
> the mmap buffer. Even though the event is queued for later processing the mmap
> tail pointer is updated which indicates to the kernel that the event has been
> processed. The race is flushing the event from the queue before it gets
> overwritten by some other event. For commands trying to process events live
> (versus just writing to a file) and processing a high rate of events this leads
> to parse failures and perf terminates.

hi,
'perf top -G dwarf' hit this due to longer callchains
processing your patch fixed that, thanks! ;-)

> 
> Examples hitting this problem are 'perf kvm stat live', especially with nested
> VMs which generate 100,000+ traces per second, and a command processing
> scheduling events with a high rate of context switching -- e.g., running
> 'perf bench sched pipe'.
> 
> This patch offers live commands an option to copy the event when it is placed in
> the ordered samples queue.
> 

SNIP

>  
>  	new->timestamp = timestamp;
>  	new->file_offset = file_offset;
> -	new->event = event;
> +
> +	if (s->copy_on_queue) {
> +		new->event = malloc(event->header.size);
> +		memcpy(new->event, event, event->header.size);

missing allocation error check

jirka

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf session: Add option to copy events when queueing
  2013-10-02 12:18 ` Jiri Olsa
@ 2013-10-02 12:38   ` Jiri Olsa
  2013-10-02 12:52     ` David Ahern
  0 siblings, 1 reply; 14+ messages in thread
From: Jiri Olsa @ 2013-10-02 12:38 UTC (permalink / raw)
  To: David Ahern
  Cc: acme, linux-kernel, Frederic Weisbecker, Ingo Molnar,
	Mike Galbraith, Namhyung Kim, Peter Zijlstra, Stephane Eranian

On Wed, Oct 02, 2013 at 02:18:21PM +0200, Jiri Olsa wrote:
> On Fri, Sep 06, 2013 at 01:37:01PM -0600, David Ahern wrote:
> > When processing events the session code has an ordered samples queue which is
> > used to time-sort events coming in across multiple mmaps. At a later point in
> > time samples on the queue are flushed up to some timestamp at which point the
> > event is actually processed.
> > 
> > When analyzing events live (ie., record/analysis path in the same command)
> > there is a race that leads to corrupted events and parse errors which cause
> > perf to terminate. The problem is that when the event is placed in the ordered
> > samples queue it is only a reference to the event which is really sitting in
> > the mmap buffer. Even though the event is queued for later processing the mmap
> > tail pointer is updated which indicates to the kernel that the event has been
> > processed. The race is flushing the event from the queue before it gets
> > overwritten by some other event. For commands trying to process events live
> > (versus just writing to a file) and processing a high rate of events this leads
> > to parse failures and perf terminates.
> 
> hi,
> 'perf top -G dwarf' hit this due to longer callchains
> processing your patch fixed that, thanks! ;-)

ugh... now looking closer on top processing, it does not
seem to use queues ;-\ must be something else then..


> 
> > 
> > Examples hitting this problem are 'perf kvm stat live', especially with nested
> > VMs which generate 100,000+ traces per second, and a command processing
> > scheduling events with a high rate of context switching -- e.g., running
> > 'perf bench sched pipe'.
> >
> > This patch offers live commands an option to copy the event when it is
> > placed in
> > the ordered samples queue.

So I guess you have some other patch that actually sets
session::copy_on_queue?

jirka

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf session: Add option to copy events when queueing
  2013-10-02 12:38   ` Jiri Olsa
@ 2013-10-02 12:52     ` David Ahern
  0 siblings, 0 replies; 14+ messages in thread
From: David Ahern @ 2013-10-02 12:52 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: acme, linux-kernel, Frederic Weisbecker, Ingo Molnar,
	Mike Galbraith, Namhyung Kim, Peter Zijlstra, Stephane Eranian

On 10/2/13 6:38 AM, Jiri Olsa wrote:
>>> Examples hitting this problem are 'perf kvm stat live', especially with nested
>>> VMs which generate 100,000+ traces per second, and a command processing
>>> scheduling events with a high rate of context switching -- e.g., running
>>> 'perf bench sched pipe'.
>>>
>>> This patch offers live commands an option to copy the event when it is
>>> placed in
>>> the ordered samples queue.
>
> So I guess you have some other patch that actually sets
> session::copy_on_queue?

Yes, with this patch commands have to set session->copy_on_queue to true.

My latest perf-sched-daemon code actually solves this another way: 
rather than asking perf-session to copy the events, the command itself 
does and passes the copied event to the perf-session processing code.

I like this design better because the command itself controls the 
allocation and free which the daemon needs because the events are added 
to another queue so the flow is:

    read from mmmap --> copy event --> pass to session code for time 
ordering --> process time ordered events

The daemon puts the time ordered events into a time-limited queue and 
only processes the event when requested.

If that is too confusing take a look at:

https://github.com/dsahern/linux/blob/perf-sched-timehist-3.11/tools/perf/schedmon.c

process_event is the handler for sample events coming out of the mmaps. 
It allocates memory, copies the event and then calls 
perf_session_queue_event on the copy.

David


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf session: Add option to copy events when queueing
  2013-09-14 16:16 ` Frederic Weisbecker
  2013-09-14 17:25   ` David Ahern
@ 2013-10-24  9:30   ` Frederic Weisbecker
  2013-10-24 10:23     ` David Ahern
  1 sibling, 1 reply; 14+ messages in thread
From: Frederic Weisbecker @ 2013-10-24  9:30 UTC (permalink / raw)
  To: David Ahern
  Cc: Arnaldo Melo, LKML, Ingo Molnar, Jiri Olsa, Mike Galbraith,
	Namhyung Kim, Peter Zijlstra, Stephane Eranian

2013/9/14 Frederic Weisbecker <fweisbec@gmail.com>:
> On Fri, Sep 06, 2013 at 01:37:01PM -0600, David Ahern wrote:
>> When processing events the session code has an ordered samples queue which is
>> used to time-sort events coming in across multiple mmaps. At a later point in
>> time samples on the queue are flushed up to some timestamp at which point the
>> event is actually processed.
>>
>> When analyzing events live (ie., record/analysis path in the same command)
>> there is a race that leads to corrupted events and parse errors which cause
>> perf to terminate. The problem is that when the event is placed in the ordered
>> samples queue it is only a reference to the event which is really sitting in
>> the mmap buffer. Even though the event is queued for later processing the mmap
>> tail pointer is updated which indicates to the kernel that the event has been
>> processed. The race is flushing the event from the queue before it gets
>> overwritten by some other event. For commands trying to process events live
>> (versus just writing to a file) and processing a high rate of events this leads
>> to parse failures and perf terminates.
>>
>> Examples hitting this problem are 'perf kvm stat live', especially with nested
>> VMs which generate 100,000+ traces per second, and a command processing
>> scheduling events with a high rate of context switching -- e.g., running
>> 'perf bench sched pipe'.
>>
>> This patch offers live commands an option to copy the event when it is placed in
>> the ordered samples queue.
>>
>> Signed-off-by: David Ahern <dsahern@gmail.com>
>> Cc: Frederic Weisbecker <fweisbec@gmail.com>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: Jiri Olsa <jolsa@redhat.com>
>> Cc: Mike Galbraith <efault@gmx.de>
>> Cc: Namhyung Kim <namhyung@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Stephane Eranian <eranian@google.com>
>> ---
>>  tools/perf/util/session.c |   17 ++++++++++++++---
>>  tools/perf/util/session.h |    1 +
>>  2 files changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
>> index 1b185ca..71f16db 100644
>> --- a/tools/perf/util/session.c
>> +++ b/tools/perf/util/session.c
>> @@ -483,6 +483,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session)
>>
>>               sq = list_entry(os->to_free.next, struct sample_queue, list);
>>               list_del(&sq->list);
>> +             if (session->copy_on_queue)
>> +                     free(sq->event);
>>               free(sq);
>>       }
>>  }
>> @@ -513,11 +515,15 @@ static int flush_sample_queue(struct perf_session *s,
>>                       break;
>>
>>               ret = perf_evlist__parse_sample(s->evlist, iter->event, &sample);
>> -             if (ret)
>> +             if (ret) {
>>                       pr_err("Can't parse sample, err = %d\n", ret);
>> -             else {
>> +                     if (s->copy_on_queue)
>> +                             free(iter->event);
>> +             } else {
>>                       ret = perf_session_deliver_event(s, iter->event, &sample, tool,
>>                                                        iter->file_offset);
>> +                     if (s->copy_on_queue)
>> +                             free(iter->event);
>>                       if (ret)
>>                               return ret;
>>               }
>> @@ -676,7 +682,12 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,
>>
>>       new->timestamp = timestamp;
>>       new->file_offset = file_offset;
>> -     new->event = event;
>> +
>> +     if (s->copy_on_queue) {
>> +             new->event = malloc(event->header.size);
>> +             memcpy(new->event, event, event->header.size);
>> +     } else
>> +             new->event = event;
>>
>>       __queue_event(new, s);
>>
>> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
>> index 3aa75fb..4adfcbb 100644
>> --- a/tools/perf/util/session.h
>> +++ b/tools/perf/util/session.h
>> @@ -38,6 +38,7 @@ struct perf_session {
>>       bool                    fd_pipe;
>>       bool                    repipe;
>>       struct ordered_samples  ordered_samples;
>> +     bool                    copy_on_queue;
>
> So do you think it should stay optional? This looks like a global problem, I mean
> the event can be unmapped anytime for any builtin tool mapping it, right?
>
> Also we already allocate the sample list node (struct sample_queue) from os->sample
> buffer. ie: we have our own allocator there.
>
> Probably we should reuse that and include the copied event space in "struct sample_queue"?
>
> Also looking at it now, it seems we have a bug on the existing code:
>
>
>         if (!list_empty(sc)) {
>                 new = list_entry(sc->next, struct sample_queue, list);
>                 list_del(&new->list);
>         } else if (os->sample_buffer) {
>                 new = os->sample_buffer + os->sample_buffer_idx;
>                 if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
>                         os->sample_buffer = NULL;
>         } else {
>                os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
>                if (!os->sample_buffer)
>                         return -ENOMEM;
>                list_add(&os->sample_buffer->list, &os->to_free);
>                os->sample_buffer_idx = 2;
>                new = os->sample_buffer + 1;
>         }
>
> If we actually run out of buffer rooms, we should realloc right after and not
> wait for the next entry, otherwise we loose an event:

Bah, checking that again, there don't seem to be a bug there. Actually
the sample buffer is reset after we pick the last entry. So it looks
all fine. I got confused as usual. Nevermind.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf session: Add option to copy events when queueing
  2013-10-24  9:30   ` Frederic Weisbecker
@ 2013-10-24 10:23     ` David Ahern
  2013-10-24 12:27       ` Arnaldo Melo
  0 siblings, 1 reply; 14+ messages in thread
From: David Ahern @ 2013-10-24 10:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Arnaldo Melo, LKML, Ingo Molnar, Jiri Olsa, Mike Galbraith,
	Namhyung Kim, Peter Zijlstra, Stephane Eranian

On 10/24/13 10:30 AM, Frederic Weisbecker wrote:
> Bah, checking that again, there don't seem to be a bug there. Actually
> the sample buffer is reset after we pick the last entry. So it looks
> all fine. I got confused as usual. Nevermind.

Ok. I had not come back to this thread since I decided on a different 
route for the event copying. I'll take it out of my to-do list.

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf session: Add option to copy events when queueing
  2013-10-24 10:23     ` David Ahern
@ 2013-10-24 12:27       ` Arnaldo Melo
  2013-10-24 13:12         ` David Ahern
  0 siblings, 1 reply; 14+ messages in thread
From: Arnaldo Melo @ 2013-10-24 12:27 UTC (permalink / raw)
  To: David Ahern
  Cc: Frederic Weisbecker, LKML, Ingo Molnar, Jiri Olsa, Mike Galbraith,
	Namhyung Kim, Peter Zijlstra, Stephane Eranian

Em Thu, Oct 24, 2013 at 11:23:32AM +0100, David Ahern escreveu:
> On 10/24/13 10:30 AM, Frederic Weisbecker wrote:
> >Bah, checking that again, there don't seem to be a bug there. Actually
> >the sample buffer is reset after we pick the last entry. So it looks
> >all fine. I got confused as usual. Nevermind.
> 
> Ok. I had not come back to this thread since I decided on a
> different route for the event copying. I'll take it out of my to-do
> list.

Can you elaborate on that?

I had this feeling that perhaps we could defer copying the event till it
would be overwritten, something like making a range read only and then
when the event would be _really_ consumed the tooling would mark it as
so... Have to think about it more tho :-\

- Arnaldo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf session: Add option to copy events when queueing
  2013-10-24 12:27       ` Arnaldo Melo
@ 2013-10-24 13:12         ` David Ahern
  2013-10-24 14:07           ` Arnaldo Melo
  0 siblings, 1 reply; 14+ messages in thread
From: David Ahern @ 2013-10-24 13:12 UTC (permalink / raw)
  To: Arnaldo Melo
  Cc: Frederic Weisbecker, LKML, Ingo Molnar, Jiri Olsa, Mike Galbraith,
	Namhyung Kim, Peter Zijlstra, Stephane Eranian

On 10/24/13 1:27 PM, Arnaldo Melo wrote:
> Em Thu, Oct 24, 2013 at 11:23:32AM +0100, David Ahern escreveu:
>> On 10/24/13 10:30 AM, Frederic Weisbecker wrote:
>>> Bah, checking that again, there don't seem to be a bug there. Actually
>>> the sample buffer is reset after we pick the last entry. So it looks
>>> all fine. I got confused as usual. Nevermind.
>>
>> Ok. I had not come back to this thread since I decided on a
>> different route for the event copying. I'll take it out of my to-do
>> list.
>
> Can you elaborate on that?

The driving use case is my perf-daemon:
https://github.com/dsahern/linux/blob/perf-sched-timehist-3.11/tools/perf/schedmon.c, 
line 271.

Rather than have the perf infrastructure manage the allocation and 
copies I decided to have the daemon do it. The session infrastructure is 
only used for time sorting.

This works better because the event that pops out of the session 
ordering is put onto another list and it gives the daemon control of 
when memory is allocated and freed and the event only has to be copied once.

>
> I had this feeling that perhaps we could defer copying the event till it
> would be overwritten, something like making a range read only and then
> when the event would be _really_ consumed the tooling would mark it as
> so... Have to think about it more tho :-\

The above does not solve the problem for live tools where the event 
actually lies in the ring buffer and could be overwritten while it sits 
in the session ordered samples queue. The kvm-stat-live tool for nested 
virtualization is a stress test for this case. I need to come back to 
that problem, it just has not bubbled up to the top.

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf session: Add option to copy events when queueing
  2013-10-24 13:12         ` David Ahern
@ 2013-10-24 14:07           ` Arnaldo Melo
  2013-10-25 16:04             ` David Ahern
  0 siblings, 1 reply; 14+ messages in thread
From: Arnaldo Melo @ 2013-10-24 14:07 UTC (permalink / raw)
  To: David Ahern
  Cc: Frederic Weisbecker, LKML, Ingo Molnar, Jiri Olsa, Mike Galbraith,
	Namhyung Kim, Peter Zijlstra, Stephane Eranian

Em Thu, Oct 24, 2013 at 02:12:29PM +0100, David Ahern escreveu:
> On 10/24/13 1:27 PM, Arnaldo Melo wrote:

> The driving use case is my perf-daemon:
> https://github.com/dsahern/linux/blob/perf-sched-timehist-3.11/tools/perf/schedmon.c,
> line 271.
> 
> Rather than have the perf infrastructure manage the allocation and
> copies I decided to have the daemon do it. The session
> infrastructure is only used for time sorting.
> 
> This works better because the event that pops out of the session
> ordering is put onto another list and it gives the daemon control of
> when memory is allocated and freed and the event only has to be
> copied once.

Ok, I see, but that could be useful for other tools, so at some point
what ends up being the best for your tool can be submitted upstream 8-)
 
> >
> >I had this feeling that perhaps we could defer copying the event till it
> >would be overwritten, something like making a range read only and then
> >when the event would be _really_ consumed the tooling would mark it as
> >so... Have to think about it more tho :-\
> 
> The above does not solve the problem for live tools where the event
> actually lies in the ring buffer and could be overwritten while it
> sits in the session ordered samples queue. The kvm-stat-live tool
> for nested virtualization is a stress test for this case. I need to
> come back to that problem, it just has not bubbled up to the top.

What I thought was since we have to allocate memory _somewhere_ we could
size the ring buffer in such a way that it would be big enough for us to
leave those referenced entries there, not consumed, till we actually
consume them when reordering the events, and if we run out of space, the
kernel would just use LOST records, and not overwrite anything.

- Arnaldo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf session: Add option to copy events when queueing
  2013-10-24 14:07           ` Arnaldo Melo
@ 2013-10-25 16:04             ` David Ahern
  0 siblings, 0 replies; 14+ messages in thread
From: David Ahern @ 2013-10-25 16:04 UTC (permalink / raw)
  To: Arnaldo Melo
  Cc: Frederic Weisbecker, LKML, Ingo Molnar, Jiri Olsa, Mike Galbraith,
	Namhyung Kim, Peter Zijlstra, Stephane Eranian

On 10/24/13 8:07 AM, Arnaldo Melo wrote:
> What I thought was since we have to allocate memory _somewhere_ we could
> size the ring buffer in such a way that it would be big enough for us to
> leave those referenced entries there, not consumed, till we actually
> consume them when reordering the events, and if we run out of space, the
> kernel would just use LOST records, and not overwrite anything.

The tail pointer should only be moved up once all the events have been 
flushed from the ordered samples queue. Right now the tail is moved up 
when the event is read from the mmap (the patch in the other email 
thread withstanding). The problem is that we do not have any linkage 
between the event and the mmap that it came from; that information is 
lost when the event is added to the ordered samples queue.

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2013-10-25 16:04 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-06 19:37 [PATCH] perf session: Add option to copy events when queueing David Ahern
2013-09-14 16:16 ` Frederic Weisbecker
2013-09-14 17:25   ` David Ahern
2013-09-16 16:40     ` Frederic Weisbecker
2013-09-17  4:14       ` David Ahern
2013-10-24  9:30   ` Frederic Weisbecker
2013-10-24 10:23     ` David Ahern
2013-10-24 12:27       ` Arnaldo Melo
2013-10-24 13:12         ` David Ahern
2013-10-24 14:07           ` Arnaldo Melo
2013-10-25 16:04             ` David Ahern
2013-10-02 12:18 ` Jiri Olsa
2013-10-02 12:38   ` Jiri Olsa
2013-10-02 12:52     ` David Ahern

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).