* [PATCH 0/2] perf: User/kernel time correlation and event generation
From: Pawel Moll @ 2014-09-18 14:34 UTC
To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz
Cc: linux-kernel, linux-api, Pawel Moll
Greetings,
This is a second spin of the short series posted last week:
http://www.spinics.net/lists/kernel/msg1824419.html
The first patch adds an additional timestamp field to the perf
sample data, which can be requested for any perf event along
with the normal PERF_SAMPLE_TIME. Events with both values appearing
periodically in the perf data allow user code to translate
raw monotonic time (obtained via the POSIX clock API) to the sched_clock
domain. Although any perf event can be used, the natural choice
would be a sched_switch trace event (for processes with root
permissions) or a hrtimer-based PERF_COUNT_SW_CPU_CLOCK.
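A minimal userspace sketch of obtaining such raw monotonic timestamps, assuming a Linux/glibc system: clock_gettime(CLOCK_MONOTONIC_RAW) is read and converted to nanoseconds with the same arithmetic as the kernel's timespec_to_ns(), which the first patch applies to the getrawmonotonic() value it stores in the sample. The helper names ts_to_ns() and raw_mono_ns() are hypothetical.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

/* Convert a timespec to nanoseconds (same arithmetic as timespec_to_ns()). */
static uint64_t ts_to_ns(const struct timespec *ts)
{
	return (uint64_t)ts->tv_sec * 1000000000ULL + (uint64_t)ts->tv_nsec;
}

/* Read CLOCK_MONOTONIC_RAW in nanoseconds; returns 0 on failure. */
static uint64_t raw_mono_ns(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts) != 0)
		return 0;
	return ts_to_ns(&ts);
}
```

Userspace would stamp its own records with raw_mono_ns() so that they share a time base with the proposed clock_raw_monotonic sample field.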
It didn't attract any comments previously, so it is re-posted
without any changes.
The second patch, functionally orthogonal but complementing
the first one, builds on the ftrace "trace_marker" idea. It adds
an ioctl that can be used to inject userspace-generated data
into the perf buffer. It provides a base for printf-like
functionality in the perf world. Used together with the previous
patch, it can also provide synchronisation points for correlating
sched_clock and raw monotonic timestamps.
The first version of the patch took a zero-terminated string
as an argument. Now it takes a custom structure with "type"
and "size" integer fields followed by data. Type value "0"
is defined as a zero-terminated string (although the size, including
the terminating NUL character, must still be provided), but the meaning
of data for other types is of no interest to the kernel. The intention
is to host a list of "well known" types (with reference parsers
for them) in the user perf tool code.
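To illustrate the argument layout described above, here is a sketch that builds such a record for an arbitrary type. The proposed struct perf_event_userspace is not in mainline headers, so it is mirrored locally as a hypothetical struct uevent; the ioctl call itself is omitted, and the type value 0x42 used in the usage note is an arbitrary user-defined value, not one defined by the patch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical local mirror of the proposed struct perf_event_userspace */
struct uevent {
	uint32_t type;	/* 0: zero-terminated string; others: user-defined */
	uint32_t size;	/* bytes in data[] (including the NUL for type 0) */
	uint8_t data[];	/* payload follows the 8-byte header */
};

/* Allocate a record carrying 'size' bytes of 'payload'; caller frees it. */
static struct uevent *uevent_new(uint32_t type, const void *payload,
				 uint32_t size)
{
	struct uevent *ev = malloc(sizeof(*ev) + size);

	if (!ev)
		return NULL;
	ev->type = type;
	ev->size = size;
	if (size)
		memcpy(ev->data, payload, size);
	return ev;
}
```

For example, uevent_new(0x42, &ts, sizeof(ts)) would wrap a 64-bit timestamp in a user-defined record, while uevent_new(0, "hi", 3) builds a type 0 string whose size counts the NUL, as the kernel side expects.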
Pawel Moll (2):
perf: Add sampling of the raw monotonic clock
perf: Userspace software event and ioctl
include/linux/perf_event.h | 10 +++++
include/uapi/linux/perf_event.h | 36 +++++++++++++++++-
kernel/events/core.c | 81 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 126 insertions(+), 1 deletion(-)
--
1.9.1
* [PATCH 1/2] perf: Add sampling of the raw monotonic clock
From: Pawel Moll @ 2014-09-18 14:34 UTC
To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz
Cc: linux-kernel, linux-api, Pawel Moll
This patch adds an option to sample the raw monotonic clock
value with any perf event, with the aim of allowing
time correlation between data coming from perf and
additional performance-related information generated in
userspace.
In order to correlate timestamps in the perf data stream
with events happening in userspace (be it JITed debug
symbols or hwmon-originating environment data), the user
requests a more or less periodic event (a sched_switch
trace event or a hrtimer-based cpu-clock being the
most obvious examples) with PERF_SAMPLE_TIME *and*
PERF_SAMPLE_CLOCK_RAW_MONOTONIC, and stamps
user-originating data with values obtained from
clock_gettime(CLOCK_MONOTONIC_RAW). Then, during
analysis, one looks at the perf events immediately
preceding and following (in terms of the
clock_raw_monotonic sample) the userspace event and
does a simple linear interpolation to get the equivalent
perf time.
perf event     user event
-----O--------------+-------------O------> t_mono
     :              |             :
     :              V             :
-----O----------------------------O------> t_perf
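The interpolation step described above might look like the following sketch. It assumes the analysis tool has already extracted, from each sample, the PERF_SAMPLE_TIME value and the proposed clock_raw_monotonic value into an array sorted by raw monotonic time, and that perf time also increases along that array. struct time_pair and mono_to_perf() are hypothetical names, and double-precision arithmetic is used for brevity rather than exact integer math.

```c
#include <stddef.h>
#include <stdint.h>

/* One perf sample carrying both timestamps (hypothetical field names) */
struct time_pair {
	uint64_t mono;	/* clock_raw_monotonic sample value, ns */
	uint64_t perf;	/* PERF_SAMPLE_TIME value, ns */
};

/*
 * Map a userspace raw monotonic timestamp to perf time by linear
 * interpolation between the samples immediately preceding and following
 * it. 'pairs' must be sorted by 'mono'. Returns 0 on success, or -1 if
 * there are fewer than two samples or t_mono is outside their range.
 */
static int mono_to_perf(const struct time_pair *pairs, size_t n,
			uint64_t t_mono, uint64_t *perf_out)
{
	size_t lo, hi;
	const struct time_pair *a, *b;
	double frac;

	if (n < 2 || t_mono < pairs[0].mono || t_mono > pairs[n - 1].mono)
		return -1;
	/* Binary search keeping pairs[lo].mono <= t_mono <= pairs[hi].mono */
	lo = 0;
	hi = n - 1;
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (pairs[mid].mono <= t_mono)
			lo = mid;
		else
			hi = mid;
	}
	a = &pairs[lo];
	b = &pairs[hi];
	if (b->mono == a->mono) {
		*perf_out = a->perf;
		return 0;
	}
	/* perf = a.perf + (t - a.mono) * (b.perf - a.perf) / (b.mono - a.mono) */
	frac = (double)(t_mono - a->mono) / (double)(b->mono - a->mono);
	*perf_out = a->perf + (uint64_t)(frac * (double)(b->perf - a->perf));
	return 0;
}
```

With samples {1000, 5000} and {2000, 7000}, a user event stamped at raw monotonic time 1500 maps to perf time 6000.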
Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
include/linux/perf_event.h | 2 ++
include/uapi/linux/perf_event.h | 4 +++-
kernel/events/core.c | 13 +++++++++++++
3 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 707617a..28b73b2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -602,6 +602,8 @@ struct perf_sample_data {
* Transaction flags for abort events:
*/
u64 txn;
+ /* Raw monotonic timestamp, for userspace time correlation */
+ u64 clock_raw_monotonic;
};
static inline void perf_sample_data_init(struct perf_sample_data *data,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9269de2..e5a75c5 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -137,8 +137,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_DATA_SRC = 1U << 15,
PERF_SAMPLE_IDENTIFIER = 1U << 16,
PERF_SAMPLE_TRANSACTION = 1U << 17,
+ PERF_SAMPLE_CLOCK_RAW_MONOTONIC = 1U << 18,
- PERF_SAMPLE_MAX = 1U << 18, /* non-ABI */
+ PERF_SAMPLE_MAX = 1U << 19, /* non-ABI */
};
/*
@@ -686,6 +687,7 @@ enum perf_event_type {
* { u64 weight; } && PERF_SAMPLE_WEIGHT
* { u64 data_src; } && PERF_SAMPLE_DATA_SRC
* { u64 transaction; } && PERF_SAMPLE_TRANSACTION
+ * { u64 clock_raw_monotonic; } && PERF_SAMPLE_CLOCK_RAW_MONOTONIC
* };
*/
PERF_RECORD_SAMPLE = 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f9c1ed0..f6df547 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1216,6 +1216,9 @@ static void perf_event__header_size(struct perf_event *event)
if (sample_type & PERF_SAMPLE_TRANSACTION)
size += sizeof(data->txn);
+ if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC)
+ size += sizeof(data->clock_raw_monotonic);
+
event->header_size = size;
}
@@ -4456,6 +4459,13 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
data->cpu_entry.cpu = raw_smp_processor_id();
data->cpu_entry.reserved = 0;
}
+
+ if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC) {
+ struct timespec now;
+
+ getrawmonotonic(&now);
+ data->clock_raw_monotonic = timespec_to_ns(&now);
+ }
}
void perf_event_header__init_id(struct perf_event_header *header,
@@ -4714,6 +4724,9 @@ void perf_output_sample(struct perf_output_handle *handle,
if (sample_type & PERF_SAMPLE_TRANSACTION)
perf_output_put(handle, data->txn);
+ if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC)
+ perf_output_put(handle, data->clock_raw_monotonic);
+
if (!event->attr.watermark) {
int wakeup_events = event->attr.wakeup_events;
* [PATCH 2/2] perf: Userspace software event and ioctl
From: Pawel Moll @ 2014-09-18 14:34 UTC
To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz
Cc: linux-kernel, linux-api, Pawel Moll
This patch adds a PERF_COUNT_SW_USERSPACE_EVENT type,
which can be generated by the user with the PERF_EVENT_IOC_USERSPACE
ioctl command; the ioctl injects an event of this type into
the perf buffer.
The ioctl takes a pointer to struct perf_event_userspace
as an argument. The structure begins with a 32-bit
integer type value, which determines the meaning of the
following content (size/data pair). Type 0 is defined
as a zero-terminated string; other types are defined by
userspace (the perf tool will contain a list of
known values with reference implementations of data
content parsers).
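A reader of the perf buffer would decode the PERF_SAMPLE_USERSPACE_EVENT part of a sample according to the layout documented in the uapi header change below: u32 type, u32 size, size bytes of data, then padding to an 8-byte boundary, for 8 + ALIGN(size, 8) bytes in total. The sketch assumes host byte order (as the kernel writes the ring buffer) and that the caller has already located the field within the sample; struct uevent_view and uevent_parse() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct uevent_view {
	uint32_t type;
	uint32_t size;
	const uint8_t *data;	/* points into the caller's buffer */
};

/* Round n up to a multiple of 8, like ALIGN(n, sizeof(u64)) */
static size_t align8(size_t n)
{
	return (n + 7) & ~(size_t)7;
}

/*
 * Decode one user event starting at 'buf' with 'len' bytes available.
 * Returns the number of bytes consumed (8 + ALIGN(size, 8)), or 0 if the
 * buffer is too short or a type 0 payload is not NUL-terminated.
 */
static size_t uevent_parse(const uint8_t *buf, size_t len,
			   struct uevent_view *out)
{
	size_t total;

	if (len < 8)
		return 0;
	memcpy(&out->type, buf, 4);	/* memcpy avoids unaligned loads */
	memcpy(&out->size, buf + 4, 4);
	total = 8 + align8(out->size);
	if (total > len)
		return 0;
	out->data = buf + 8;
	if (out->type == 0 &&
	    (out->size == 0 || out->data[out->size - 1] != '\0'))
		return 0;
	return total;
}
```

A 6-byte string such as "hello" plus its NUL therefore occupies 16 bytes in the sample: 8 bytes of header, 6 bytes of data and 2 bytes of padding.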
Possible use cases for this feature:
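- a "perf_printf"-like mechanism to add logging messages
to one's perf session; an example implementation:
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>	/* with this series applied */
int perf_printf(int perf_fd, const char *fmt, ...)
{
	struct perf_event_userspace *event;
	va_list ap, ap2;
	int size;
	int err;
	va_start(ap, fmt);
	/* vsnprintf() consumes ap, so keep a copy for the second pass */
	va_copy(ap2, ap);
	size = vsnprintf(NULL, 0, fmt, ap) + 1;	/* +1 for the NUL */
	va_end(ap);
	if (size <= 0) {	/* vsnprintf() failed */
		va_end(ap2);
		return -1;
	}
	event = malloc(sizeof(*event) + size);
	if (!event) {
		va_end(ap2);
		return -1;
	}
	event->type = 0;	/* type 0: zero-terminated string */
	event->size = size;	/* includes the terminating NUL */
	vsnprintf((char *)event->data, size, fmt, ap2);
	va_end(ap2);
	err = ioctl(perf_fd, PERF_EVENT_IOC_USERSPACE, event);
	free(event);
	return err < 0 ? err : size - 1;
}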
- "perf_printf" used by the perf trace tool,
where certain calls made by a traced process are intercepted
(e.g. using LD_PRELOAD) and treated as logging
requests, with their output redirected into the
perf buffer
- synchronisation of performance data generated in
user space with the perf stream coming from the kernel.
For example, the marker can be inserted by a JIT engine
after it has generated a portion of code, but before the
code is executed for the first time, allowing the
post-processor to pick the correct debugging
information.
- another example is a system profiling tool taking data
from sources other than perf, which generates a marker
at the beginning and at the end of the session
(and possibly periodically during the session) to
synchronise kernel timestamps with clock values
obtained in userspace (gtod or raw_monotonic).
Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
include/linux/perf_event.h | 8 +++++
include/uapi/linux/perf_event.h | 34 ++++++++++++++++++++-
kernel/events/core.c | 68 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 109 insertions(+), 1 deletion(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 28b73b2..d904d31 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -64,6 +64,12 @@ struct perf_raw_record {
void *data;
};
+struct perf_userspace_entry {
+ u32 type;
+ u32 size;
+ u8 data[0];
+};
+
/*
* branch stack layout:
* nr: number of taken branches stored in entries[]
@@ -604,6 +610,8 @@ struct perf_sample_data {
u64 txn;
/* Raw monotonic timestamp, for userspace time correlation */
u64 clock_raw_monotonic;
+ /* Userspace-originating event */
+ struct perf_userspace_entry *user_entry;
};
static inline void perf_sample_data_init(struct perf_sample_data *data,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index e5a75c5..37604ae 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -110,6 +110,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_ALIGNMENT_FAULTS = 7,
PERF_COUNT_SW_EMULATION_FAULTS = 8,
PERF_COUNT_SW_DUMMY = 9,
+ PERF_COUNT_SW_USERSPACE_EVENT = 10,
PERF_COUNT_SW_MAX, /* non-ABI */
};
@@ -138,8 +139,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_IDENTIFIER = 1U << 16,
PERF_SAMPLE_TRANSACTION = 1U << 17,
PERF_SAMPLE_CLOCK_RAW_MONOTONIC = 1U << 18,
+ PERF_SAMPLE_USERSPACE_EVENT = 1U << 19,
- PERF_SAMPLE_MAX = 1U << 19, /* non-ABI */
+ PERF_SAMPLE_MAX = 1U << 20, /* non-ABI */
};
/*
@@ -337,6 +339,15 @@ struct perf_event_attr {
__u32 __reserved_2;
};
+/*
+ * Userspace-originating event to be generated with PERF_EVENT_IOC_USERSPACE
+ */
+struct perf_event_userspace {
+ __u32 type;
+ __u32 size;
+ __u8 data[0];
+};
+
#define perf_flags(attr) (*(&(attr)->read_format + 1))
/*
@@ -350,6 +361,8 @@ struct perf_event_attr {
#define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
#define PERF_EVENT_IOC_ID _IOR('$', 7, __u64 *)
+#define PERF_EVENT_IOC_USERSPACE _IOR('$', 8, \
+ struct perf_event_userspace *)
enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
@@ -688,6 +701,25 @@ enum perf_event_type {
* { u64 data_src; } && PERF_SAMPLE_DATA_SRC
* { u64 transaction; } && PERF_SAMPLE_TRANSACTION
* { u64 clock_raw_monotonic; } && PERF_SAMPLE_CLOCK_RAW_MONOTONIC
+ *
+ * #
+ * # Contents of USERSPACE_EVENT sample data depend on its type.
+ * #
+ * # Type 0 means that the data is a zero-terminated string that
+ * # can be printf-ed in the normal way.
+ * #
+ * # Meaning of other type values depends on the userspace
+ * # and the perf tool code contains a list of those with
+ * # reference implementations of parsers.
+ * #
+ * # Overall size of the sample (including type and size fields)
+ * # is always aligned to 8 bytes by adding padding after
+ * # the data.
+ * #
+ * { u32 type;
+ * u32 size;
+ * char data[size];
+ * char __padding[] } && PERF_SAMPLE_USERSPACE_EVENT
* };
*/
PERF_RECORD_SAMPLE = 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f6df547..11bf1be 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3655,6 +3655,8 @@ static inline int perf_fget_light(int fd, struct fd *p)
static int perf_event_set_output(struct perf_event *event,
struct perf_event *output_event);
static int perf_event_set_filter(struct perf_event *event, void __user *arg);
+static int perf_sw_userspace_entry(struct perf_event *event,
+ struct perf_event_userspace __user *arg);
static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
@@ -3709,6 +3711,10 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
case PERF_EVENT_IOC_SET_FILTER:
return perf_event_set_filter(event, (void __user *)arg);
+ case PERF_EVENT_IOC_USERSPACE:
+ return perf_sw_userspace_entry(event,
+ (struct perf_event_userspace __user *)arg);
+
default:
return -ENOTTY;
}
@@ -3728,6 +3734,7 @@ static long perf_compat_ioctl(struct file *file, unsigned int cmd,
switch (_IOC_NR(cmd)) {
case _IOC_NR(PERF_EVENT_IOC_SET_FILTER):
case _IOC_NR(PERF_EVENT_IOC_ID):
+ case _IOC_NR(PERF_EVENT_IOC_USERSPACE):
/* Fix up pointer size (usually 4 -> 8 in 32-on-64-bit case */
if (_IOC_SIZE(cmd) == sizeof(compat_uptr_t)) {
cmd &= ~IOCSIZE_MASK;
@@ -4727,6 +4734,16 @@ void perf_output_sample(struct perf_output_handle *handle,
if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC)
perf_output_put(handle, data->clock_raw_monotonic);
+ if (sample_type & PERF_SAMPLE_USERSPACE_EVENT) {
+ int size = data->user_entry->size;
+ int padding = ALIGN(size, sizeof(u64)) - size;
+
+ perf_output_put(handle, data->user_entry->type);
+ perf_output_put(handle, size);
+ __output_copy(handle, data->user_entry->data, size);
+ perf_output_skip(handle, padding);
+ };
+
if (!event->attr.watermark) {
int wakeup_events = event->attr.wakeup_events;
@@ -4834,6 +4851,24 @@ void perf_prepare_sample(struct perf_event_header *header,
data->stack_user_size = stack_size;
header->size += size;
}
+
+ if (sample_type & PERF_SAMPLE_USERSPACE_EVENT) {
+ int size = data->user_entry->size;
+
+ /*
+ * Type 0 means zero-terminated string;
+ * make sure it is terminated
+ */
+ if (!data->user_entry->type)
+ data->user_entry->data[size - 1] = '\0';
+
+ /*
+ * The sample consist of 'type' and 'size' u32 fields
+ * followed with data and padding aligning it to 8 bytes.
+ */
+ header->size += sizeof(u32) + sizeof(u32) +
+ ALIGN(size, sizeof(u64));
+ }
}
static void perf_event_output(struct perf_event *event,
@@ -5961,6 +5996,39 @@ static struct pmu perf_swevent = {
.event_idx = perf_swevent_event_idx,
};
+static int perf_sw_userspace_entry(struct perf_event *event,
+ struct perf_event_userspace __user *arg)
+{
+ u32 size;
+ struct perf_sample_data data;
+ struct pt_regs *regs = current_pt_regs();
+ struct perf_userspace_entry *entry;
+
+ if (!arg)
+ return -EINVAL;
+
+ if (!static_key_false(&perf_swevent_enabled[
+ PERF_COUNT_SW_USERSPACE_EVENT]))
+ return 0;
+
+ BUILD_BUG_ON(sizeof(size) != sizeof(arg->size));
+ if (copy_from_user(&size, &arg->size, sizeof(size)) != 0)
+ return -EFAULT;
+
+ BUILD_BUG_ON(sizeof(*arg) != sizeof(*entry));
+ entry = memdup_user(arg, sizeof(*arg) + size);
+ if (IS_ERR(entry))
+ return PTR_ERR(entry);
+
+ perf_sample_data_init(&data, 0, 0);
+ data.user_entry = entry;
+ perf_event_output(event, &data, regs);
+
+ kfree(entry);
+
+ return 0;
+}
+
#ifdef CONFIG_EVENT_TRACING
static int perf_tp_filter_match(struct perf_event *event,
* Re: [PATCH 0/2] perf: User/kernel time correlation and event generation
From: Christopher Covington @ 2014-09-18 15:02 UTC
To: Pawel Moll
Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
linux-kernel@vger.kernel.org,
linux-api@vger.kernel.org, Michael Kerrisk
Hi Pawel,
On 09/18/2014 10:34 AM, Pawel Moll wrote:
> [...]
Would it be possible for you to also update the corresponding man pages?
https://www.kernel.org/doc/man-pages/
Thanks,
Christopher
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation.
* Re: [PATCH 0/2] perf: User/kernel time correlation and event generation
From: Pawel Moll @ 2014-09-18 15:07 UTC
To: Christopher Covington
Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
linux-kernel@vger.kernel.org,
linux-api@vger.kernel.org,
Michael Kerrisk
On Thu, 2014-09-18 at 16:02 +0100, Christopher Covington wrote:
> Hi Pawel,
>
> On 09/18/2014 10:34 AM, Pawel Moll wrote:
> > [...]
>
> Would it be possible for you to also update the corresponding man pages?
>
> https://www.kernel.org/doc/man-pages/
I must admit I haven't thought of that, but of course - if the changes
are accepted I'll send patches to the perf_event_open(2) man page. Any
others you had in mind?
Pawel
* Re: [PATCH 0/2] perf: User/kernel time correlation and event generation
From: Christopher Covington @ 2014-09-18 15:48 UTC
To: Pawel Moll
Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
linux-kernel@vger.kernel.org,
linux-api@vger.kernel.org,
Michael Kerrisk, Vince Weaver
On 09/18/2014 11:07 AM, Pawel Moll wrote:
> On Thu, 2014-09-18 at 16:02 +0100, Christopher Covington wrote:
>> Hi Pawel,
>>
>> On 09/18/2014 10:34 AM, Pawel Moll wrote:
>>> [...]
>>
>> Would it be possible for you to also update the corresponding man pages?
>>
>> https://www.kernel.org/doc/man-pages/
>
> I must admit I haven't thought of that, but of course - if the changes
> are accepted I'll send patches to the perf_event_open(2) man page. Any
> others you had in mind?
Nope--reading that page and trying out examples is pretty much how I learned
to use perf events. Another great Vince Weaver perf events contribution is the
test suite, if you're not already using it.
https://github.com/deater/perf_event_tests
Christopher
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
From: Pawel Moll @ 2014-09-23 17:02 UTC
To: Ingo Molnar, Arnaldo Carvalho de Melo
Cc: Richard Cochran, Steven Rostedt, Peter Zijlstra, Paul Mackerras,
John Stultz, linux-kernel@vger.kernel.org,
linux-api@vger.kernel.org
On Thu, 2014-09-18 at 15:34 +0100, Pawel Moll wrote:
> This patch adds a PERF_COUNT_SW_USERSPACE_EVENT type,
> which can be generated by the user with the PERF_EVENT_IOC_USERSPACE
> ioctl command; the ioctl injects an event of this type into
> the perf buffer.
It occurred to me last night that perf currently doesn't handle the
write() syscall at all, while this seems like the most natural way of
"injecting" userspace events into the perf buffer.
An ioctl would still be needed to set the type of subsequent events,
something like:
ioctl(perf_fd, SET_TYPE, 0x42);
write(perf_fd, binaryblob, size);
ioctl(perf_fd, SET_TYPE, 0);
dprintf(perf_fd, "String");
which is fine for use cases where the type doesn't change often, but
would double the number of syscalls when every single event is of a
different type. Perhaps there should still be a "generating ioctl"
taking both type and data/size in one go?
Anyway, I'll post a series showing this solution in a second.
As always, feedback is more than welcome.
Pawel
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
From: Ingo Molnar @ 2014-09-24 7:49 UTC
To: Pawel Moll
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Richard Cochran,
Steven Rostedt, Peter Zijlstra, Paul Mackerras, John Stultz,
linux-kernel@vger.kernel.org,
linux-api@vger.kernel.org
* Pawel Moll <pawel.moll@arm.com> wrote:
> On Thu, 2014-09-18 at 15:34 +0100, Pawel Moll wrote:
> > This patch adds a PERF_COUNT_SW_USERSPACE_EVENT type,
> > which can be generated by the user with the PERF_EVENT_IOC_USERSPACE
> > ioctl command; the ioctl injects an event of this type into
> > the perf buffer.
>
> It occurred to me last night that perf currently doesn't handle the
> write() syscall at all, while this seems like the most natural way of
> "injecting" userspace events into the perf buffer.
>
> An ioctl would still be needed to set the type of subsequent events,
> something like:
>
> ioctl(perf_fd, SET_TYPE, 0x42);
> write(perf_fd, binaryblob, size);
> ioctl(perf_fd, SET_TYPE, 0);
> dprintf(perf_fd, "String");
>
> which is fine for use cases where the type doesn't change often,
> but would double the number of syscalls when every single event
> is of a different type. Perhaps there should still be a
> "generating ioctl" taking both type and data/size in one go?
Absolutely, there should be a single syscall.
I'd even argue it should be a new prctl(): that way we could both
generate user events for specific perf fds, but also into any
currently active context (that allows just generation/injection
of user events). In the latter case we might have no fd to work
off from.
And that is actually the really exciting use case of your patches:
we could generate user events via simple commands, and any
external profiler/tracer would be able to see them.
Thanks,
Ingo
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
From: Pawel Moll @ 2014-09-25 17:20 UTC
To: Ingo Molnar
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Richard Cochran,
Steven Rostedt, Peter Zijlstra, Paul Mackerras, John Stultz,
linux-kernel@vger.kernel.org,
linux-api@vger.kernel.org
On Wed, 2014-09-24 at 08:49 +0100, Ingo Molnar wrote:
> * Pawel Moll <pawel.moll@arm.com> wrote:
>
> > [...]
>
> Absolutely, there should be a single syscall.
Yeah, it's my gut feeling as well. I just wonder whether we still want to
keep a write() handler for perf fds? It seems natural, as it
takes a data buffer and its size. The only issue is the type.
> I'd even argue it should be a new prctl(): that way we could both
> generate user events for specific perf fds, but also into any
> currently active context (that allows just generation/injection
> of user events). In the latter case we might have no fd to work
> off from.
When Arnaldo suggested that the "user events" could be used by perf
trace, it was exactly my first thought. I just didn't have an answer for how to
present it to the user (an extra syscall didn't seem like a good idea),
but prctl seems interesting, something like this?
prctl(PR_TRACE_UEVENT, type, size, data, 0);
How would we select the tasks that can write to a given buffer? Maybe
with an ioctl() on a perf fd, something like this?
ioctl(perf_fd, PERF_EVENT_IOC_ENABLE_UEVENT, pid);
ioctl(perf_fd, PERF_EVENT_IOC_DISABLE_UEVENT, pid);
It could set/clear a flag in the target task's task_struct (but probably
not in the "normal" flags, as they are only supposed to be set by the owner
and in the ptrace/fork case) and a pointer to the task in perf_event(_context).
Or maybe some variation on ptrace would be more appropriate? This would
also solve the issue of permission checking (if the profiling tool can
ptrace the process, it can also enable/disable its uevent generation
capability).
Paweł
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
From: Ingo Molnar @ 2014-09-25 18:33 UTC
To: Pawel Moll
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Richard Cochran,
Steven Rostedt, Peter Zijlstra, Paul Mackerras, John Stultz,
linux-kernel@vger.kernel.org, linux-api@vger.kernel.org
* Pawel Moll <pawel.moll@arm.com> wrote:
> On Wed, 2014-09-24 at 08:49 +0100, Ingo Molnar wrote:
> > * Pawel Moll <pawel.moll@arm.com> wrote:
> >
> > > On Thu, 2014-09-18 at 15:34 +0100, Pawel Moll wrote:
> > > > This patch adds a PERF_COUNT_SW_USERSPACE_EVENT type,
> > > > which can be generated by user with PERF_EVENT_IOC_ENTRY
> > > > ioctl command, which injects an event of said type into
> > > > the perf buffer.
> > >
> > > It occurred to me last night that currently perf doesn't handle "write"
> > > syscall at all, while this seems like the most natural way of
> > > "injecting" userspace events into perf buffer.
> > >
> > > An ioctl would still be needed to set a type of the following events,
> > > something like:
> > >
> > > ioctl(SET_TYPE, 0x42);
> > > write(perf_fd, binaryblob, size);
> > > ioctl(SET_TYPE, 0);
> > > dprintf(perf_fd, "String");
> > >
> > > which is fine for use cases when the type doesn't change often,
> > > but would double the amount of syscalls when every single event
> > > is of a different type. Perhaps there still should be a
> > > "generating ioctl" taking both type and data/size in one go?
> >
> > Absolutely, there should be a single syscall.
>
> Yeah, it's my gut feeling as well. I just wonder if we still want to
> keep write() handler for operations on perf fds? This seems natural -
> takes data buffer and its size. The only issue is the type.
>
> > I'd even argue it should be a new prctl(): that way we could both
> > generate user events for specific perf fds, but also into any
> > currently active context (that allows just generation/injection
> > of user events). In the latter case we might have no fd to work
> > off from.
>
> When Arnaldo suggested that the "user events" could be used by perf
> trace, it was exactly my first thought. I just didn't have answer how to
> present it to the user (an extra syscall didn't seem like a good idea),
> but prctl seems interesting, something like this?
>
> prctl(PR_TRACE_UEVENT, type, size, data, 0);
Exactly!
> How would we select tasks that can write to a given buffer? Maybe an
> ioctl() on a perf fd? Something like this?
>
> ioctl(perf_fd, PERF_EVENT_IOC_ENABLE_UEVENT, pid);
> ioctl(perf_fd, PERF_EVENT_IOC_DISABLE_UEVENT, pid);
No, I think there's a simpler way: this should be a regular
perf_attr flag, which defaults to '0' (tasks cannot do this), but
which can be set to 1 if the profiler explicitly allows such
event injection.
perf-trace might want to set this flag by default.
I.e. whether user-events are allowed is controlled by the
profiling/tracing context, via the regular perf syscall. It would
propagate into the perf context, so it would be easy to check at
event generation time.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
[not found] ` <20140925183342.GB6854-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-09-26 10:48 ` Pawel Moll
2014-09-26 11:23 ` Ingo Molnar
0 siblings, 1 reply; 22+ messages in thread
From: Pawel Moll @ 2014-09-26 10:48 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Richard Cochran,
Steven Rostedt, Peter Zijlstra, Paul Mackerras, John Stultz,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On Thu, 2014-09-25 at 19:33 +0100, Ingo Molnar wrote:
> > How would we select tasks that can write to a given buffer? Maybe an
> > ioctl() on a perf fd? Something like this?
> >
> > ioctl(perf_fd, PERF_EVENT_IOC_ENABLE_UEVENT, pid);
> > ioctl(perf_fd, PERF_EVENT_IOC_DISABLE_UEVENT, pid);
>
> No, I think there's a simpler way: this should be a regular
> perf_attr flag, which defaults to '0' (tasks cannot do this), but
> which can be set to 1 if the profiler explicitly allows such
> event injection.
As in: allows *all* tasks to inject the data? Are you sure we don't want
more fine-grained control, in particular per task?
If we have two buffers, both created with the "injecting allowed" flag,
do we inject a given uevent into both of them?
> I.e. whether user-events are allowed is controlled by the
> profiling/tracing context, via the regular perf syscall. It would
> propagate into the perf context, so it would be easy to check at
> event generation time.
It would definitely be the profiling/tracing tools that decide whether
the injection is allowed, no question about that. I just feel that they
should be able to select the tasks that can do that, not just flip a big
switch saying "everyone is welcome". Another question: should a
non-root context be able to receive events from root processes? Wouldn't
it be a security hole (for example, it could be used as a kind of covert
channel)? Maybe we should do what ptrace does? As in: if a task can
ptrace another task, it can also receive uevents from it.
Pawel
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
2014-09-26 10:48 ` Pawel Moll
@ 2014-09-26 11:23 ` Ingo Molnar
[not found] ` <20140926112312.GB9870-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2014-09-26 11:23 UTC (permalink / raw)
To: Pawel Moll
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Richard Cochran,
Steven Rostedt, Peter Zijlstra, Paul Mackerras, John Stultz,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
* Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org> wrote:
> On Thu, 2014-09-25 at 19:33 +0100, Ingo Molnar wrote:
> > > How would we select tasks that can write to a given buffer? Maybe an
> > > ioctl() on a perf fd? Something like this?
> > >
> > > ioctl(perf_fd, PERF_EVENT_IOC_ENABLE_UEVENT, pid);
> > > ioctl(perf_fd, PERF_EVENT_IOC_DISABLE_UEVENT, pid);
> >
> > No, I think there's a simpler way: this should be a regular
> > perf_attr flag, which defaults to '0' (tasks cannot do this),
> > but which can be set to 1 if the profiler explicitly allows
> > such event injection.
>
> As in: allows *all* tasks to inject the data? Are you sure we
> don't want more fine-grained control, in particular per task?
Yeah. If the profiler allows it, then any task that is being
traced can inject data.
More fine-grained control might be useful if there's a
justification for it, but only once this basic, most useful model
is implemented. Too fine-grained control will just make it
unusable. So please keep it simple and useful.
> If we have two buffers, both created with the "injecting
> allowed" flag, do we inject a given uevent into both of them?
Yes, and that semantics is desired: if I run two globally tracing
apps, independent of each other, both ought to get the events if
they ask for them.
> > I.e. whether user-events are allowed is controlled by the
> > profiling/tracing context, via the regular perf syscall. It
> > would propagate into the perf context, so it would be easy to
> > check at event generation time.
>
> It would definitely be the profiling/tracing tools that would
> decide if the injection is allowed, no question about that. I
> just feel that it should be able to select the tasks that can
> do that, not just flip a big switch saying "everyone is
> welcome". [...]
But that's the point: our main problem right now is too little
data (not enough apps generating interesting events), not too
much data.
So let's concentrate on the task of getting events to us as easily
as possible first. If in the far future we are overwhelmed with
events, and tools want to do some filtering on them, by all means
we can implement it - but don't impose it straight away.
> [...] Other question is: should a non-root context be able to
> receive events from root processes? Wouldn't it be a security
> hole (for example, it could be used as a kind of covert
> channel)? Maybe we should do what ptrace does? As in: if a task
> can ptrace another task, it can also receive uevents from it.
So, by default a non-root context will not be able to
profile/trace a root-owned task already; it cannot generate per-CPU
events, for example. So this is already handled at event/buffer
creation time. Plus, if a task gains privilege (via suid exec),
then we already zap its perf context IIRC.
This should be double-checked, but the important part is to make it
as easy as possible for willing tracing apps. Let's worry about the
'too much data' case later, otherwise we _guarantee_ that this
interface won't take off and apps, tools and people won't use it,
ok?
Thanks,
Ingo
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
[not found] ` <20140926112312.GB9870-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-09-26 11:26 ` Pawel Moll
2014-09-26 11:31 ` Ingo Molnar
0 siblings, 1 reply; 22+ messages in thread
From: Pawel Moll @ 2014-09-26 11:26 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Richard Cochran,
Steven Rostedt, Peter Zijlstra, Paul Mackerras, John Stultz,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On Fri, 2014-09-26 at 12:23 +0100, Ingo Molnar wrote:
> > As in: allows *all* tasks to inject the data? Are you sure we
> > don't want more fine-grained control, in particular per task?
>
> Yeah. If the profiler allows it, then any task that is being
> traced can inject data.
The "that is being traced" fragment was the key here. I missed the fact
that perf trace already takes a list of pids, so we're not talking about
all tasks in the system. That should work.
Paweł
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
2014-09-26 11:26 ` Pawel Moll
@ 2014-09-26 11:31 ` Ingo Molnar
0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2014-09-26 11:31 UTC (permalink / raw)
To: Pawel Moll
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Richard Cochran,
Steven Rostedt, Peter Zijlstra, Paul Mackerras, John Stultz,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
* Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org> wrote:
> On Fri, 2014-09-26 at 12:23 +0100, Ingo Molnar wrote:
> > > As in: allows *all* tasks to inject the data? Are you sure we
> > > don't want more fine-grained control, in particular per task?
> >
> > Yeah. If the profiler allows it, then any task that is being
> > traced can inject data.
>
> The "that is being traced" fragment was the key here. I missed
> the fact that perf trace already takes a list of pids, so we're
> not talking about all tasks in the system. That should work.
Yeah, when we generate a user trace event, we should look at the
currently active perf contexts (per-CPU ones plus task ones), and
inject into those only.
This way we limit event generation to those buffers that are
actively interested in this task.
Thanks,
Ingo
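The delivery rule settled on above (an opt-in attribute bit that defaults to 0, and injection only into the per-CPU and task contexts currently active for the generating task, so two independent tracers that both opted in each get a copy) can be modelled in a few lines of user-space C. This is a sketch of the semantics only, not kernel code; struct model_ctx, its allow_uevents field and inject_uevent() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* HYPOTHETICAL model of a perf context, used only to illustrate semantics. */
struct model_ctx {
	bool allow_uevents;	/* opt-in attribute bit, defaults to false */
	bool is_task_ctx;	/* true: task context, false: per-CPU context */
	int pid;		/* task being traced (task contexts only) */
	int cpu;		/* CPU being monitored (per-CPU contexts only) */
	unsigned long delivered; /* events received by this context */
};

/* Deliver one user event generated by task 'pid' running on 'cpu'.
 * Every active context that opted in gets its own copy; returns how
 * many contexts received it. */
static size_t inject_uevent(struct model_ctx *ctxs, size_t n, int pid, int cpu)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++) {
		struct model_ctx *c = &ctxs[i];
		bool active = c->is_task_ctx ? c->pid == pid : c->cpu == cpu;

		if (active && c->allow_uevents) {
			c->delivered++;
			hits++;
		}
	}
	return hits;
}
```

Contexts tracing other tasks or CPUs never see the event, which is what keeps the "no explicit per-task permission" model contained.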
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
2014-09-25 18:33 ` Ingo Molnar
[not found] ` <20140925183342.GB6854-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-09-27 17:14 ` Frederic Weisbecker
[not found] ` <CAFTL4hy1d8twv2tGxc4EhCeDm7ApnH7SuK26W1yaekKhCrPMZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
1 sibling, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2014-09-27 17:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pawel Moll, Ingo Molnar, Arnaldo Carvalho de Melo,
Richard Cochran, Steven Rostedt, Peter Zijlstra, Paul Mackerras,
John Stultz, linux-kernel@vger.kernel.org,
linux-api@vger.kernel.org
2014-09-25 20:33 GMT+02:00 Ingo Molnar <mingo@kernel.org>:
>
> * Pawel Moll <pawel.moll@arm.com> wrote:
>
>> On Wed, 2014-09-24 at 08:49 +0100, Ingo Molnar wrote:
>> > * Pawel Moll <pawel.moll@arm.com> wrote:
>> >
>> > > On Thu, 2014-09-18 at 15:34 +0100, Pawel Moll wrote:
>> > > > This patch adds a PERF_COUNT_SW_USERSPACE_EVENT type,
>> > > > which can be generated by user with PERF_EVENT_IOC_ENTRY
>> > > > ioctl command, which injects an event of said type into
>> > > > the perf buffer.
>> > >
>> > > It occurred to me last night that currently perf doesn't handle "write"
>> > > syscall at all, while this seems like the most natural way of
>> > > "injecting" userspace events into perf buffer.
>> > >
>> > > An ioctl would still be needed to set a type of the following events,
>> > > something like:
>> > >
>> > > ioctl(SET_TYPE, 0x42);
>> > > write(perf_fd, binaryblob, size);
>> > > ioctl(SET_TYPE, 0);
>> > > dprintf(perf_fd, "String");
>> > >
>> > > which is fine for use cases when the type doesn't change often,
>> > > but would double the amount of syscalls when every single event
>> > > is of a different type. Perhaps there still should be a
>> > > "generating ioctl" taking both type and data/size in one go?
>> >
>> > Absolutely, there should be a single syscall.
>>
>> Yeah, it's my gut feeling as well. I just wonder if we still want to
>> keep write() handler for operations on perf fds? This seems natural -
>> takes data buffer and its size. The only issue is the type.
>>
>> > I'd even argue it should be a new prctl(): that way we could both
>> > generate user events for specific perf fds, but also into any
>> > currently active context (that allows just generation/injection
>> > of user events). In the latter case we might have no fd to work
>> > off from.
>>
>> When Arnaldo suggested that the "user events" could be used by perf
>> trace, it was exactly my first thought. I just didn't have answer how to
>> present it to the user (an extra syscall didn't seem like a good idea),
>> but prctl seems interesting, something like this?
>>
>> prctl(PR_TRACE_UEVENT, type, size, data, 0);
>
> Exactly!
>
>> How would we select tasks that can write to a given buffer? Maybe an
>> ioctl() on a perf fd? Something like this?
>>
>> ioctl(perf_fd, PERF_EVENT_IOC_ENABLE_UEVENT, pid);
>> ioctl(perf_fd, PERF_EVENT_IOC_DISABLE_UEVENT, pid);
>
> No, I think there's a simpler way: this should be a regular
> perf_attr flag, which defaults to '0' (tasks cannot do this), but
> which can be set to 1 if the profiler explicitly allows such
> event injection.
Maybe we just don't even need any permission at all. What harm can
it do if it only ever generates events for those interested in the
relevant perf context? It could be a simple tracepoint BTW.
Oh and I really like the fact that we don't use a syscall that requires an
fd. The tracee really shouldn't be aware of the tracer.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
[not found] ` <CAFTL4hy1d8twv2tGxc4EhCeDm7ApnH7SuK26W1yaekKhCrPMZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-09-29 14:52 ` Pawel Moll
0 siblings, 0 replies; 22+ messages in thread
From: Pawel Moll @ 2014-09-29 14:52 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Ingo Molnar, Ingo Molnar, Arnaldo Carvalho de Melo,
Richard Cochran, Steven Rostedt, Peter Zijlstra, Paul Mackerras,
John Stultz, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On Sat, 2014-09-27 at 18:14 +0100, Frederic Weisbecker wrote:
> 2014-09-25 20:33 GMT+02:00 Ingo Molnar <mingo-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>:
> >
> > * Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org> wrote:
> >
> >> On Wed, 2014-09-24 at 08:49 +0100, Ingo Molnar wrote:
> >> > * Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org> wrote:
> >> >
> >> > > On Thu, 2014-09-18 at 15:34 +0100, Pawel Moll wrote:
> >> > > > This patch adds a PERF_COUNT_SW_USERSPACE_EVENT type,
> >> > > > which can be generated by user with PERF_EVENT_IOC_ENTRY
> >> > > > ioctl command, which injects an event of said type into
> >> > > > the perf buffer.
> >> > >
> >> > > It occurred to me last night that currently perf doesn't handle "write"
> >> > > syscall at all, while this seems like the most natural way of
> >> > > "injecting" userspace events into perf buffer.
> >> > >
> >> > > An ioctl would still be needed to set a type of the following events,
> >> > > something like:
> >> > >
> >> > > ioctl(SET_TYPE, 0x42);
> >> > > write(perf_fd, binaryblob, size);
> >> > > ioctl(SET_TYPE, 0);
> >> > > dprintf(perf_fd, "String");
> >> > >
> >> > > which is fine for use cases when the type doesn't change often,
> >> > > but would double the amount of syscalls when every single event
> >> > > is of a different type. Perhaps there still should be a
> >> > > "generating ioctl" taking both type and data/size in one go?
> >> >
> >> > Absolutely, there should be a single syscall.
> >>
> >> Yeah, it's my gut feeling as well. I just wonder if we still want to
> >> keep write() handler for operations on perf fds? This seems natural -
> >> takes data buffer and its size. The only issue is the type.
> >>
> >> > I'd even argue it should be a new prctl(): that way we could both
> >> > generate user events for specific perf fds, but also into any
> >> > currently active context (that allows just generation/injection
> >> > of user events). In the latter case we might have no fd to work
> >> > off from.
> >>
> >> When Arnaldo suggested that the "user events" could be used by perf
> >> trace, it was exactly my first thought. I just didn't have answer how to
> >> present it to the user (an extra syscall didn't seem like a good idea),
> >> but prctl seems interesting, something like this?
> >>
> >> prctl(PR_TRACE_UEVENT, type, size, data, 0);
> >
> > Exactly!
> >
> >> How would we select tasks that can write to a given buffer? Maybe an
> >> ioctl() on a perf fd? Something like this?
> >>
> >> ioctl(perf_fd, PERF_EVENT_IOC_ENABLE_UEVENT, pid);
> >> ioctl(perf_fd, PERF_EVENT_IOC_DISABLE_UEVENT, pid);
> >
> > No, I think there's a simpler way: this should be a regular
> > perf_attr flag, which defaults to '0' (tasks cannot do this), but
> > which can be set to 1 if the profiler explicitly allows such
> > event injection.
>
> Maybe we just don't even need any permission at all. Which harm can
> that do if this only ever generate events to those interested in the
> relevant perf context? It could be a simple tracepoint BTW.
Yeah, Ingo already pointed it out (that a non-root task can't trace root
tasks anyway).
> Oh and I really like the fact we don't use a syscall that requires an
> fd. The tracee really shouldn't be aware of the tracer.
Agreed, I'll look at solution with prctl() this week.
Pawel
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] perf: Add sampling of the raw monotonic clock
[not found] ` <1411050873-9310-2-git-send-email-pawel.moll-5wv7dgnIgG8@public.gmane.org>
@ 2014-09-29 15:28 ` Peter Zijlstra
[not found] ` <20140929152832.GL4140-IIpfhp3q70z/8w/KjCw3T+5/BudmfyzbbVWyRVo5IupeoWH0uzbU5w@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2014-09-29 15:28 UTC (permalink / raw)
To: Pawel Moll
Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
Arnaldo Carvalho de Melo, John Stultz,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner
On Thu, Sep 18, 2014 at 03:34:32PM +0100, Pawel Moll wrote:
> @@ -4456,6 +4459,13 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
> data->cpu_entry.cpu = raw_smp_processor_id();
> data->cpu_entry.reserved = 0;
> }
> +
> + if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC) {
> + struct timespec now;
> +
> + getrawmonotonic(&now);
> + data->clock_raw_monotonic = timespec_to_ns(&now);
> + }
> }
>
This cannot work, getrawmonotonic() isn't NMI-safe and there's
nothing stopping this being used from NMI context.
Also getrawmonotonic() + timespec_to_ns() will make tglx sad; he's just
done a tree-wide eradication of silly conversions and now you're adding
a ns -> timespec -> ns dance right back.
I _think_ you want ktime_get_mono_fast_ns(), but this does bring us
right back to the question/discussion on which timebase you'd want to
sync against. MONO does make sense for most cases, but I think we've had
fairly sane stories for people wanting to sync against other clocks.
Ah well..
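Roughly why NMI context matters here: getrawmonotonic() reads the timekeeping data under a sequence counter, and a reader that interrupts the writer on the same CPU would keep retrying until the writer finishes, which it cannot do while the NMI runs. As far as I understand, ktime_get_mono_fast_ns() avoids this with a "latch": two copies of the data, with the low bit of the sequence counter telling readers which copy is stable. The sketch below shows that pattern with C11 atomics in user space; struct clk_latch, latch_update() and latch_read() are hypothetical names, and the fields are placeholders rather than the kernel's timekeeper layout.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Placeholder clock state; NOT the kernel's timekeeper layout. */
struct clk_snap {
	_Atomic uint64_t base_ns;	/* time at the last update */
	_Atomic uint64_t mult;		/* made-up conversion factor */
};

struct clk_latch {
	atomic_uint seq;		/* low bit selects the copy to read */
	struct clk_snap copy[2];
};

static void snap_store(struct clk_snap *c, uint64_t base_ns, uint64_t mult)
{
	atomic_store_explicit(&c->base_ns, base_ns, memory_order_relaxed);
	atomic_store_explicit(&c->mult, mult, memory_order_relaxed);
}

/* Writer (single writer assumed): at every moment one copy is stable. */
static void latch_update(struct clk_latch *l, uint64_t base_ns, uint64_t mult)
{
	unsigned s = atomic_load_explicit(&l->seq, memory_order_relaxed);

	/* Odd seq: readers use copy[1] while copy[0] is rewritten. */
	atomic_store_explicit(&l->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	snap_store(&l->copy[0], base_ns, mult);

	/* Even seq: readers use the fresh copy[0] while copy[1] is rewritten. */
	atomic_store_explicit(&l->seq, s + 2, memory_order_release);
	atomic_thread_fence(memory_order_release);
	snap_store(&l->copy[1], base_ns, mult);
}

/* Reader: never waits for the writer to finish; it only retries if an
 * update started while it was reading. A reader that interrupts the
 * writer on the same CPU sees an unchanged seq, reads the stable copy
 * and returns. */
static uint64_t latch_read(struct clk_latch *l, uint64_t *mult_out)
{
	unsigned s;
	uint64_t base, mult;

	do {
		s = atomic_load_explicit(&l->seq, memory_order_acquire);
		struct clk_snap *c = &l->copy[s & 1];

		base = atomic_load_explicit(&c->base_ns, memory_order_relaxed);
		mult = atomic_load_explicit(&c->mult, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while (atomic_load_explicit(&l->seq, memory_order_relaxed) != s);

	*mult_out = mult;
	return base;
}
```

A plain seqcount reader, by contrast, would spin while the counter is odd, and that is exactly the state an NMI can observe forever.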
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
2014-09-18 14:34 ` [PATCH 2/2] perf: Userspace software event and ioctl Pawel Moll
2014-09-23 17:02 ` Pawel Moll
@ 2014-09-29 15:32 ` Peter Zijlstra
[not found] ` <20140929153257.GM4140-IIpfhp3q70z/8w/KjCw3T+5/BudmfyzbbVWyRVo5IupeoWH0uzbU5w@public.gmane.org>
1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2014-09-29 15:32 UTC (permalink / raw)
To: Pawel Moll
Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
Arnaldo Carvalho de Melo, John Stultz, linux-kernel, linux-api
On Thu, Sep 18, 2014 at 03:34:33PM +0100, Pawel Moll wrote:
> This patch adds a PERF_COUNT_SW_USERSPACE_EVENT type,
> which can be generated by user with PERF_EVENT_IOC_ENTRY
> ioctl command, which injects an event of said type into
> the perf buffer.
>
> The ioctl takes a pointer to struct perf_event_userspace
> as an argument. The structure begins with a 64-bit
> integer type value, which determines meaning of the
> following content (size/data pair). Type 0 are defined
> as zero-terminated strings, other types are defined by
> userspace (the perf tool will contain a list of
> known values with reference implementation of data
> content parsers).
>
> Possible use cases for this feature:
>
> - "perf_printf" like mechanism to add logging messages
> to one's perf session; an example implementation:
>
> int perf_printf(int perf_fd, const char *fmt, ...)
> {
> struct perf_event_userspace *event;
> int size;
> va_list ap, ap_len;
> int err;
>
> va_start(ap, fmt);
>
> /* measure on a copy: a va_list must not be reused after vsnprintf() */
> va_copy(ap_len, ap);
> size = vsnprintf(NULL, 0, fmt, ap_len) + 1;
> va_end(ap_len);
> event = size > 0 ? malloc(sizeof(*event) + size) : NULL;
> if (!event) {
> va_end(ap);
> return -1;
> }
>
> event->type = 0;
> event->size = size;
> vsnprintf(event->data, size, fmt, ap);
>
> va_end(ap);
>
> err = ioctl(perf_fd, PERF_EVENT_IOC_USERSPACE, event);
>
> free(event);
>
> return err < 0 ? err : size - 1;
> }
>
> - "perf_printf" used by for perf trace tool,
> where certain traced process' calls are intercepted
> (eg. using LD_PRELOAD) and treated as logging
> requests, with it output redirected into the
> perf buffer
>
> - synchronisation of performance data generated in
> user space with the perf stream coming from the kernel.
> For example, the marker can be inserted by a JIT engine
> after it generated portion of the code, but before the
> code is executed for the first time, allowing the
> post-processor to pick the correct debugging
> information.
>
> - other example is a system profiling tool taking data
> from other sources than just perf, which generates a marker
> at the beginning at at the end of the session
> (also possibly periodically during the session) to
> synchronise kernel timestamps with clock values
> obtained in userspace (gtod or raw_monotonic).
Feel free to use up to 70 chars wide text in Changelogs. Most editors
have support for reflowing text. No need to keep it this narrow.
Also none of the many words above describe
PERF_SAMPLE_USERSPACE_EVENT(), wth is that about?
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] perf: Add sampling of the raw monotonic clock
[not found] ` <20140929152832.GL4140-IIpfhp3q70z/8w/KjCw3T+5/BudmfyzbbVWyRVo5IupeoWH0uzbU5w@public.gmane.org>
@ 2014-09-29 15:45 ` Pawel Moll
0 siblings, 0 replies; 22+ messages in thread
From: Pawel Moll @ 2014-09-29 15:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
Arnaldo Carvalho de Melo, John Stultz,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Thomas Gleixner
On Mon, 2014-09-29 at 16:28 +0100, Peter Zijlstra wrote:
> On Thu, Sep 18, 2014 at 03:34:32PM +0100, Pawel Moll wrote:
> > @@ -4456,6 +4459,13 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
> > data->cpu_entry.cpu = raw_smp_processor_id();
> > data->cpu_entry.reserved = 0;
> > }
> > +
> > + if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC) {
> > + struct timespec now;
> > +
> > + getrawmonotonic(&now);
> > + data->clock_raw_monotonic = timespec_to_ns(&now);
> > + }
> > }
> >
>
> This cannot work, getrawmonotonic() isn't NMI-safe and there's
> nothing stopping this being used from NMI context.
>
> Also getrawmonotonic() + timespec_to_ns() will make tglx sad, he's just
> done a tree-wide eradication of silly conversions and now you're adding
> a ns -> timespec -> ns dance right back.
Last thing I want is to make Thomas sad... For obvious reasons ;-)
> I _think_ you want ktime_get_mono_fast_ns(),
With pleasure, it's exactly what I need.
> but this does bring us
> right back to the question/discussion on which timebase you'd want to
> sync again. MONO does make sense for most cases, but I think we've had
> fairly sane stories for people wanting to sync against other clocks.
Yes. I've asked the same question somewhere in the thread.
ftrace has a switch and a selection of trace_clocks in
kernel/trace/trace.c. Do we want something similar (probably in integer
form, though) in perf_event.h, with an additional "flag" in struct
perf_event_attr? It could be used to pick a time source for the
PERF_SAMPLE_CLOCK (PERF_SAMPLE_TRACE_CLOCK?) sample.
Pawel
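Whatever clock is finally chosen, the user-side use of such samples is the same: every sample carrying both PERF_SAMPLE_TIME and the second clock value yields a (perf time, clock time) pair, and two such pairs let a tool map a timestamp taken in user space into the perf time domain. A minimal sketch, assuming CLOCK_MONOTONIC_RAW as the second clock and linear interpolation between two sync points (an assumption; the patch does not prescribe a method). struct sync_point and both helpers are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* One sample carrying both timestamps (hypothetical container). */
struct sync_point {
	uint64_t perf_ns;	/* PERF_SAMPLE_TIME value */
	uint64_t mono_ns;	/* raw monotonic value from the same sample */
};

/* User-space side: read the raw monotonic clock, converting to ns once. */
static uint64_t raw_mono_now_ns(void)
{
	struct timespec ts;
#ifdef CLOCK_MONOTONIC_RAW
	const clockid_t id = CLOCK_MONOTONIC_RAW;
#else
	const clockid_t id = CLOCK_MONOTONIC;	/* fallback on other systems */
#endif

	if (clock_gettime(id, &ts) != 0)
		return 0;
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Map a raw monotonic timestamp into the perf time domain by linear
 * interpolation between sync points a and b (extrapolating outside). */
static uint64_t mono_to_perf_ns(const struct sync_point *a,
				const struct sync_point *b, uint64_t mono_ns)
{
	int64_t dmono = (int64_t)(b->mono_ns - a->mono_ns);
	int64_t dperf = (int64_t)(b->perf_ns - a->perf_ns);
	int64_t off = (int64_t)(mono_ns - a->mono_ns);

	if (dmono == 0)		/* degenerate pair: assume a pure offset */
		return a->perf_ns + (uint64_t)off;

	/* Only deltas go through double, keeping precision for large values. */
	double scaled = (double)off * (double)dperf / (double)dmono;
	return a->perf_ns + (uint64_t)(int64_t)scaled;
}
```

Using the two nearest sync points, rather than a single offset, also absorbs slow drift between the two clocks.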
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
[not found] ` <20140929153257.GM4140-IIpfhp3q70z/8w/KjCw3T+5/BudmfyzbbVWyRVo5IupeoWH0uzbU5w@public.gmane.org>
@ 2014-09-29 15:53 ` Pawel Moll
2014-11-03 14:48 ` Tomeu Vizoso
0 siblings, 1 reply; 22+ messages in thread
From: Pawel Moll @ 2014-09-29 15:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
Arnaldo Carvalho de Melo, John Stultz,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On Mon, 2014-09-29 at 16:32 +0100, Peter Zijlstra wrote:
> Also none of the many words above describe
> PERF_SAMPLE_USERSPACE_EVENT(), wth is that about?
Hopefully the description of v2 does a better job of this:
http://thread.gmane.org/gmane.linux.kernel/1793272/focus=4813
where it's already called "UEVENT" and was generated by write().
Before you get into this, though, the most important outcomes of both v1
and v2 discussions:
* Ingo suggested prctl(PR_TRACE_UEVENT, type, size, data, 0) as the way
of generating such events (so the tracee doesn't have to know the fd to
do ioctl); Frederic seems to have the same thing in mind.
* Namhyung proposed sticking the userspace-originating events into the
buffer as PERF_RECORD_UEVENT rather than PERF_SAMPLE_UEVENT.
Working on making both happen now.
Pawel
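To make Namhyung's suggestion concrete: a PERF_RECORD_* style record starts with struct perf_event_header (type, misc, size), and a reader steps through the buffer using header.size. The sketch below builds and parses a hypothetical uevent record whose body is the type/size/data triple from the original proposal, checking that type 0 payloads are NUL-terminated. HYPOTHETICAL_PERF_RECORD_UEVENT, struct uevent_body (including its 32-bit field widths), append_uevent() and walk_records() are made up for illustration; only the header layout mirrors the real struct perf_event_header, and the 8-byte padding follows perf's usual record alignment as far as I know.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as struct perf_event_header in <linux/perf_event.h>. */
struct rec_header {
	uint32_t type;
	uint16_t misc;
	uint16_t size;		/* whole record, header included */
};

/* HYPOTHETICAL: no such record type or body exists upstream. */
#define HYPOTHETICAL_PERF_RECORD_UEVENT 0x7f00u
#define UEVENT_TYPE_STRING 0u

struct uevent_body {
	uint32_t type;		/* 0: NUL-terminated string, others user-defined */
	uint32_t size;		/* payload bytes (includes the NUL for type 0) */
};				/* payload follows, record padded to 8 bytes */

/* Append one uevent record; returns the new offset, or 'off' unchanged
 * if it does not fit. */
static size_t append_uevent(unsigned char *buf, size_t cap, size_t off,
			    uint32_t type, const void *data, uint32_t size)
{
	struct rec_header h = { HYPOTHETICAL_PERF_RECORD_UEVENT, 0, 0 };
	struct uevent_body b = { type, size };
	size_t total = (sizeof(h) + sizeof(b) + size + 7) & ~(size_t)7;

	if (total > UINT16_MAX || off + total > cap)
		return off;
	h.size = (uint16_t)total;
	memset(buf + off, 0, total);
	memcpy(buf + off, &h, sizeof(h));
	memcpy(buf + off + sizeof(h), &b, sizeof(b));
	memcpy(buf + off + sizeof(h) + sizeof(b), data, size);
	return off + total;
}

/* Step through records via header.size; call 'cb' (may be NULL) for each
 * uevent. Returns the number of uevents, or -1 on a malformed record. */
static int walk_records(const unsigned char *buf, size_t len,
			void (*cb)(uint32_t type, const char *data, uint32_t size))
{
	size_t off = 0;
	int n = 0;

	while (off + sizeof(struct rec_header) <= len) {
		struct rec_header h;
		struct uevent_body b;

		memcpy(&h, buf + off, sizeof(h));
		if (h.size < sizeof(h) || h.size > len - off)
			return -1;
		if (h.type == HYPOTHETICAL_PERF_RECORD_UEVENT) {
			const char *data;

			if (h.size < sizeof(h) + sizeof(b))
				return -1;
			memcpy(&b, buf + off + sizeof(h), sizeof(b));
			data = (const char *)buf + off + sizeof(h) + sizeof(b);
			if (b.size > h.size - sizeof(h) - sizeof(b))
				return -1;
			if (b.type == UEVENT_TYPE_STRING &&
			    (b.size == 0 || data[b.size - 1] != '\0'))
				return -1;
			if (cb)
				cb(b.type, data, b.size);
			n++;
		}
		off += h.size;	/* non-uevent records are skipped unchanged */
	}
	return n;
}
```

Because the reader only relies on header.size, existing tools that do not know the new record type can skip it safely.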
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
2014-09-29 15:53 ` Pawel Moll
@ 2014-11-03 14:48 ` Tomeu Vizoso
2014-11-03 15:04 ` Pawel Moll
0 siblings, 1 reply; 22+ messages in thread
From: Tomeu Vizoso @ 2014-11-03 14:48 UTC (permalink / raw)
To: Pawel Moll
Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
linux-kernel@vger.kernel.org, linux-api@vger.kernel.org
On 29 September 2014 17:53, Pawel Moll <pawel.moll@arm.com> wrote:
> On Mon, 2014-09-29 at 16:32 +0100, Peter Zijlstra wrote:
>> Also none of the many words above describe
>> PERF_SAMPLE_USERSPACE_EVENT(), wth is that about?
>
> Hopefully description of the v2 makes better job in this:
>
> http://thread.gmane.org/gmane.linux.kernel/1793272/focus=4813
>
> where it's already called "UEVENT" and was generated by write().
>
> Before you get into this, though, the most important outcomes of both v1
> and v2 discussions:
>
> * Ingo suggested prctl(PR_TRACE_UEVENT, type, size, data, 0) as the way
> of generating such events (so the tracee doesn't have to know the fd to
> do ioctl); Frederic seems to have the same on his mind.
>
> * Namhyung proposed sticking the userspace-originating events into the
> buffer as PERF_RECORD_UEVENT rather than PERF_SAMPLE_UEVENT.
>
> Working on making both happen now.
Hi Pawel,
are you still working on this? Would be happy to lend a hand if that
can speed things up.
Cheers,
Tomeu
> Pawel
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
2014-11-03 14:48 ` Tomeu Vizoso
@ 2014-11-03 15:04 ` Pawel Moll
0 siblings, 0 replies; 22+ messages in thread
From: Pawel Moll @ 2014-11-03 15:04 UTC (permalink / raw)
To: Tomeu Vizoso
Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
linux-kernel@vger.kernel.org, linux-api@vger.kernel.org
On Mon, 2014-11-03 at 14:48 +0000, Tomeu Vizoso wrote:
> On 29 September 2014 17:53, Pawel Moll <pawel.moll@arm.com> wrote:
> > On Mon, 2014-09-29 at 16:32 +0100, Peter Zijlstra wrote:
> >> Also none of the many words above describe
> >> PERF_SAMPLE_USERSPACE_EVENT(), wth is that about?
> >
> > Hopefully description of the v2 makes better job in this:
> >
> > http://thread.gmane.org/gmane.linux.kernel/1793272/focus=4813
> >
> > where it's already called "UEVENT" and was generated by write().
> >
> > Before you get into this, though, the most important outcomes of both v1
> > and v2 discussions:
> >
> > * Ingo suggested prctl(PR_TRACE_UEVENT, type, size, data, 0) as the way
> > of generating such events (so the tracee doesn't have to know the fd to
> > do ioctl); Frederic seems to have the same on his mind.
> >
> > * Namhyung proposed sticking the userspace-originating events into the
> > buffer as PERF_RECORD_UEVENT rather than PERF_SAMPLE_UEVENT.
> >
> > Working on making both happen now.
>
> are you still working on this? Would be happy to lend a hand if that
> can speed things up.
By all means! In fact I'm typing commit messages right now and will post
the patches later today. Stay tuned and I'm looking forward to all
suggestions, reviews etc.
Cheers!
Pawel
^ permalink raw reply [flat|nested] 22+ messages in thread
Thread overview: 22+ messages
2014-09-18 14:34 [PATCH 0/2] perf: User/kernel time correlation and event generation Pawel Moll
2014-09-18 14:34 ` [PATCH 1/2] perf: Add sampling of the raw monotonic clock Pawel Moll
[not found] ` <1411050873-9310-2-git-send-email-pawel.moll-5wv7dgnIgG8@public.gmane.org>
2014-09-29 15:28 ` Peter Zijlstra
[not found] ` <20140929152832.GL4140-IIpfhp3q70z/8w/KjCw3T+5/BudmfyzbbVWyRVo5IupeoWH0uzbU5w@public.gmane.org>
2014-09-29 15:45 ` Pawel Moll
2014-09-18 14:34 ` [PATCH 2/2] perf: Userspace software event and ioctl Pawel Moll
2014-09-23 17:02 ` Pawel Moll
2014-09-24 7:49 ` Ingo Molnar
[not found] ` <20140924074942.GB3797-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-09-25 17:20 ` Pawel Moll
2014-09-25 18:33 ` Ingo Molnar
[not found] ` <20140925183342.GB6854-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-09-26 10:48 ` Pawel Moll
2014-09-26 11:23 ` Ingo Molnar
[not found] ` <20140926112312.GB9870-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-09-26 11:26 ` Pawel Moll
2014-09-26 11:31 ` Ingo Molnar
2014-09-27 17:14 ` Frederic Weisbecker
[not found] ` <CAFTL4hy1d8twv2tGxc4EhCeDm7ApnH7SuK26W1yaekKhCrPMZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-09-29 14:52 ` Pawel Moll
2014-09-29 15:32 ` Peter Zijlstra
[not found] ` <20140929153257.GM4140-IIpfhp3q70z/8w/KjCw3T+5/BudmfyzbbVWyRVo5IupeoWH0uzbU5w@public.gmane.org>
2014-09-29 15:53 ` Pawel Moll
2014-11-03 14:48 ` Tomeu Vizoso
2014-11-03 15:04 ` Pawel Moll
[not found] ` <1411050873-9310-1-git-send-email-pawel.moll-5wv7dgnIgG8@public.gmane.org>
2014-09-18 15:02 ` [PATCH 0/2] perf: User/kernel time correlation and event generation Christopher Covington
[not found] ` <541AF40B.7070604-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
2014-09-18 15:07 ` Pawel Moll
2014-09-18 15:48 ` Christopher Covington