From: "Michael Kerrisk (man-pages)" <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: Vince Weaver <vincent.weaver-e7X0jjDqjFGHXe+LvDLADg@public.gmane.org>
Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [patch] perf_event_open() updates for Linux 3.12
Date: Thu, 12 Dec 2013 16:46:08 +1300
Message-ID: <52A93180.6090807@gmail.com>
In-Reply-To: <alpine.DEB.2.10.1309201412380.26813-6xBS8L8d439fDsnSvq7Uq4Se7xf15W0s1dQoKJhdanU@public.gmane.org>

On 09/21/13 06:15, Vince Weaver wrote:
> 
> Below are the changes to perf_event_open.2 for the upcoming
> Linux 3.12 release.

Vince,

I just wanted to double check with you: everything in this 
old mail has now been applied via other patches, right?
(Indeed, I believe I have no more outstanding patches from you, 
right?)

Cheers,

Michael


> I'm not sure if sending these at 3.12-rc1 time is too early.
> 
> There are some pretty big changes this time, including an
> unfortunate ABI breakage with the cap_usr_rdpmc/cap_usr_time
> bits.
> 
> Signed-off-by: Vince Weaver <vincent.weaver-e7X0jjDqjFGHXe+LvDLADg@public.gmane.org>
> 
> diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2
> index 71a09d5..7b87c4c 100644
> --- a/man2/perf_event_open.2
> +++ b/man2/perf_event_open.2
> @@ -468,6 +468,13 @@ This counts the number of emulation faults.
>  The kernel sometimes traps on unimplemented instructions
>  and emulates them for user space.
>  This can negatively impact performance.
> +.TP
> +.BR PERF_COUNT_SW_DUMMY " (Since Linux 3.12)"
> +This is a placeholder event that counts nothing.
> +Informational sample record types such as mmap or comm
> +must be associated with an active event.
> +This dummy event allows gathering such records without requiring
> +a counting event.
>  .RE
>  
>  .RS
> @@ -680,6 +687,27 @@ Records the data source: where in the memory hierarchy
>  the data associated with the sampled instruction came from.
>  This is only available if the underlying hardware
>  supports this feature.
> +.TP
> +.BR PERF_SAMPLE_IDENTIFIER " (Since Linux 3.12)"
> +Places the SAMPLE_ID value in a fixed position in the record,
> +either at the beginning (for sample events) or at the end
> +(for non-sample events).
> +
> +This was necessary because a sample stream may have
> +records from various different event sources with different
> +.I sample_type
> +settings.
> +Parsing the event stream properly was not possible because the 
> +format of the record was needed to find SAMPLE_ID, but
> +the format could not be found without knowing what
> +event the sample belonged to (causing a circular
> +dependency).
> +
> +This new
> +.B PERF_SAMPLE_IDENTIFIER
> +setting makes the event stream always parsable
> +by putting SAMPLE_ID in a fixed location, even though
> +it means having duplicate SAMPLE_ID values in records.
>  .RE
>  .TP
>  .IR "read_format"
> @@ -860,12 +888,33 @@ field, but enables including data mmap events
>  in the ring-buffer.
>  .TP
>  .IR "sample_id_all" " (Since Linux 2.6.38)"
> -If set, then TID, TIME, ID, CPU, and STREAM_ID can
> +If set, then TID, TIME, ID, STREAM_ID, and CPU can
>  additionally be included in
>  .RB non- PERF_RECORD_SAMPLE s
>  if the corresponding
>  .I sample_type
>  is selected.
> +
> +If 
> +.B PERF_SAMPLE_IDENTIFIER
> +is specified then an additional ID value is included
> +as the last value to ease parsing the record stream.
> +This may lead to the
> +.I id 
> +value appearing twice.
> +
> +The layout is described by this pseudo-structure:
> +.in +4n
> +.nf
> +struct sample_id {
> +    { u32 pid, tid; } /* if PERF_SAMPLE_TID set        */
> +    { u64 time;     } /* if PERF_SAMPLE_TIME set       */
> +    { u64 id;       } /* if PERF_SAMPLE_ID set         */
> +    { u64 stream_id;} /* if PERF_SAMPLE_STREAM_ID set  */
> +    { u32 cpu, res; } /* if PERF_SAMPLE_CPU set        */
> +    { u64 id;       } /* if PERF_SAMPLE_IDENTIFIER set */
> +};
> +.fi
>  .TP
>  .IR "exclude_host" " (Since Linux 3.2)"
>  Do not measure time spent in VM host.
> @@ -879,6 +928,11 @@ Do not include kernel callchains.
>  .IR "exclude_callchain_user" " (Since Linux 3.7)"
>  Do not include user callchains.
>  .TP
> +.IR "mmap2" " (Since Linux 3.12)"
> +Include an extended mmap record that contains enough
> +additional information to uniquely identify
> +shared mappings.
> +.TP
>  .IR "wakeup_events" ", " "wakeup_watermark"
>  This union sets how many samples
>  .RI ( wakeup_events )
> @@ -1142,8 +1196,13 @@ struct perf_event_mmap_page {
>      __u64 time_running;     /* time event on CPU */
>      union {
>          __u64   capabilities;
> -        __u64   cap_usr_time  : 1,
> -                cap_usr_rdpmc : 1,
> +        struct {
> +            __u64   cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
> +                    cap_bit0_is_deprecated : 1,
> +                    cap_user_rdpmc         : 1,
> +                    cap_user_time          : 1,
> +                    cap_user_time_zero     : 1,
> +        };
>      };
>      __u16   pmc_width;
>      __u16   time_shift;
> @@ -1173,8 +1232,9 @@ A seqlock for synchronization.
>  A unique hardware counter identifier.
>  .TP
>  .I offset
> -.\" FIXME clarify
> -Add this to hardware counter value??
> +When using rdpmc for reads this offset value
> +must be added to the one returned by rdpmc to get
> +the current total event count.
>  .TP
>  .I time_enabled
>  Time the event was active.
> @@ -1182,10 +1242,45 @@ Time the event was active.
>  .I time_running
>  Time the event was running.
>  .TP
> +.IR cap_usr_time " / " cap_usr_rdpmc " / " cap_bit0 " (Since Linux 3.4)"
> +There was a bug in the definition of 
> +.I cap_usr_time
> +and
> +.I cap_usr_rdpmc
> +from Linux 3.4 until Linux 3.11.
> +Both bits were defined to point to the same location, so it was
> +impossible to know if 
>  .I cap_usr_time
> -User time capability.
> +or
> +.I cap_usr_rdpmc
> +was actually set.
> +
> +Starting with Linux 3.12, these are renamed to
> +.I cap_bit0
> +and you should use the new
> +.I cap_user_time
> +and
> +.I cap_user_rdpmc
> +fields instead.
> +
>  .TP
> +.IR cap_bit0_is_deprecated " (Since Linux 3.12)"
> +If set this bit indicates that the kernel supports
> +the properly separated
> +.I cap_user_time
> +and
> +.I cap_user_rdpmc
> +bits.
> +
> +If not set, it indicates an older kernel where
> +.I cap_usr_time
> +and
>  .I cap_usr_rdpmc
> +map to the same bit and thus both features should
> +be used with caution.
> +
> +.TP
> +.IR cap_user_rdpmc " (Since Linux 3.12)" 
>  If the hardware supports user-space read of performance counters
>  without syscall (this is the "rdpmc" instruction on x86), then
>  the following code can be used to do a read:
> @@ -1195,7 +1290,6 @@ the following code can be used to do a read:
>  u32 seq, time_mult, time_shift, idx, width;
>  u64 count, enabled, running;
>  u64 cyc, time_offset;
> -s64 pmc = 0;
>  
>  do {
>      seq = pc\->lock;
> @@ -1215,7 +1309,7 @@ do {
>  
>      if (pc\->cap_usr_rdpmc && idx) {
>          width = pc\->pmc_width;
> -        pmc = rdpmc(idx \- 1);
> +        count += rdpmc(idx \- 1);
>      }
>  
>      barrier();
> @@ -1223,6 +1317,16 @@ do {
>  .fi
>  .in
>  .TP
> +.IR cap_user_time " (Since Linux 3.12)"
> +This bit indicates the hardware has a constant, non-stop
> +timestamp counter (TSC on x86).
> +.TP
> +.IR cap_user_time_zero " (Since Linux 3.12)"
> +Indicates the presence of
> +.I time_zero
> +which allows mapping timestamp values to
> +the hardware clock.
> +.TP
>  .I pmc_width
>  If
>  .IR cap_usr_rdpmc ,
> @@ -1274,6 +1378,27 @@ enabled and possible running (if idx), improving the scaling:
>      count = quot * enabled + (rem * enabled) / running;
>  .fi
>  .TP
> +.IR time_zero " (Since Linux 3.12)"
> +
> +If 
> +.I cap_user_time_zero
> +is set then the hardware clock (the TSC timestamp counter on x86) 
> +can be calculated from the
> +.IR time_zero ", " time_mult ", and " time_shift " values:"
> +.nf
> +    time = timestamp - time_zero;
> +    quot = time / time_mult;
> +    rem  = time % time_mult;
> +    cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
> +.fi
> +And vice versa:
> +.nf
> +    quot = cyc >> time_shift;
> +    rem  = cyc & ((1 << time_shift) - 1);
> +    timestamp = time_zero + quot * time_mult +
> +        ((rem * time_mult) >> time_shift);
> +.fi
> +.TP
>  .I data_head
>  This points to the head of the data section.
>  The value continuously increases, it does not wrap.
> @@ -1385,6 +1510,7 @@ The values in the corresponding record (that follows the header)
>  depend on the
>  .I type
>  selected as shown.
> +
>  .RS
>  .TP 4
>  .B PERF_RECORD_MMAP
> @@ -1416,6 +1542,7 @@ struct {
>      struct perf_event_header header;
>      u64 id;
>      u64 lost;
> +    struct sample_id sample_id;
>  };
>  .fi
>  .in
> @@ -1437,6 +1564,7 @@ struct {
>      struct perf_event_header header;
>      u32 pid, tid;
>      char comm[];
> +    struct sample_id sample_id;
>  };
>  .fi
>  .in
> @@ -1451,6 +1579,7 @@ struct {
>      u32 pid, ppid;
>      u32 tid, ptid;
>      u64 time;
> +    struct sample_id sample_id;
>  };
>  .fi
>  .in
> @@ -1465,6 +1594,7 @@ struct {
>      u64 time;
>      u64 id;
>      u64 stream_id;
> +    struct sample_id sample_id;
>  };
>  .fi
>  .in
> @@ -1479,6 +1609,7 @@ struct {
>      u32 pid, ppid;
>      u32 tid, ptid;
>      u64 time;
> +    struct sample_id sample_id;
>  };
>  .fi
>  .in
> @@ -1492,6 +1623,7 @@ struct {
>      struct perf_event_header header;
>      u32 pid, tid;
>      struct read_format values;
> +    struct sample_id sample_id;
>  };
>  .fi
>  .in
> @@ -1503,6 +1635,7 @@ This record indicates a sample.
>  .nf
>  struct {
>      struct perf_event_header header;
> +    u64   sample_id;  /* if PERF_SAMPLE_IDENTIFIER */
>      u64   ip;         /* if PERF_SAMPLE_IP */
>      u32   pid, tid;   /* if PERF_SAMPLE_TID */
>      u64   time;       /* if PERF_SAMPLE_TIME */
> @@ -1531,6 +1664,16 @@ struct {
>  .fi
>  .RS 4
>  .TP 4
> +.I sample_id
> +If
> +.B PERF_SAMPLE_IDENTIFIER
> +is enabled, a 64-bit unique ID is included.
> +This is a duplication of the 
> +.B PERF_SAMPLE_ID
> +.I id
> +value, but included at the beginning of the sample
> +so parsers can easily obtain the value.
> +.TP
>  .I ip
>  If
>  .B PERF_SAMPLE_IP
> @@ -1855,6 +1998,29 @@ OS fault handler
>  .PD
>  .RE
>  .RE
> +.TP
> +.B PERF_RECORD_MMAP2
> +This record includes information on mmap() calls.
> +It includes extended fields not available with
> +the
> +.B PERF_RECORD_MMAP
> +record that allow uniquely identifying shared mappings.
> +.in +4n
> +.nf
> +struct {
> +    struct perf_event_header header;
> +    u32 pid, tid;
> +    u64 addr;
> +    u64 len;
> +    u64 pgoff;
> +    u32 maj;
> +    u32 min;
> +    u64 ino;
> +    u64 ino_generation;
> +    char filename[];
> +    struct sample_id sample_id;
> +};
> +.fi
>  .RE
>  .RE
>  .SS Signal overflow
> @@ -1994,6 +2160,12 @@ output should be ignored.
>  This adds an ftrace filter to this event.
>  
>  The argument is a pointer to the desired ftrace filter.
> +.TP
> +.BR PERF_EVENT_IOC_ID " (Since Linux 3.12)"
> +Returns the event ID value for the given event fd.
> +
> +The argument is a pointer to a 64-bit unsigned integer
> +to hold the result.
>  .SS Using prctl
>  A process can enable or disable all the event groups that are
>  attached to it using the
> @@ -2200,6 +2372,17 @@ ioctl argument was broken and would repeatedly operate
>  on the event specified rather than iterating across
>  all sibling events in a group.
>  
> +From Linux 3.4 to Linux 3.11 the mmap
> +.I cap_usr_rdpmc
> +and
> +.I cap_usr_time
> +bits mapped to the same location.
> +Code should migrate to the new
> +.I cap_user_rdpmc
> +and
> +.I cap_user_time
> +fields instead.
> +
>  Always double-check your results!
>  Various generalized events have had wrong values.
>  For example, retired branches measured
> 
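
For reference, the time_zero arithmetic quoted above can be written
out as two small C helpers. This is only a sketch: struct time_conv
and both function names are made up for illustration, and it assumes
the three values were already read consistently from the mmap page
(inside the seqlock loop) with cap_user_time_zero set.

```c
#include <stdint.h>

/* Hypothetical holder for values copied out of perf_event_mmap_page. */
struct time_conv {
    uint64_t time_zero;
    uint32_t time_mult;
    uint16_t time_shift;
};

/* perf timestamp -> hardware clock cycles (first formula in the patch). */
static uint64_t timestamp_to_cyc(const struct time_conv *tc, uint64_t timestamp)
{
    uint64_t time = timestamp - tc->time_zero;
    uint64_t quot = time / tc->time_mult;
    uint64_t rem  = time % tc->time_mult;

    return (quot << tc->time_shift) +
           (rem << tc->time_shift) / tc->time_mult;
}

/* Hardware clock cycles -> perf timestamp (the "vice versa" formula). */
static uint64_t cyc_to_timestamp(const struct time_conv *tc, uint64_t cyc)
{
    uint64_t quot = cyc >> tc->time_shift;
    /* Use a 64-bit 1 so the mask is also correct for shifts >= 31. */
    uint64_t rem  = cyc & (((uint64_t)1 << tc->time_shift) - 1);

    return tc->time_zero + quot * tc->time_mult +
           ((rem * tc->time_mult) >> tc->time_shift);
}
```

Splitting into quotient and remainder keeps the intermediate products
small; the two directions round down independently, so a round trip
is not guaranteed to be exact for every input.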

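The cap_bit0 / cap_bit0_is_deprecated rules from the patch could be
applied to the raw capabilities word roughly as follows. The mask
names, struct caps and decode_caps() are hypothetical, and the bit
positions assume the bitfields are allocated in declaration order
starting from the least significant bit (as GCC does on x86); code
that can include the kernel header should use the named fields
instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed positions: declaration order, starting at the LSB. */
#define CAP_BIT0               (UINT64_C(1) << 0)
#define CAP_BIT0_IS_DEPRECATED (UINT64_C(1) << 1)
#define CAP_USER_RDPMC         (UINT64_C(1) << 2)
#define CAP_USER_TIME          (UINT64_C(1) << 3)
#define CAP_USER_TIME_ZERO     (UINT64_C(1) << 4)

/* Hypothetical decoded view of the capabilities word. */
struct caps {
    bool reliable;   /* true if the separated 3.12+ bits were used */
    bool rdpmc;
    bool time;
    bool time_zero;
};

static struct caps decode_caps(uint64_t capabilities)
{
    struct caps c = { false, false, false, false };

    if (capabilities & CAP_BIT0_IS_DEPRECATED) {
        /* Linux 3.12 or later: each feature has its own bit. */
        c.reliable  = true;
        c.rdpmc     = (capabilities & CAP_USER_RDPMC) != 0;
        c.time      = (capabilities & CAP_USER_TIME) != 0;
        c.time_zero = (capabilities & CAP_USER_TIME_ZERO) != 0;
    } else if (capabilities & CAP_BIT0) {
        /*
         * Linux 3.4 to 3.11: one shared bit, so it is unknown which
         * feature it refers to. Report both as "maybe" and leave
         * reliable false so the caller can treat them with caution.
         */
        c.rdpmc = true;
        c.time  = true;
    }
    return c;
}
```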

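To illustrate why PERF_SAMPLE_IDENTIFIER makes the stream parsable, a
parser could locate the ID without knowing the rest of the record
layout. This sketch assumes every event in the stream was opened with
PERF_SAMPLE_IDENTIFIER and sample_id_all set, and that the record has
already been copied out of the ring buffer into contiguous memory.
struct rec_header is a local mirror of struct perf_event_header, and
record_identifier() is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REC_SAMPLE 9  /* value of PERF_RECORD_SAMPLE in linux/perf_event.h */

/* Local mirror of struct perf_event_header, to keep the sketch standalone. */
struct rec_header {
    uint32_t type;
    uint16_t misc;
    uint16_t size;  /* total record size in bytes, header included */
};

/*
 * Stores the identifier of one record in *id and returns 0,
 * or returns -1 if the record is too short to contain one.
 */
static int record_identifier(const unsigned char *rec, uint64_t *id)
{
    struct rec_header h;
    size_t off;

    memcpy(&h, rec, sizeof(h));
    if (h.size < sizeof(h) + sizeof(*id))
        return -1;

    /* Samples: first u64 after the header. Others: last u64 in the record. */
    off = (h.type == REC_SAMPLE) ? sizeof(h) : h.size - sizeof(*id);
    memcpy(id, rec + off, sizeof(*id));
    return 0;
}
```

With the ID in hand, the parser can look up the owning event's
sample_type and only then decode the remaining fields.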
-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/