* [PATCH 00/11] Framework to collect gpu metrics using i915 perf infrastructure
From: sourab.gupta @ 2016-02-16  5:27 UTC
  To: intel-gfx; +Cc: Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This series adds a framework for collecting gpu performance metrics
associated with the command stream of a particular engine. These metrics
include OA reports, timestamps, mmio metrics, etc., and are collected
around batchbuffer boundaries.

This work utilizes the underlying infrastructure introduced in Robert Bragg's
patches for collecting periodic OA counter snapshots (based on Haswell):
https://lists.freedesktop.org/archives/intel-gfx/2016-February/086909.html

This patch set is based on the Gen8+ version of Robert's patch series, which
can be found here: https://github.com/rib/linux/tree/wip/rib/oa-next
Those patches have not yet been posted individually to the mailing list,
which I hope doesn't significantly hinder review of the work proposed here.

Compared to the series sent earlier, this one is based on the drm i915 ioctl
based implementation (which can be referred to in Robert's work). The design
has accordingly been changed (and simplified), since some earlier core perf
assumptions have gone away.

A few salient features are listed below:
* Ability to collect command stream based OA reports on the render engine,
  in conjunction with the periodic reports generated by the OA unit. These
  are collected in separate buffers and forwarded to userspace in overall
  timestamp order. Userspace differentiates the two kinds of sample by the
  value of the OA sample source field.

* Ability to collect timestamps and mmio metrics associated with the
  command stream of any particular gpu engine. The sample metrics to be
  collected are requested by the userspace client in the properties
  associated with the stream being opened (a hypothetical open-time sketch
  follows this list). The samples generated depend on the sample flags
  originally requested in the stream properties.

* Ability to collect metadata such as pid, tags, etc. along with the
  samples. These are collected at the time the commands are inserted into
  the command stream of a particular gpu engine, and forwarded along with
  the samples.

* Multiple streams belonging to different engines can be opened
  concurrently (with at most one open stream per engine). This allows us to
  collect samples from several gpu engines at the same time.

* The different stages of a single workload (belonging to a single
  context) can be delimited using the 'execbuffer tagging' mechanism
  introduced here. For example, in the media pipeline the CodecHAL encoding
  stage has a single context but involves multiple stages such as Scaling,
  ME, MBEnc and PAK, each issued through a separate execbuffer call. The
  generated samples need to carry this information so they can be
  associated with the particular workload stage. A tag sample_type, passed
  in by userspace during the execbuffer ioctl, fulfills this requirement.
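
For illustration, a minimal hypothetical userspace sketch of opening an OA
stream with such properties follows. It assumes the DRM_IOCTL_I915_PERF_OPEN
ioctl and the metrics-set/format property names from Robert's series (not
reproduced in this posting), and that properties are passed as u64 key/value
pairs, so treat those details as assumptions:

#include <stdint.h>
#include <sys/ioctl.h>
#include <i915_drm.h>

int open_oa_stream(int drm_fd, uint64_t metrics_set, uint64_t oa_format,
		   uint64_t oa_exponent)
{
	uint64_t props[] = {
		/* Property names below follow Robert's series (assumed) */
		DRM_I915_PERF_OA_METRICS_SET_PROP, metrics_set,
		DRM_I915_PERF_OA_FORMAT_PROP, oa_format,
		DRM_I915_PERF_OA_EXPONENT_PROP, oa_exponent,
		/* Request the sample source field added in this series */
		DRM_I915_PERF_SAMPLE_OA_SOURCE_PROP, 1,
		/* Request the raw OA reports themselves */
		DRM_I915_PERF_SAMPLE_OA_PROP, 1,
	};
	struct drm_i915_perf_open_param param = {
		.flags = 0,
		.num_properties = sizeof(props) / (2 * sizeof(uint64_t)),
		.properties_ptr = (uintptr_t)props,
	};

	/* On success this returns a stream fd that samples are read() from */
	return ioctl(drm_fd, DRM_IOCTL_I915_PERF_OPEN, &param);
}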


I am looking for feedback on the design proposed here, particularly
pertaining to the mechanics of collecting metrics by inserting commands
into the command stream of the associated gpu engines, sample generation
according to the requested sample flags in the stream properties, the
concurrent operation of different streams collecting samples from multiple
gpu engines, and any similar design/implementation aspects.

A few open issues which I'm working on include:
* In case both the timestamp and OA sample types are requested for the
  render engine, the timestamp should be derivable from the OA report
  alone, and we should not need to insert separate commands for dumping
  timestamps. We do, though, need to apply the relevant timestamp base
  conversion from OA timestamps to ns (see the sketch after this list).

* Sample consistency has to be maintained between the periodic OA reports
  and the ones generated by the command stream. This implies, for example,
  that if the pid sample_type is requested, the most recent pid collected
  in the CS samples should be used to populate the relevant field in the
  periodic samples. Likewise, the 'ctx_id' field in periodic OA reports
  needs to be deduced from the report and mapped to
  'intel_context::global_id'.
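
As a rough sketch of the timestamp base conversion mentioned in the first
item (the timestamp frequency is platform specific and must come from the
platform; treat the helper below as an assumption, not code from this
series):

/* Hedged sketch: convert a raw 32-bit OA timestamp to ns, given the
 * platform's timestamp frequency in Hz. Needs <linux/math64.h> and
 * NSEC_PER_SEC from <linux/time.h>. */
static inline u64 oa_ts_to_ns(u32 ts, u64 ts_freq_hz)
{
	return div64_u64((u64)ts * NSEC_PER_SEC, ts_freq_hz);
}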

These open issues, though, shouldn't distract us too much from reviewing
the general mechanism proposed here; they can be ironed out subsequently if
there's general agreement on the design.

Also, one of the prerequisites for this work is the presence of a globally
unique id associated with each context. The present context id is specific
to a drm fd, and as such it can't be used to globally associate the
generated reports with the corresponding context scheduled from userspace.
The first few patches in the series introduce the globally unique context
id, and the subsequent ones introduce the framework for collecting the
metrics.

Robert Bragg (2):
  drm/i915: Constrain intel_context::global_id to 20 bits
  drm/i915: return ctx->global_id from intel_execlists_ctx_id()

Sourab Gupta (9):
  drm/i915: Introduce global id for contexts
  drm/i915: Add ctx getparam ioctl parameter to retrieve ctx global id
  drm/i915: Expose OA sample source to userspace
  drm/i915: Framework for capturing command stream based OA reports
  drm/i915: Add support for having pid output with OA report
  drm/i915: Add support to add execbuffer tags to OA counter reports
  drm/i915: Extend i915 perf framework for collecting timestamps on all
    gpu engines
  drm/i915: Support opening multiple concurrent perf streams
  drm/i915: Support for capturing MMIO register values

 drivers/gpu/drm/i915/i915_debugfs.c        |    7 +-
 drivers/gpu/drm/i915/i915_drv.h            |   68 +-
 drivers/gpu/drm/i915/i915_gem_context.c    |   23 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    5 +
 drivers/gpu/drm/i915/i915_perf.c           | 1300 +++++++++++++++++++++++++---
 drivers/gpu/drm/i915/i915_reg.h            |    2 +
 drivers/gpu/drm/i915/intel_lrc.c           |   26 +-
 drivers/gpu/drm/i915/intel_lrc.h           |    2 +-
 include/uapi/drm/i915_drm.h                |   72 ++
 9 files changed, 1349 insertions(+), 156 deletions(-)

-- 
1.9.1


* [PATCH 01/11] drm/i915: Introduce global id for contexts
From: sourab.gupta @ 2016-02-16  5:27 UTC
  To: intel-gfx; +Cc: Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

The current context user handles are specific to a drm file instance.
Some usecases may require a global id for contexts. For example, a system
level GPU profiler tool may rely on global context ids to associate
performance snapshots with individual contexts.

This global id may further be used to provide a unique context id to the
hw.

In this patch, the global ids are allocated from a separate cyclic idr
and can be utilized for any of the usecases described above.

v2: Following Chris' suggestion, implemented a separate idr for holding
global ids for contexts, as opposed to overloading the file specific
ctx->user_handle for this purpose. This global id can further be used
wherever the hw has to be programmed with a unique ctx id, though this
patch just introduces the global id itself.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  3 +++
 drivers/gpu/drm/i915/i915_gem_context.c | 21 +++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3ea4656..c409c8f 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -875,6 +875,7 @@ struct i915_ctx_hang_stats {
 struct intel_context {
 	struct kref ref;
 	int user_handle;
+	int global_id;
 	uint8_t remap_slice;
 	struct drm_i915_private *i915;
 	int flags;
@@ -1868,6 +1869,8 @@ struct drm_i915_private {
 
 	bool preserve_bios_swizzle;
 
+	struct idr global_ctx_idr;
+
 	/* overlay */
 	struct intel_overlay *overlay;
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index a53f591..6f38810 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -256,6 +256,18 @@ __create_hw_context(struct drm_device *dev,
 
 	ctx->file_priv = file_priv;
 	ctx->user_handle = ret;
+
+	/* TODO: If required, this global id can be used for programming the hw
+	 * fields too. In that case, we'll have to take care of hw restrictions
+	 * while allocating idr. e.g. for some hw, we may not have full 32 bits
+	 * available.
+	 */
+	ret = idr_alloc_cyclic(&dev_priv->global_ctx_idr,
+				ctx, 0, 0, GFP_KERNEL);
+	if (ret < 0)
+		goto err_out;
+
+	ctx->global_id = ret;
 	/* NB: Mark all slices as needing a remap so that when the context first
 	 * loads it will restore whatever remap state already exists. If there
 	 * is no remap info, it will be a NOP. */
@@ -280,6 +292,7 @@ i915_gem_create_context(struct drm_device *dev,
 			struct drm_i915_file_private *file_priv)
 {
 	const bool is_global_default_ctx = file_priv == NULL;
+	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct intel_context *ctx;
 	int ret = 0;
 
@@ -326,6 +339,7 @@ err_unpin:
 		i915_gem_object_ggtt_unpin(ctx->legacy_hw_ctx.rcs_state);
 err_destroy:
 	idr_remove(&file_priv->context_idr, ctx->user_handle);
+	idr_remove(&dev_priv->global_ctx_idr, ctx->global_id);
 	i915_gem_context_unreference(ctx);
 	return ERR_PTR(ret);
 }
@@ -389,6 +403,7 @@ int i915_gem_context_init(struct drm_device *dev)
 			dev_priv->hw_context_size = 0;
 		}
 	}
+	idr_init(&dev_priv->global_ctx_idr);
 
 	ctx = i915_gem_create_context(dev, NULL);
 	if (IS_ERR(ctx)) {
@@ -416,6 +431,8 @@ void i915_gem_context_fini(struct drm_device *dev)
 	struct intel_context *dctx = dev_priv->ring[RCS].default_context;
 	int i;
 
+	idr_destroy(&dev_priv->global_ctx_idr);
+
 	if (dctx->legacy_hw_ctx.rcs_state) {
 		/* The only known way to stop the gpu from accessing the hw context is
 		 * to reset it. Do this as the very last operation to avoid confusing
@@ -478,6 +495,8 @@ static int context_idr_cleanup(int id, void *p, void *data)
 {
 	struct intel_context *ctx = p;
 
+	idr_remove(&ctx->i915->global_ctx_idr, ctx->global_id);
+
 	i915_gem_context_unreference(ctx);
 	return 0;
 }
@@ -890,6 +909,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 {
 	struct drm_i915_gem_context_destroy *args = data;
 	struct drm_i915_file_private *file_priv = file->driver_priv;
+	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct intel_context *ctx;
 	int ret;
 
@@ -907,6 +927,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 	}
 
 	idr_remove(&ctx->file_priv->context_idr, ctx->user_handle);
+	idr_remove(&dev_priv->global_ctx_idr, ctx->global_id);
 	i915_gem_context_unreference(ctx);
 	mutex_unlock(&dev->struct_mutex);
 
-- 
1.9.1


* [PATCH 02/11] drm/i915: Constrain intel_context::global_id to 20 bits
From: sourab.gupta @ 2016-02-16  5:27 UTC
  To: intel-gfx; +Cc: Jabin Wu, Sourab Gupta

From: Robert Bragg <robert@sixbynine.org>

This will allow the ID to be given to the HW as the unique context
identifier that's written, for example, to the context status buffer
on preemption and included in reports written by the OA unit.

Cc: Sourab Gupta <sourab.gupta@intel.com>
Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 6f38810..3a90e79 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -90,6 +90,10 @@
 #include "i915_drv.h"
 #include "i915_trace.h"
 
+/* With execlist scheduling we can program our own HW context ID but we
+ * are limited to 20 bits */
+#define I915_MAX_HW_CTX_ID ((1<<20)-1)
+
 /* This is a HW constraint. The value below is the largest known requirement
  * I've seen in a spec to date, and that was a workaround for a non-shipping
  * part. It should be safe to decrease this, but it's more future proof as is.
@@ -257,13 +261,8 @@ __create_hw_context(struct drm_device *dev,
 	ctx->file_priv = file_priv;
 	ctx->user_handle = ret;
 
-	/* TODO: If required, this global id can be used for programming the hw
-	 * fields too. In that case, we'll have take care of hw restrictions
-	 * fields too. In that case, we'll have to take care of hw restrictions
-	 * available.
-	 */
 	ret = idr_alloc_cyclic(&dev_priv->global_ctx_idr,
-				ctx, 0, 0, GFP_KERNEL);
+				ctx, 0, I915_MAX_HW_CTX_ID, GFP_KERNEL);
 	if (ret < 0)
 		goto err_out;
 
-- 
1.9.1


* [PATCH 03/11] drm/i915: return ctx->global_id from intel_execlists_ctx_id()
From: sourab.gupta @ 2016-02-16  5:27 UTC
  To: intel-gfx; +Cc: Jabin Wu, Sourab Gupta

From: Robert Bragg <robert@sixbynine.org>

The newly added intel_context::global_id is suitable (a globally unique
20 bit ID) for giving to the hardware as a unique context identifier.

Compared to using the pinned address of a logical ring context these IDs
are constant for the lifetime of a context whereas a context could be
repinned at different addresses during its lifetime.

Having a stable ID is useful when we need to buffer information
associated with a context based on this ID so the association can't be
lost. For example the OA unit writes out counter reports to a circular
buffer tagged with this ID and we want to be able to accurately filter
reports for a specific context, ideally without the added complexity of
tracking context re-pinning while the OA buffer may contain reports with
older IDs.
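
As a rough illustration, filtering reports by context then reduces to a
simple comparison. The dword-3 location and 20-bit width assumed below
follow the report parsing in the CS-capture patch later in this series,
not this patch:

/* Hedged sketch: match an OA report against a context's global_id */
static bool oa_report_matches_ctx(const u8 *report, u32 global_id)
{
	u32 report_ctx_id = *(const u32 *)(report + 12) & 0xfffff;

	return report_ctx_id == global_id;
}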

Cc: Sourab Gupta <sourab.gupta@intel.com>
Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_debugfs.c |  7 ++++---
 drivers/gpu/drm/i915/intel_lrc.c    | 22 ++++++++++------------
 drivers/gpu/drm/i915/intel_lrc.h    |  2 +-
 3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 8aab974..ff4a6fe 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1970,6 +1970,7 @@ static int i915_context_status(struct seq_file *m, void *unused)
 
 static void i915_dump_lrc_obj(struct seq_file *m,
 			      struct intel_engine_cs *ring,
+			      struct intel_context *ctx,
 			      struct drm_i915_gem_object *ctx_obj)
 {
 	struct page *page;
@@ -1984,7 +1985,7 @@ static void i915_dump_lrc_obj(struct seq_file *m,
 	}
 
 	seq_printf(m, "CONTEXT: %s %u\n", ring->name,
-		   intel_execlists_ctx_id(ctx_obj));
+		   intel_execlists_ctx_id(ctx));
 
 	if (!i915_gem_obj_ggtt_bound(ctx_obj))
 		seq_puts(m, "\tNot bound in GGTT\n");
@@ -2033,7 +2034,7 @@ static int i915_dump_lrc(struct seq_file *m, void *unused)
 	list_for_each_entry(ctx, &dev_priv->context_list, link) {
 		for_each_ring(ring, dev_priv, i) {
 			if (ring->default_context != ctx)
-				i915_dump_lrc_obj(m, ring,
+				i915_dump_lrc_obj(m, ring, ctx,
 						  ctx->engine[i].state);
 		}
 	}
@@ -2112,7 +2113,7 @@ static int i915_execlists(struct seq_file *m, void *data)
 
 			ctx_obj = head_req->ctx->engine[ring_id].state;
 			seq_printf(m, "\tHead request id: %u\n",
-				   intel_execlists_ctx_id(ctx_obj));
+				   intel_execlists_ctx_id(head_req->ctx));
 			seq_printf(m, "\tHead request tail: %u\n",
 				   head_req->tail);
 		}
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 40bda8d..4789555 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -260,7 +260,7 @@ int intel_sanitize_enable_execlists(struct drm_device *dev, int enable_execlists
 
 /**
  * intel_execlists_ctx_id() - get the Execlists Context ID
- * @ctx_obj: Logical Ring Context backing object.
+ * @ctx: LR context
  *
  * Do not confuse with ctx->id! Unfortunately we have a name overload
  * here: the old context ID we pass to userspace as a handler so that
@@ -269,15 +269,15 @@ int intel_sanitize_enable_execlists(struct drm_device *dev, int enable_execlists
  * interrupts.
  *
  * Return: 20-bits globally unique context ID.
+ *
+ * Further the ID given to HW can now be relied on to be constant for
+ * the lifetime of the context, unlike previously when we used an
+ * associated logical ring context address (which could be repinned at
+ * a different address).
  */
-u32 intel_execlists_ctx_id(struct drm_i915_gem_object *ctx_obj)
+u32 intel_execlists_ctx_id(struct intel_context *ctx)
 {
-	u32 lrca = i915_gem_obj_ggtt_offset(ctx_obj) +
-			LRC_PPHWSP_PN * PAGE_SIZE;
-
-	/* LRCA is required to be 4K aligned so the more significant 20 bits
-	 * are globally unique */
-	return lrca >> 12;
+	return ctx->global_id;
 }
 
 static bool disable_lite_restore_wa(struct intel_engine_cs *ring)
@@ -305,7 +305,7 @@ uint64_t intel_lr_context_descriptor(struct intel_context *ctx,
 		desc |= GEN8_CTX_L3LLC_COHERENT;
 	desc |= GEN8_CTX_PRIVILEGE;
 	desc |= lrca;
-	desc |= (u64)intel_execlists_ctx_id(ctx_obj) << GEN8_CTX_ID_SHIFT;
+	desc |= (u64)intel_execlists_ctx_id(ctx) << GEN8_CTX_ID_SHIFT;
 
 	/* TODO: WaDisableLiteRestore when we start using semaphore
 	 * signalling between Command Streamers */
@@ -475,9 +475,7 @@ static bool execlists_check_remove_request(struct intel_engine_cs *ring,
 					    execlist_link);
 
 	if (head_req != NULL) {
-		struct drm_i915_gem_object *ctx_obj =
-				head_req->ctx->engine[ring->id].state;
-		if (intel_execlists_ctx_id(ctx_obj) == request_id) {
+		if (intel_execlists_ctx_id(head_req->ctx) == request_id) {
 			WARN(head_req->elsp_submitted == 0,
 			     "Never submitted head request\n");
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 4e60d54..1b08cd2 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -93,7 +93,7 @@ struct i915_execbuffer_params;
 int intel_execlists_submission(struct i915_execbuffer_params *params,
 			       struct drm_i915_gem_execbuffer2 *args,
 			       struct list_head *vmas);
-u32 intel_execlists_ctx_id(struct drm_i915_gem_object *ctx_obj);
+u32 intel_execlists_ctx_id(struct intel_context *ctx);
 
 void intel_lrc_irq_handler(struct intel_engine_cs *ring);
 void intel_execlists_retire_requests(struct intel_engine_cs *ring);
-- 
1.9.1


* [PATCH 04/11] drm/i915: Add ctx getparam ioctl parameter to retrieve ctx global id
From: sourab.gupta @ 2016-02-16  5:27 UTC
  To: intel-gfx; +Cc: Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds a new ctx getparam ioctl parameter, which userspace can
use to retrieve the ctx global_id for any particular ctx.

Userspace can use this to map i915 perf samples to the particular ctx's
they belong to, since those samples carry ctx global_ids. Otherwise
userspace has no way of maintaining this association, since it only knows
the per-drm-file specific ctx handles.
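
A minimal userspace sketch of the lookup this enables (the helper name is
illustrative; the ioctl and struct are the existing context getparam
interface):

#include <stdint.h>
#include <sys/ioctl.h>
#include <i915_drm.h>

static int get_ctx_global_id(int drm_fd, uint32_t ctx_handle,
			     uint64_t *global_id)
{
	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_handle,
		.param = I915_CONTEXT_PARAM_GLOBAL_ID,
	};

	if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_GETPARAM, &p))
		return -1;

	*global_id = p.value;
	return 0;
}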

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 3 +++
 include/uapi/drm/i915_drm.h             | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 3a90e79..18c545c 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -960,6 +960,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_NO_ZEROMAP:
 		args->value = ctx->flags & CONTEXT_NO_ZEROMAP;
 		break;
+	case I915_CONTEXT_PARAM_GLOBAL_ID:
+		args->value = ctx->global_id;
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 4a67895..e1f13b4 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1131,6 +1131,7 @@ struct drm_i915_gem_context_param {
 	__u64 param;
 #define I915_CONTEXT_PARAM_BAN_PERIOD 0x1
 #define I915_CONTEXT_PARAM_NO_ZEROMAP 0x2
+#define I915_CONTEXT_PARAM_GLOBAL_ID 0x3
 	__u64 value;
 };
 
-- 
1.9.1


* [PATCH 05/11] drm/i915: Expose OA sample source to userspace
From: sourab.gupta @ 2016-02-16  5:27 UTC
  To: intel-gfx; +Cc: Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch exposes a new sample source field to userspace. This field can
be populated to specify the origin of the OA report.
For example, for internally triggered reports (non MI_RPC reports), the
RPT_ID field has bitfields specifying the origin, such as a timer or a
render ctx switch.
Likewise, this field can be used to specify the source as MI_RPC when such
support is added.
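
For illustration, a hypothetical userspace reader could pick the field out
of each sample as sketched below. The record layout follows the i915_drm.h
comment updated in this patch; the header struct fields come from Robert's
series and are an assumption here:

#include <stdint.h>
#include <string.h>
#include <i915_drm.h>

static void handle_samples(const uint8_t *buf, size_t len)
{
	while (len >= sizeof(struct drm_i915_perf_record_header)) {
		struct drm_i915_perf_record_header hdr;

		memcpy(&hdr, buf, sizeof(hdr));
		if (hdr.size < sizeof(hdr) || hdr.size > len)
			break;

		if (hdr.type == DRM_I915_PERF_RECORD_SAMPLE) {
			/* source_info comes first when requested at open */
			uint32_t source;

			memcpy(&source, buf + sizeof(hdr), sizeof(source));
			/* e.g. I915_PERF_OA_EVENT_SOURCE_PERIODIC vs
			 * I915_PERF_OA_EVENT_SOURCE_CONTEXT_SWITCH */
		}
		buf += hdr.size;
		len -= hdr.size;
	}
}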

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_perf.c | 50 +++++++++++++++++++++++++++++++++-------
 include/uapi/drm/i915_drm.h      | 16 +++++++++++++
 2 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index c5447b4..06de4b3 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -44,6 +44,13 @@ static u32 i915_perf_stream_paranoid = true;
 
 #define OA_EXPONENT_MAX 0x3f
 
+#define GEN8_OAREPORT_REASON_TIMER          (1<<19)
+#define GEN8_OAREPORT_REASON_TRIGGER1       (1<<20)
+#define GEN8_OAREPORT_REASON_TRIGGER2       (1<<21)
+#define GEN8_OAREPORT_REASON_CTX_SWITCH     (1<<22)
+#define GEN8_OAREPORT_REASON_GO_TRANSITION  (1<<23)
+#define GEN9_OAREPORT_REASON_CLK_RATIO      (1<<24)
+
 /* for sysctl proc_dointvec_minmax of i915_oa_min_timer_exponent */
 static int zero;
 static int oa_exponent_max = OA_EXPONENT_MAX;
@@ -79,7 +86,8 @@ static struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] = {
 	[I915_OA_FORMAT_C4_B8]		    = { 7, 64 },
 };
 
-#define SAMPLE_OA_REPORT      (1<<0)
+#define SAMPLE_OA_REPORT	(1<<0)
+#define SAMPLE_OA_SOURCE_INFO	(1<<1)
 
 struct perf_open_properties
 {
@@ -149,6 +157,27 @@ static bool append_oa_sample(struct i915_perf_stream *stream,
 	copy_to_user(read_state->buf, &header, sizeof(header));
 	read_state->buf += sizeof(header);
 
+	if (sample_flags & SAMPLE_OA_SOURCE_INFO) {
+		enum drm_i915_perf_oa_event_source source;
+
+		if (INTEL_INFO(dev_priv)->gen >= 8) {
+			u32 reason = *(u32 *)report;
+
+			if (reason & GEN8_OAREPORT_REASON_CTX_SWITCH)
+				source =
+				I915_PERF_OA_EVENT_SOURCE_CONTEXT_SWITCH;
+			else if (reason & GEN8_OAREPORT_REASON_TIMER)
+				source = I915_PERF_OA_EVENT_SOURCE_PERIODIC;
+			else
+				source = I915_PERF_OA_EVENT_SOURCE_UNDEFINED;
+		} else
+			source = I915_PERF_OA_EVENT_SOURCE_PERIODIC;
+
+		if (copy_to_user(read_state->buf, &source, 4))
+			return false;
+		read_state->buf += 4;
+	}
+
 	if (sample_flags & SAMPLE_OA_REPORT) {
 		copy_to_user(read_state->buf, report, report_size);
 		read_state->buf += report_size;
@@ -841,11 +870,6 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	int format_size;
 	int ret;
 
-	if (!(props->sample_flags & SAMPLE_OA_REPORT)) {
-		DRM_ERROR("Only OA report sampling supported\n");
-		return -EINVAL;
-	}
-
 	if (!dev_priv->perf.oa.ops.init_oa_buffer) {
 		DRM_ERROR("OA unit not supported\n");
 		return -ENODEV;
@@ -873,8 +897,15 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 
 	format_size = dev_priv->perf.oa.oa_formats[props->oa_format].size;
 
-	stream->sample_flags |= SAMPLE_OA_REPORT;
-	stream->sample_size += format_size;
+	if (props->sample_flags & SAMPLE_OA_REPORT) {
+		stream->sample_flags |= SAMPLE_OA_REPORT;
+		stream->sample_size += format_size;
+	}
+
+	if (props->sample_flags & SAMPLE_OA_SOURCE_INFO) {
+		stream->sample_flags |= SAMPLE_OA_SOURCE_INFO;
+		stream->sample_size += 4;
+	}
 
 	dev_priv->perf.oa.oa_buffer.format_size = format_size;
 	BUG_ON(dev_priv->perf.oa.oa_buffer.format_size == 0);
@@ -1480,6 +1511,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 			props->oa_periodic = true;
 			props->oa_period_exponent = value;
 			break;
+		case DRM_I915_PERF_SAMPLE_OA_SOURCE_PROP:
+			props->sample_flags |= SAMPLE_OA_SOURCE_INFO;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index e1f13b4..6dbaa6d 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1156,6 +1156,14 @@ enum drm_i915_oa_format {
 #define I915_PERF_FLAG_FD_NONBLOCK	(1<<1)
 #define I915_PERF_FLAG_DISABLED		(1<<2)
 
+enum drm_i915_perf_oa_event_source {
+	I915_PERF_OA_EVENT_SOURCE_UNDEFINED,
+	I915_PERF_OA_EVENT_SOURCE_PERIODIC,
+	I915_PERF_OA_EVENT_SOURCE_CONTEXT_SWITCH,
+
+	I915_PERF_OA_EVENT_SOURCE_MAX	/* non-ABI */
+};
+
 enum drm_i915_perf_property_id {
 	/**
 	 * Open the stream for a specific context handle (as used with
@@ -1190,6 +1198,13 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_OA_EXPONENT_PROP,
 
+	/**
+	 * Set this property to 1 to request inclusion of a sample source
+	 * field in the samples given to userspace. The sample source field
+	 * specifies the origin of the OA report.
+	 */
+	DRM_I915_PERF_SAMPLE_OA_SOURCE_PROP,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1237,6 +1252,7 @@ enum drm_i915_perf_record_type {
 	 * struct {
 	 *     struct drm_i915_perf_record_header header;
 	 *
+	 *     { u32 source_info; } && DRM_I915_PERF_SAMPLE_OA_SOURCE_PROP
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_SAMPLE_OA_PROP
 	 * };
 	 */
-- 
1.9.1


* [PATCH 06/11] drm/i915: Framework for capturing command stream based OA reports
From: sourab.gupta @ 2016-02-16  5:27 UTC
  To: intel-gfx; +Cc: Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch introduces a framework to enable OA counter reports associated
with the Render command stream. We can then associate the reports captured
through this mechanism with their corresponding context ids. This can be
further extended to associate any other metadata with the corresponding
samples (since the association with the Render command stream gives us the
ability to capture this information while inserting the corresponding
capture commands into the command stream).

The OA reports generated in this way are associated with a corresponding
workload, and thus can be used to delimit the workload (i.e. sample the
counters at the workload boundaries) within an ongoing stream of periodic
counter snapshots.

There may be usecases which need more than the periodic OA capture mode
that is supported currently. This mode primarily serves two usecases:
    - Ability to capture system wide metrics, along with the ability to map
      the reports back to individual contexts (particularly for HSW).
    - Ability to inject tags for work into the reports. This provides
      visibility into the multiple stages of work within a single context.

Userspace can distinguish between the periodic and CS based OA reports by
virtue of the source_info sample field.

The commands to capture OA reports are inserted at BB boundaries. The
metadata pertaining to each snapshot is maintained in a list, whose nodes
also hold the corresponding gem request, which is tracked for command
completion.

Both periodic and RCS based reports are associated with a single stream
(corresponding to the render engine), and the samples are forwarded to
userspace in sequential order according to their timestamps (see the
sketch below).

v2: Aligning with the non core-perf (custom drm ioctl based) interface.
Also, a few related patches are squashed together for better readability
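
For reference, the timestamp-ordered forwarding mentioned above reduces to
a standard two-way merge; a self-contained sketch, where emit_periodic()
and emit_cs() are hypothetical placeholders for appending a sample to the
read buffer (mirroring oa_rcs_append_reports() below):

extern void emit_periodic(u32 ts);
extern void emit_cs(u32 ts);

/* Periodic reports not newer than the next CS sample are flushed first,
 * then the CS sample itself, and any leftover periodic reports last. */
static void merge_samples(const u32 *periodic_ts, size_t n_periodic,
			  const u32 *cs_ts, size_t n_cs)
{
	size_t i = 0, j = 0;

	while (j < n_cs) {
		while (i < n_periodic && periodic_ts[i] <= cs_ts[j])
			emit_periodic(periodic_ts[i++]);
		emit_cs(cs_ts[j++]);
	}
	while (i < n_periodic)
		emit_periodic(periodic_ts[i++]);
}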

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_drv.h            |  35 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   4 +
 drivers/gpu/drm/i915/i915_perf.c           | 733 +++++++++++++++++++++++++----
 drivers/gpu/drm/i915/intel_lrc.c           |   4 +
 include/uapi/drm/i915_drm.h                |  15 +
 5 files changed, 690 insertions(+), 101 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index c409c8f..44fcbf4 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1730,6 +1730,9 @@ struct i915_perf_stream {
 	struct intel_context *ctx;
 	bool enabled;
 
+	/* Whether command stream based data collection is enabled */
+	bool cs_mode;
+
 	/* Enables the collection of HW samples, either in response to
 	 * I915_PERF_IOCTL_ENABLE or implicitly called when stream is
 	 * opened without I915_PERF_FLAG_DISABLED */
@@ -1773,6 +1776,12 @@ struct i915_perf_stream {
 	 *
 	 * The stream will always be disabled before this is called */
 	void (*destroy)(struct i915_perf_stream *stream);
+
+	/*
+	 * Routine to emit the commands in the command streamer associated
+	 * with the corresponding gpu engine.
+	 */
+	void (*command_stream_hook)(struct drm_i915_gem_request *req);
 };
 
 struct i915_oa_ops {
@@ -1786,10 +1795,21 @@ struct i915_oa_ops {
 					u32 ctx_id);
 	void (*legacy_ctx_switch_unlocked)(struct drm_i915_gem_request *req);
 	void (*read)(struct i915_perf_stream *stream,
-		     struct i915_perf_read_state *read_state);
+		     struct i915_perf_read_state *read_state, u32 ts);
 	bool (*oa_buffer_is_empty)(struct drm_i915_private *dev_priv);
 };
 
+/*
+ * List element to hold info about the perf sample data associated
+ * with a particular GPU command stream.
+ */
+struct i915_perf_cs_data_node {
+	struct list_head link;
+	struct drm_i915_gem_request *request;
+	u32 offset;
+	u32 ctx_id;
+};
+
 struct drm_i915_private {
 	struct drm_device *dev;
 	struct kmem_cache *objects;
@@ -2042,6 +2062,8 @@ struct drm_i915_private {
 		struct ctl_table_header *sysctl_header;
 
 		struct mutex lock;
+
+		struct mutex streams_lock;
 		struct list_head streams;
 
 		spinlock_t hook_lock;
@@ -2084,6 +2106,16 @@ struct drm_i915_private {
 			const struct i915_oa_format *oa_formats;
 			int n_builtin_sets;
 		} oa;
+
+		/* Command stream based perf data buffer */
+		struct {
+			struct drm_i915_gem_object *obj;
+			struct i915_vma *vma;
+			u8 *addr;
+		} command_stream_buf;
+
+		struct list_head node_list;
+		spinlock_t node_list_lock;
 	} perf;
 
 	/* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
@@ -3328,6 +3360,7 @@ void i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
 				struct intel_context *context);
 void i915_oa_legacy_ctx_switch_notify(struct drm_i915_gem_request *req);
 void i915_oa_update_reg_state(struct intel_engine_cs *ring, uint32_t *reg_state);
+void i915_perf_command_stream_hook(struct drm_i915_gem_request *req);
 
 /* i915_gem_evict.c */
 int __must_check i915_gem_evict_something(struct drm_device *dev,
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 6ed7d63a..6860fca 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1258,12 +1258,16 @@ i915_gem_ringbuffer_submission(struct i915_execbuffer_params *params,
 	exec_start = params->batch_obj_vm_offset +
 		     params->args_batch_start_offset;
 
+	i915_perf_command_stream_hook(params->request);
+
 	ret = ring->dispatch_execbuffer(params->request,
 					exec_start, exec_len,
 					params->dispatch_flags);
 	if (ret)
 		return ret;
 
+	i915_perf_command_stream_hook(params->request);
+
 	trace_i915_gem_ring_dispatch(params->request, params->dispatch_flags);
 
 	i915_gem_execbuffer_move_to_active(vmas, params->request);
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 06de4b3..2904745 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -51,6 +51,13 @@ static u32 i915_perf_stream_paranoid = true;
 #define GEN8_OAREPORT_REASON_GO_TRANSITION  (1<<23)
 #define GEN9_OAREPORT_REASON_CLK_RATIO      (1<<24)
 
+/* Data common to periodic and RCS based samples */
+struct oa_sample_data {
+	u32 source;
+	u32 ctx_id;
+	const u8 *report;
+};
+
 /* for sysctl proc_dointvec_minmax of i915_oa_min_timer_exponent */
 static int zero;
 static int oa_exponent_max = OA_EXPONENT_MAX;
@@ -88,6 +95,7 @@ static struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] = {
 
 #define SAMPLE_OA_REPORT	(1<<0)
 #define SAMPLE_OA_SOURCE_INFO	(1<<1)
+#define SAMPLE_CTX_ID		(1<<2)
 
 struct perf_open_properties
 {
@@ -101,8 +109,236 @@ struct perf_open_properties
 	int oa_format;
 	bool oa_periodic;
 	u32 oa_period_exponent;
+
+	/* Command stream mode */
+	bool cs_mode;
+	enum intel_ring_id ring_id;
 };
 
+/*
+ * Emit the commands to capture metrics into the command stream. This function
+ * can be called concurrently with the stream operations and doesn't require
+ * perf mutex lock.
+ */
+
+void i915_perf_command_stream_hook(struct drm_i915_gem_request *req)
+{
+	struct intel_engine_cs *ring = req->ring;
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct i915_perf_stream *stream;
+
+	if (!dev_priv->perf.initialized)
+		return;
+
+	mutex_lock(&dev_priv->perf.streams_lock);
+	list_for_each_entry(stream, &dev_priv->perf.streams, link) {
+		if (stream->enabled && stream->command_stream_hook)
+			stream->command_stream_hook(req);
+	}
+	mutex_unlock(&dev_priv->perf.streams_lock);
+}
+
+/*
+ * Release some perf entries to make space for new entry data. We unreference
+ * the associated request before deleting an entry. There is also no need to
+ * check for gpu completion of the commands, since these entries are going to
+ * be replaced by a new entry anyway, and the gpu will eventually overwrite
+ * the buffer contents when the request associated with the new entry
+ * completes.
+ */
+static void release_some_perf_entries(struct drm_i915_private *dev_priv,
+					u32 target_size)
+{
+	struct i915_perf_cs_data_node *entry, *next;
+	u32 entry_size = dev_priv->perf.oa.oa_buffer.format_size;
+	u32 size = 0;
+
+	list_for_each_entry_safe
+		(entry, next, &dev_priv->perf.node_list, link) {
+
+		size += entry_size;
+		i915_gem_request_unreference(entry->request);
+		list_del(&entry->link);
+		kfree(entry);
+
+		if (size >= target_size)
+			break;
+	}
+}
+
+/*
+ * Insert the perf entry at the end of the list. This function never fails,
+ * since it always manages to insert the entry. If the space is exhausted in
+ * the buffer, it will remove the oldest entries in order to make space.
+ */
+static void insert_perf_entry(struct drm_i915_private *dev_priv,
+				struct i915_perf_cs_data_node *entry)
+{
+	struct i915_perf_cs_data_node *first_entry, *last_entry;
+	int max_offset = dev_priv->perf.command_stream_buf.obj->base.size;
+	u32 entry_size = dev_priv->perf.oa.oa_buffer.format_size;
+
+	spin_lock(&dev_priv->perf.node_list_lock);
+	if (list_empty(&dev_priv->perf.node_list)) {
+		entry->offset = 0;
+		list_add_tail(&entry->link, &dev_priv->perf.node_list);
+		spin_unlock(&dev_priv->perf.node_list_lock);
+		return;
+	}
+
+	first_entry = list_first_entry(&dev_priv->perf.node_list,
+				       typeof(*first_entry), link);
+	last_entry = list_last_entry(&dev_priv->perf.node_list,
+				     typeof(*first_entry), link);
+
+	if (last_entry->offset >= first_entry->offset) {
+		/* Sufficient space available at the end of buffer? */
+		if (last_entry->offset + 2*entry_size < max_offset)
+			entry->offset = last_entry->offset + entry_size;
+		/*
+		 * Wraparound condition. Is sufficient space available at
+		 * beginning of buffer?
+		 */
+		else if (entry_size < first_entry->offset)
+			entry->offset = 0;
+		/* Insufficient space. Overwrite existing old entries */
+		else {
+			u32 target_size = entry_size - first_entry->offset;
+
+			release_some_perf_entries(dev_priv, target_size);
+			entry->offset = 0;
+		}
+	} else {
+		/* Sufficient space available? */
+		if (last_entry->offset + 2*entry_size < first_entry->offset)
+			entry->offset = last_entry->offset + entry_size;
+		/* Insufficient space. Overwrite existing old entries */
+		else {
+			u32 target_size = entry_size -
+				(first_entry->offset - last_entry->offset -
+				entry_size);
+
+			release_some_perf_entries(dev_priv, target_size);
+			entry->offset = last_entry->offset + entry_size;
+		}
+	}
+	list_add_tail(&entry->link, &dev_priv->perf.node_list);
+	spin_unlock(&dev_priv->perf.node_list_lock);
+}
+
+static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req)
+{
+	struct intel_engine_cs *ring = req->ring;
+	struct intel_ringbuffer *ringbuf = req->ringbuf;
+	struct intel_context *ctx = req->ctx;
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct i915_perf_cs_data_node *entry;
+	u32 addr = 0;
+	int ret;
+
+	/* OA counters are only supported on the render ring */
+	BUG_ON(ring->id != RCS);
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (entry == NULL) {
+		DRM_ERROR("alloc failed\n");
+		return;
+	}
+
+	if (i915.enable_execlists)
+		ret = intel_logical_ring_begin(req, 4);
+	else
+		ret = intel_ring_begin(req, 4);
+
+	if (ret) {
+		kfree(entry);
+		return;
+	}
+
+	entry->ctx_id = ctx->global_id;
+	i915_gem_request_assign(&entry->request, req);
+
+	insert_perf_entry(dev_priv, entry);
+
+	addr = dev_priv->perf.command_stream_buf.vma->node.start +
+		entry->offset;
+
+	/* addr should be 64 byte aligned */
+	BUG_ON(addr & 0x3f);
+
+	if (i915.enable_execlists) {
+		intel_logical_ring_emit(ringbuf, MI_REPORT_PERF_COUNT | (2<<0));
+		intel_logical_ring_emit(ringbuf,
+					addr | MI_REPORT_PERF_COUNT_GGTT);
+		intel_logical_ring_emit(ringbuf, 0);
+		intel_logical_ring_emit(ringbuf,
+			i915_gem_request_get_seqno(req));
+		intel_logical_ring_advance(ringbuf);
+	} else {
+		if (INTEL_INFO(ring->dev)->gen >= 8) {
+			intel_ring_emit(ring, MI_REPORT_PERF_COUNT | (2<<0));
+			intel_ring_emit(ring, addr | MI_REPORT_PERF_COUNT_GGTT);
+			intel_ring_emit(ring, 0);
+			intel_ring_emit(ring,
+				i915_gem_request_get_seqno(req));
+		} else {
+			intel_ring_emit(ring, MI_REPORT_PERF_COUNT | (1<<0));
+			intel_ring_emit(ring, addr | MI_REPORT_PERF_COUNT_GGTT);
+			intel_ring_emit(ring, i915_gem_request_get_seqno(req));
+			intel_ring_emit(ring, MI_NOOP);
+		}
+		intel_ring_advance(ring);
+	}
+	i915_vma_move_to_active(dev_priv->perf.command_stream_buf.vma, req);
+}
+
+static int i915_oa_rcs_wait_gpu(struct drm_i915_private *dev_priv)
+{
+	struct i915_perf_cs_data_node *last_entry = NULL;
+	struct drm_i915_gem_request *req = NULL;
+	int ret;
+
+	/*
+	 * Wait for the last scheduled request to complete. This would
+	 * implicitly wait for the prior submitted requests. The refcount
+	 * of the requests is not decremented here.
+	 */
+	spin_lock(&dev_priv->perf.node_list_lock);
+
+	if (!list_empty(&dev_priv->perf.node_list)) {
+		last_entry = list_last_entry(&dev_priv->perf.node_list,
+			struct i915_perf_cs_data_node, link);
+		req = last_entry->request;
+	}
+	spin_unlock(&dev_priv->perf.node_list_lock);
+
+	if (!req)
+		return 0;
+
+	ret = __i915_wait_request(req, atomic_read(
+				&dev_priv->gpu_error.reset_counter),
+				true, NULL, NULL);
+	if (ret) {
+		DRM_ERROR("failed to wait\n");
+		return ret;
+	}
+	return 0;
+}
+
+static void i915_oa_rcs_free_requests(struct drm_i915_private *dev_priv)
+{
+	struct i915_perf_cs_data_node *entry, *next;
+
+	list_for_each_entry_safe
+		(entry, next, &dev_priv->perf.node_list, link) {
+		i915_gem_request_unreference__unlocked(entry->request);
+
+		spin_lock(&dev_priv->perf.node_list_lock);
+		list_del(&entry->link);
+		spin_unlock(&dev_priv->perf.node_list_lock);
+		kfree(entry);
+	}
+}
+
 static bool gen8_oa_buffer_is_empty(struct drm_i915_private *dev_priv)
 {
 	u32 head = I915_READ(GEN8_OAHEADPTR);
@@ -140,7 +376,7 @@ static bool append_oa_status(struct i915_perf_stream *stream,
 
 static bool append_oa_sample(struct i915_perf_stream *stream,
 			     struct i915_perf_read_state *read_state,
-			     const u8 *report)
+			     struct oa_sample_data *data)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
@@ -158,6 +394,37 @@ static bool append_oa_sample(struct i915_perf_stream *stream,
 	read_state->buf += sizeof(header);
 
 	if (sample_flags & SAMPLE_OA_SOURCE_INFO) {
+		if (copy_to_user(read_state->buf, &data->source, 4))
+			return false;
+		read_state->buf += 4;
+	}
+
+	if (sample_flags & SAMPLE_CTX_ID) {
+		if (copy_to_user(read_state->buf, &data->ctx_id, 4))
+			return false;
+		read_state->buf += 4;
+	}
+
+	if (sample_flags & SAMPLE_OA_REPORT) {
+		if (copy_to_user(read_state->buf, data->report, report_size))
+			return false;
+		read_state->buf += report_size;
+	}
+
+	read_state->read += header.size;
+
+	return true;
+}
+
+static bool append_oa_buffer_sample(struct i915_perf_stream *stream,
+				    struct i915_perf_read_state *read_state,
+				    const u8 *report)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	u32 sample_flags = stream->sample_flags;
+	struct oa_sample_data data = { 0 };
+
+	if (sample_flags & SAMPLE_OA_SOURCE_INFO) {
 		enum drm_i915_perf_oa_event_source source;
 
 		if (INTEL_INFO(dev_priv)->gen >= 8) {
@@ -173,17 +440,16 @@ static bool append_oa_sample(struct i915_perf_stream *stream,
 		} else
 			source = I915_PERF_OA_EVENT_SOURCE_PERIODIC;
 
-		if (copy_to_user(read_state->buf, &source, 4))
-			return false;
-		read_state->buf += 4;
+		data.source = source;
 	}
+#warning "FIXME: append_oa_buffer_sample: read ctx ID from report and map that to an intel_context::global_id"
+	if (sample_flags & SAMPLE_CTX_ID)
+		data.ctx_id = 0;
 
-	if (sample_flags & SAMPLE_OA_REPORT) {
-		copy_to_user(read_state->buf, report, report_size);
-		read_state->buf += report_size;
-	}
+	if (sample_flags & SAMPLE_OA_REPORT)
+		data.report = report;
 
-	read_state->read += header.size;
+	append_oa_sample(stream, read_state, &data);
 
 	return true;
 }
@@ -191,14 +457,14 @@ static bool append_oa_sample(struct i915_perf_stream *stream,
 static u32 gen8_append_oa_reports(struct i915_perf_stream *stream,
 				  struct i915_perf_read_state *read_state,
 				  u32 head,
-				  u32 tail)
+				  u32 tail, u32 ts)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
 	u8 *oa_buf_base = dev_priv->perf.oa.oa_buffer.addr;
 	u32 mask = (OA_BUFFER_SIZE - 1);
 	u8 *report;
-	u32 taken;
+	u32 report_ts, taken;
 
 	head -= dev_priv->perf.oa.oa_buffer.gtt_offset;
 	tail -= dev_priv->perf.oa.oa_buffer.gtt_offset;
@@ -223,6 +489,10 @@ static u32 gen8_append_oa_reports(struct i915_perf_stream *stream,
 
 		report = oa_buf_base + (head & mask);
 
+		report_ts = *(u32 *)(report + 4);
+		if (report_ts > ts)
+			break;
+
 		ctx_id = *(u32 *)(report + 12);
 		if (i915.enable_execlists) {
 			/* XXX: Just keep the lower 20 bits for now since I'm
@@ -247,13 +517,16 @@ static u32 gen8_append_oa_reports(struct i915_perf_stream *stream,
 			     (dev_priv->perf.oa.specific_ctx_id !=
 			      dev_priv->perf.oa.oa_buffer.last_ctx_id))) {
 
-				if (!append_oa_sample(stream, read_state, report))
+				if (!append_oa_buffer_sample(stream, read_state,
+								report))
 					break;
 			}
 		}
 
-		/* If append_oa_sample() returns false we shouldn't progress
-		 * head so we update it afterwards... */
+		/*
+		 * If append_oa_buffer_sample() returns false we shouldn't
+		 * progress head so we update it afterwards.
+		 */
 		dev_priv->perf.oa.oa_buffer.last_ctx_id = ctx_id;
 		head += report_size;
 	}
@@ -262,7 +535,7 @@ static u32 gen8_append_oa_reports(struct i915_perf_stream *stream,
 }
 
 static void gen8_oa_read(struct i915_perf_stream *stream,
-			 struct i915_perf_read_state *read_state)
+			 struct i915_perf_read_state *read_state, u32 ts)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	u32 oastatus;
@@ -293,7 +566,7 @@ static void gen8_oa_read(struct i915_perf_stream *stream,
 		I915_WRITE(GEN8_OASTATUS, oastatus);
 	}
 
-	head = gen8_append_oa_reports(stream, read_state, head, tail);
+	head = gen8_append_oa_reports(stream, read_state, head, tail, ts);
 
 	I915_WRITE(GEN8_OAHEADPTR, head);
 }
@@ -301,14 +574,14 @@ static void gen8_oa_read(struct i915_perf_stream *stream,
 static u32 gen7_append_oa_reports(struct i915_perf_stream *stream,
 				  struct i915_perf_read_state *read_state,
 				  u32 head,
-				  u32 tail)
+				  u32 tail, u32 ts)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
 	u8 *oa_buf_base = dev_priv->perf.oa.oa_buffer.addr;
 	u32 mask = (OA_BUFFER_SIZE - 1);
 	u8 *report;
-	u32 taken;
+	u32 report_ts, taken;
 
 	head -= dev_priv->perf.oa.oa_buffer.gtt_offset;
 	tail -= dev_priv->perf.oa.oa_buffer.gtt_offset;
@@ -326,13 +599,20 @@ static u32 gen7_append_oa_reports(struct i915_perf_stream *stream,
 
 		report = oa_buf_base + (head & mask);
 
+		report_ts = *(u32 *)(report + 4);
+		if (report_ts > ts)
+			break;
+
 		if (dev_priv->perf.oa.exclusive_stream->enabled) {
-			if (!append_oa_sample(stream, read_state, report))
+			if (!append_oa_buffer_sample(stream, read_state,
+							report))
 				break;
 		}
 
-		/* If append_oa_sample() returns false we shouldn't progress
-		 * head so we update it afterwards... */
+		/*
+		 * If append_oa_buffer_sample() returns false we shouldn't
+		 * progress head so we update it afterwards.
+		 */
 		head += report_size;
 	}
 
@@ -340,7 +620,7 @@ static u32 gen7_append_oa_reports(struct i915_perf_stream *stream,
 }
 
 static void gen7_oa_read(struct i915_perf_stream *stream,
-			 struct i915_perf_read_state *read_state)
+			 struct i915_perf_read_state *read_state, u32 ts)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	u32 oastatus2;
@@ -374,20 +654,82 @@ static void gen7_oa_read(struct i915_perf_stream *stream,
 		I915_WRITE(GEN7_OASTATUS1, oastatus1);
 	}
 
-	head = gen7_append_oa_reports(stream, read_state, head, tail);
+	head = gen7_append_oa_reports(stream, read_state, head, tail, ts);
 
 	I915_WRITE(GEN7_OASTATUS2, (head & GEN7_OASTATUS2_HEAD_MASK) |
 				    OA_MEM_SELECT_GGTT);
 }
 
-static bool i915_oa_can_read(struct i915_perf_stream *stream)
+static bool append_oa_rcs_sample(struct i915_perf_stream *stream,
+				 struct i915_perf_read_state *read_state,
+				 struct i915_perf_cs_data_node *node)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	struct oa_sample_data data = { 0 };
+	const u8 *report = dev_priv->perf.command_stream_buf.addr +
+				node->offset;
+	u32 sample_flags = stream->sample_flags;
+	u32 report_ts;
+
+	/*
+	 * Forward the periodic OA samples which have a timestamp lower
+	 * than the timestamp of this sample, before forwarding this sample.
+	 * This ensures the samples read by the user are ordered according
+	 * to their timestamps.
+	 */
+	report_ts = *(u32 *)(report + 4);
+	dev_priv->perf.oa.ops.read(stream, read_state, report_ts);
+
+	if (sample_flags & SAMPLE_OA_SOURCE_INFO)
+		data.source = I915_PERF_OA_EVENT_SOURCE_RCS;
+
+	if (sample_flags & SAMPLE_CTX_ID)
+		data.ctx_id = node->ctx_id;
+
+	if (sample_flags & SAMPLE_OA_REPORT)
+		data.report = report;
+
+	append_oa_sample(stream, read_state, &data);
+
+	return true;
+}
+
+static void oa_rcs_append_reports(struct i915_perf_stream *stream,
+				  struct i915_perf_read_state *read_state)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	struct i915_perf_cs_data_node *entry, *next;
+
+	list_for_each_entry_safe(entry, next,
+				 &dev_priv->perf.node_list, link) {
+		if (!i915_gem_request_completed(entry->request, true))
+			break;
+
+		if (!append_oa_rcs_sample(stream, read_state, entry))
+			break;
+
+		spin_lock(&dev_priv->perf.node_list_lock);
+		list_del(&entry->link);
+		spin_unlock(&dev_priv->perf.node_list_lock);
+
+		i915_gem_request_unreference__unlocked(entry->request);
+		kfree(entry);
+	}
+
+	/* Flush any remaining periodic reports */
+	dev_priv->perf.oa.ops.read(stream, read_state, U32_MAX);
+}
+
+static bool command_stream_buf_is_empty(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	return !dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv);
+	if (stream->cs_mode)
+		return list_empty(&dev_priv->perf.node_list);
+	else
+		return true;
 }
 
-static int i915_oa_wait_unlocked(struct i915_perf_stream *stream)
+static bool stream_have_data__unlocked(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
@@ -397,8 +739,29 @@ static int i915_oa_wait_unlocked(struct i915_perf_stream *stream)
 	 * can't be destroyed until completion (such as a read()) that ensures
 	 * the device + OA buffer can't disappear
 	 */
+	return !(dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv) &&
+		 command_stream_buf_is_empty(stream));
+}
+
+static bool i915_oa_can_read(struct i915_perf_stream *stream)
+{
+
+	return stream_have_data__unlocked(stream);
+}
+
+static int i915_oa_wait_unlocked(struct i915_perf_stream *stream)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	int ret;
+
+	if (stream->cs_mode) {
+		ret = i915_oa_rcs_wait_gpu(dev_priv);
+		if (ret)
+			return ret;
+	}
+
 	return wait_event_interruptible(dev_priv->perf.oa.poll_wq,
-					!dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv));
+					stream_have_data__unlocked(stream));
 }
 
 static void i915_oa_poll_wait(struct i915_perf_stream *stream,
@@ -415,7 +778,26 @@ static void i915_oa_read(struct i915_perf_stream *stream,
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	dev_priv->perf.oa.ops.read(stream, read_state);
+	if (stream->cs_mode)
+		oa_rcs_append_reports(stream, read_state);
+	else
+		dev_priv->perf.oa.ops.read(stream, read_state, U32_MAX);
+}
+
+static void
+free_command_stream_buf(struct drm_i915_private *i915)
+{
+	mutex_lock(&i915->dev->struct_mutex);
+
+	vunmap(i915->perf.command_stream_buf.addr);
+	i915_gem_object_ggtt_unpin(i915->perf.command_stream_buf.obj);
+	drm_gem_object_unreference(&i915->perf.command_stream_buf.obj->base);
+
+	i915->perf.command_stream_buf.obj = NULL;
+	i915->perf.command_stream_buf.vma = NULL;
+	i915->perf.command_stream_buf.addr = NULL;
+
+	mutex_unlock(&i915->dev->struct_mutex);
 }
 
 static void
@@ -440,12 +822,17 @@ static void i915_oa_stream_destroy(struct i915_perf_stream *stream)
 
 	BUG_ON(stream != dev_priv->perf.oa.exclusive_stream);
 
-	dev_priv->perf.oa.ops.disable_metric_set(dev_priv);
+	if (stream->cs_mode)
+		free_command_stream_buf(dev_priv);
 
-	free_oa_buffer(dev_priv);
+	if (dev_priv->perf.oa.oa_buffer.obj) {
+		dev_priv->perf.oa.ops.disable_metric_set(dev_priv);
 
-	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
-	intel_runtime_pm_put(dev_priv);
+		free_oa_buffer(dev_priv);
+
+		intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+		intel_runtime_pm_put(dev_priv);
+	}
 
 	dev_priv->perf.oa.exclusive_stream = NULL;
 }
@@ -509,16 +896,17 @@ static void gen8_init_oa_buffer(struct drm_i915_private *dev_priv)
 		   dev_priv->perf.oa.oa_buffer.gtt_offset);
 }
 
-static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
+static int alloc_obj(struct drm_i915_private *dev_priv,
+				struct drm_i915_gem_object **obj)
 {
 	struct drm_i915_gem_object *bo;
 	int ret;
 
-	BUG_ON(dev_priv->perf.oa.oa_buffer.obj);
+	intel_runtime_pm_get(dev_priv);
 
 	ret = i915_mutex_lock_interruptible(dev_priv->dev);
 	if (ret)
-		return ret;
+		goto out;
 
 	bo = i915_gem_alloc_object(dev_priv->dev, OA_BUFFER_SIZE);
 	if (bo == NULL) {
@@ -526,8 +914,6 @@ static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
 		ret = -ENOMEM;
 		goto unlock;
 	}
-	dev_priv->perf.oa.oa_buffer.obj = bo;
-
 	ret = i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
 	if (ret)
 		goto err_unref;
@@ -537,6 +923,31 @@ static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
 	if (ret)
 		goto err_unref;
 
+	*obj = bo;
+	goto unlock;
+
+err_unref:
+	drm_gem_object_unreference(&bo->base);
+unlock:
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+out:
+	intel_runtime_pm_put(dev_priv);
+	return ret;
+}
+
+static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
+{
+	struct drm_i915_gem_object *bo;
+	int ret;
+
+	BUG_ON(dev_priv->perf.oa.oa_buffer.obj);
+
+	ret = alloc_obj(dev_priv, &bo);
+	if (ret)
+		return ret;
+
+	dev_priv->perf.oa.oa_buffer.obj = bo;
+
 	dev_priv->perf.oa.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo);
 	dev_priv->perf.oa.oa_buffer.addr = vmap_oa_buffer(bo);
 
@@ -546,14 +957,32 @@ static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
 			 dev_priv->perf.oa.oa_buffer.gtt_offset,
 			 dev_priv->perf.oa.oa_buffer.addr);
 
-	goto unlock;
+	return 0;
+}
 
-err_unref:
-	drm_gem_object_unreference(&bo->base);
+static int alloc_command_stream_buf(struct drm_i915_private *dev_priv)
+{
+	struct drm_i915_gem_object *bo;
+	int ret;
 
-unlock:
-	mutex_unlock(&dev_priv->dev->struct_mutex);
-	return ret;
+	BUG_ON(dev_priv->perf.command_stream_buf.obj);
+
+	ret = alloc_obj(dev_priv, &bo);
+	if (ret)
+		return ret;
+
+	dev_priv->perf.command_stream_buf.obj = bo;
+	dev_priv->perf.command_stream_buf.vma = i915_gem_obj_to_ggtt(bo);
+	dev_priv->perf.command_stream_buf.addr = vmap_oa_buffer(bo);
+	INIT_LIST_HEAD(&dev_priv->perf.node_list);
+
+	DRM_DEBUG_DRIVER(
+		"command stream buf initialized, gtt offset = 0x%x, vaddr = %p",
+		 (unsigned int)
+		 dev_priv->perf.command_stream_buf.vma->node.start,
+		 dev_priv->perf.command_stream_buf.addr);
+
+	return 0;
 }
 
 static void config_oa_regs(struct drm_i915_private *dev_priv,
@@ -836,6 +1265,9 @@ static void i915_oa_stream_enable(struct i915_perf_stream *stream)
 
 	dev_priv->perf.oa.ops.oa_enable(dev_priv);
 
+	if (stream->cs_mode)
+		stream->command_stream_hook = i915_perf_command_stream_hook_oa;
+
 	if (dev_priv->perf.oa.periodic)
 		hrtimer_start(&dev_priv->perf.oa.poll_check_timer,
 			      ns_to_ktime(POLL_PERIOD),
@@ -856,10 +1288,16 @@ static void i915_oa_stream_disable(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	dev_priv->perf.oa.ops.oa_disable(dev_priv);
-
 	if (dev_priv->perf.oa.periodic)
 		hrtimer_cancel(&dev_priv->perf.oa.poll_check_timer);
+
+	if (stream->cs_mode) {
+		stream->command_stream_hook = NULL;
+		i915_oa_rcs_wait_gpu(dev_priv);
+		i915_oa_rcs_free_requests(dev_priv);
+	}
+
+	dev_priv->perf.oa.ops.oa_disable(dev_priv);
 }
 
 static int i915_oa_stream_init(struct i915_perf_stream *stream,
@@ -867,14 +1305,11 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 			       struct perf_open_properties *props)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
+	bool require_oa_unit = props->sample_flags & (SAMPLE_OA_REPORT |
+						      SAMPLE_OA_SOURCE_INFO);
+	bool require_cs_mode = false;
 	int format_size;
 	int ret;
 
-	if (!dev_priv->perf.oa.ops.init_oa_buffer) {
-		DRM_ERROR("OA unit not supported\n");
-		return -ENODEV;
-	}
-
 	/* To avoid the complexity of having to accurately filter
 	 * counter reports and marshal to the appropriate client
 	 * we currently only allow exclusive access */
@@ -883,68 +1318,120 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 		return -EBUSY;
 	}
 
-	if (!props->metrics_set) {
-		DRM_ERROR("OA metric set not specified\n");
-		return -EINVAL;
-	}
-
-	if (!props->oa_format) {
-		DRM_ERROR("OA report format not specified\n");
+	/* Ctx Id can be sampled in HSW only through command streamer mode */
+	if (IS_HASWELL(dev_priv->dev) &&
+	    (props->sample_flags & SAMPLE_CTX_ID) && !props->cs_mode) {
+		DRM_ERROR(
+			"Context ID sampling only supported via command stream on Haswell");
 		return -EINVAL;
 	}
 
 	stream->sample_size = sizeof(struct drm_i915_perf_record_header);
 
-	format_size = dev_priv->perf.oa.oa_formats[props->oa_format].size;
+	if (require_oa_unit) {
+		if (!dev_priv->perf.oa.ops.init_oa_buffer) {
+			DRM_ERROR("OA unit not supported\n");
+			return -ENODEV;
+		}
 
-	if (props->sample_flags & SAMPLE_OA_REPORT) {
-		stream->sample_flags |= SAMPLE_OA_REPORT;
-		stream->sample_size += format_size;
-	}
+		if (!props->metrics_set) {
+			DRM_ERROR("OA metric set not specified\n");
+			return -EINVAL;
+		}
 
-	if (props->sample_flags & SAMPLE_OA_SOURCE_INFO) {
-		stream->sample_flags |= SAMPLE_OA_SOURCE_INFO;
-		stream->sample_size += 4;
-	}
+		if (props->cs_mode && (props->ring_id != RCS)) {
+			DRM_ERROR(
+				"Command stream OA metrics only available via Render CS\n");
+			return -EINVAL;
+		}
+		stream->ring_id = RCS;
 
-	dev_priv->perf.oa.oa_buffer.format_size = format_size;
-	BUG_ON(dev_priv->perf.oa.oa_buffer.format_size == 0);
+		if (!props->oa_format) {
+			DRM_ERROR("OA report format not specified\n");
+			return -EINVAL;
+		}
 
-	dev_priv->perf.oa.oa_buffer.format =
-		dev_priv->perf.oa.oa_formats[props->oa_format].format;
+		format_size =
+			dev_priv->perf.oa.oa_formats[props->oa_format].size;
 
-	dev_priv->perf.oa.metrics_set = props->metrics_set;
+		if (props->sample_flags & SAMPLE_OA_REPORT) {
+			stream->sample_flags |= SAMPLE_OA_REPORT;
+			stream->sample_size += format_size;
+		}
 
-	dev_priv->perf.oa.periodic = props->oa_periodic;
-	if (dev_priv->perf.oa.periodic)
-		dev_priv->perf.oa.period_exponent = props->oa_period_exponent;
+		if (props->sample_flags & SAMPLE_OA_SOURCE_INFO) {
+			stream->sample_flags |= SAMPLE_OA_SOURCE_INFO;
+			stream->sample_size += 4;
+		}
 
-	if (i915.enable_execlists && stream->ctx)
-		dev_priv->perf.oa.specific_ctx_id =
-			intel_execlists_ctx_id(stream->ctx);
+		dev_priv->perf.oa.oa_buffer.format_size = format_size;
+		BUG_ON(dev_priv->perf.oa.oa_buffer.format_size == 0);
 
-	ret = alloc_oa_buffer(dev_priv);
-	if (ret)
-		return ret;
+		dev_priv->perf.oa.oa_buffer.format =
+			dev_priv->perf.oa.oa_formats[props->oa_format].format;
 
-	dev_priv->perf.oa.exclusive_stream = stream;
+		dev_priv->perf.oa.metrics_set = props->metrics_set;
 
-	/* PRM - observability performance counters:
-	 *
-	 *   OACONTROL, performance counter enable, note:
-	 *
-	 *   "When this bit is set, in order to have coherent counts,
-	 *   RC6 power state and trunk clock gating must be disabled.
-	 *   This can be achieved by programming MMIO registers as
-	 *   0xA094=0 and 0xA090[31]=1"
-	 *
-	 *   In our case we are expected that taking pm + FORCEWAKE
-	 *   references will effectively disable RC6.
-	 */
-	intel_runtime_pm_get(dev_priv);
-	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+		dev_priv->perf.oa.periodic = props->oa_periodic;
+		if (dev_priv->perf.oa.periodic)
+			dev_priv->perf.oa.period_exponent =
+					props->oa_period_exponent;
+
+		if (i915.enable_execlists && stream->ctx)
+			dev_priv->perf.oa.specific_ctx_id =
+				intel_execlists_ctx_id(stream->ctx);
 
-	dev_priv->perf.oa.ops.enable_metric_set(dev_priv);
+		ret = alloc_oa_buffer(dev_priv);
+		if (ret)
+			return ret;
+
+		/* PRM - observability performance counters:
+		 *
+		 *   OACONTROL, performance counter enable, note:
+		 *
+		 *   "When this bit is set, in order to have coherent counts,
+		 *   RC6 power state and trunk clock gating must be disabled.
+		 *   This can be achieved by programming MMIO registers as
+		 *   0xA094=0 and 0xA090[31]=1"
+		 *
+		 *   In our case we expect that taking pm + FORCEWAKE
+		 *   references will effectively disable RC6.
+		 */
+		intel_runtime_pm_get(dev_priv);
+		intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+		dev_priv->perf.oa.ops.enable_metric_set(dev_priv);
+	}
+
+	if (props->sample_flags & SAMPLE_CTX_ID) {
+		stream->sample_flags |= SAMPLE_CTX_ID;
+		stream->sample_size += 4;
+
+		/*
+		 * NB: it's meaningful to request SAMPLE_CTX_ID with just CS
+		 * mode or periodic OA mode sampling but we don't allow
+		 * SAMPLE_CTX_ID without either mode
+		 */
+		if (!require_oa_unit)
+			require_cs_mode = true;
+	}
+
+	if (props->cs_mode) {
+		if (!(props->sample_flags & SAMPLE_CTX_ID)) {
+			DRM_ERROR(
+				"Ring given without requesting any CS data to sample");
+			ret = -EINVAL;
+			goto cs_error;
+		}
+
+		stream->cs_mode = true;
+
+		ret = alloc_command_stream_buf(dev_priv);
+		if (ret)
+			goto cs_error;
+	}
+
+	dev_priv->perf.oa.exclusive_stream = stream;
 
 	stream->destroy = i915_oa_stream_destroy;
 	stream->enable = i915_oa_stream_enable;
@@ -955,6 +1442,17 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	stream->read = i915_oa_read;
 
 	return 0;
+
+cs_error:
+	if (require_oa_unit) {
+		dev_priv->perf.oa.ops.disable_metric_set(dev_priv);
+
+		intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+		intel_runtime_pm_put(dev_priv);
+
+		free_oa_buffer(dev_priv);
+	}
+	return ret;
 }
 
 static void gen7_update_hw_ctx_id_locked(struct drm_i915_private *dev_priv,
@@ -1170,12 +1668,21 @@ static ssize_t i915_perf_read(struct file *file,
 
 static enum hrtimer_restart poll_check_timer_cb(struct hrtimer *hrtimer)
 {
+	struct i915_perf_stream *stream;
+
 	struct drm_i915_private *dev_priv =
 		container_of(hrtimer, typeof(*dev_priv),
 			     perf.oa.poll_check_timer);
 
-	if (!dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv))
-		wake_up(&dev_priv->perf.oa.poll_wq);
+	/* No need to protect the streams list here, since the hrtimer is
+	 * disabled before the stream is removed from the list, and currently
+	 * a single exclusive_stream is supported.
+	 * XXX: revisit this when multiple concurrent streams are supported.
+	 */
+	list_for_each_entry(stream, &dev_priv->perf.streams, link) {
+		if (stream_have_data__unlocked(stream))
+			wake_up(&dev_priv->perf.oa.poll_wq);
+	}
 
 	hrtimer_forward_now(hrtimer, ns_to_ktime(POLL_PERIOD));
 
@@ -1268,14 +1775,16 @@ static void i915_perf_destroy_locked(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
+	mutex_lock(&dev_priv->perf.streams_lock);
+	list_del(&stream->link);
+	mutex_unlock(&dev_priv->perf.streams_lock);
+
 	if (stream->enabled)
 		i915_perf_disable_locked(stream);
 
 	if (stream->destroy)
 		stream->destroy(stream);
 
-	list_del(&stream->link);
-
 	if (stream->ctx) {
 		mutex_lock(&dev_priv->dev->struct_mutex);
 		i915_gem_context_unreference(stream->ctx);
@@ -1390,7 +1899,9 @@ int i915_perf_open_ioctl_locked(struct drm_device *dev,
 	 * thoroughly, but still this is the expected result at this point. */
 	BUG_ON(stream->sample_flags != props->sample_flags);
 
+	mutex_lock(&dev_priv->perf.streams_lock);
 	list_add(&stream->link, &dev_priv->perf.streams);
+	mutex_unlock(&dev_priv->perf.streams_lock);
 
 	if (param->flags & I915_PERF_FLAG_FD_CLOEXEC)
 		f_flags |= O_CLOEXEC;
@@ -1514,6 +2025,26 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_SAMPLE_OA_SOURCE_PROP:
 			props->sample_flags |= SAMPLE_OA_SOURCE_INFO;
 			break;
+		case DRM_I915_PERF_RING_PROP: {
+				u8 ring_id =
+					(value & I915_EXEC_RING_MASK) - 1;
+				if (ring_id >= LAST_USER_RING)
+					return -EINVAL;
+
+				/* XXX: Currently only RCS is supported.
+				 * Remove this check when support for other
+				 * rings is added
+				 */
+				if (ring_id != RCS)
+					return -EINVAL;
+
+				props->cs_mode = true;
+				props->ring_id = ring_id;
+			}
+			break;
+		case DRM_I915_PERF_SAMPLE_CTX_ID_PROP:
+			props->sample_flags |= SAMPLE_CTX_ID;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
@@ -1617,7 +2148,9 @@ void i915_perf_init(struct drm_device *dev)
 
 	INIT_LIST_HEAD(&dev_priv->perf.streams);
 	mutex_init(&dev_priv->perf.lock);
+	mutex_init(&dev_priv->perf.streams_lock);
 	spin_lock_init(&dev_priv->perf.hook_lock);
+	spin_lock_init(&dev_priv->perf.node_list_lock);
 
 	if (IS_HASWELL(dev)) {
 		dev_priv->perf.oa.ops.init_oa_buffer = gen7_init_oa_buffer;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 4789555..1de5b68 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -931,10 +931,14 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
 	exec_start = params->batch_obj_vm_offset +
 		     args->batch_start_offset;
 
+	i915_perf_command_stream_hook(params->request);
+
 	ret = ring->emit_bb_start(params->request, exec_start, params->dispatch_flags);
 	if (ret)
 		return ret;
 
+	i915_perf_command_stream_hook(params->request);
+
 	trace_i915_gem_ring_dispatch(params->request, params->dispatch_flags);
 
 	i915_gem_execbuffer_move_to_active(vmas, params->request);
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 6dbaa6d..55b553b 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1160,6 +1160,7 @@ enum drm_i915_perf_oa_event_source {
 	I915_PERF_OA_EVENT_SOURCE_UNDEFINED,
 	I915_PERF_OA_EVENT_SOURCE_PERIODIC,
 	I915_PERF_OA_EVENT_SOURCE_CONTEXT_SWITCH,
+	I915_PERF_OA_EVENT_SOURCE_RCS,
 
 	I915_PERF_OA_EVENT_SOURCE_MAX	/* non-ABI */
 };
@@ -1205,6 +1206,19 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_SAMPLE_OA_SOURCE_PROP,
 
+	/**
+	 * The value of this property specifies the GPU engine (ring) for which
+	 * the samples are to be collected. Specifying this property also
+	 * implies command stream based sample collection.
+	 */
+	DRM_I915_PERF_RING_PROP,
+
+	/**
+	 * Setting the value of this property to 1 requests inclusion of the
+	 * context ID in the perf sample data.
+	 */
+	DRM_I915_PERF_SAMPLE_CTX_ID_PROP,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1253,6 +1267,7 @@ enum drm_i915_perf_record_type {
 	 *     struct drm_i915_perf_record_header header;
 	 *
 	 *     { u32 source_info; } && DRM_I915_PERF_SAMPLE_OA_SOURCE_PROP
+	 *     { u32 ctx_id; } && DRM_I915_PERF_SAMPLE_CTX_ID_PROP
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_SAMPLE_OA_PROP
 	 * };
 	 */
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 07/11] drm/i915: Add support for having pid output with OA report
  2016-02-16  5:27 [PATCH 00/11] Framework to collect gpu metrics using i915 perf infrastructure sourab.gupta
                   ` (5 preceding siblings ...)
  2016-02-16  5:27 ` [PATCH 06/11] drm/i915: Framework for capturing command stream based OA reports sourab.gupta
@ 2016-02-16  5:27 ` sourab.gupta
  2016-02-16  5:27 ` [PATCH 08/11] drm/i915: Add support to add execbuffer tags to OA counter reports sourab.gupta
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: sourab.gupta @ 2016-02-16  5:27 UTC (permalink / raw)
  To: intel-gfx; +Cc: Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch introduces the flags and adds support for including the pid with
the OA reports generated through the RCS commands.

When the stream is opened with the pid sample type, the pid information is
also captured via the command stream samples and forwarded along with the
OA reports.

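For illustration, opening such a stream from userspace might look roughly as
below. This is a minimal sketch, not code from this series: the
drm_i915_perf_open_param field names and the DRM_IOCTL_I915_PERF_OPEN ioctl
name are assumed from the underlying i915 perf ioctl series, with properties
passed as (key, value) u64 pairs.

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <drm/i915_drm.h>

  /* Sketch only: open a CS-mode stream on the render ring which samples
   * the ctx_id and pid with each report. */
  static int open_pid_stream(int drm_fd)
  {
      uint64_t properties[] = {
          DRM_I915_PERF_RING_PROP,          I915_EXEC_RENDER,
          DRM_I915_PERF_SAMPLE_CTX_ID_PROP, 1,
          DRM_I915_PERF_SAMPLE_PID_PROP,    1,
      };
      struct drm_i915_perf_open_param param = {
          .flags = I915_PERF_FLAG_FD_CLOEXEC,
          .num_properties = sizeof(properties) / (2 * sizeof(uint64_t)),
          .properties_ptr = (uintptr_t)properties,
      };

      /* The returned fd is read() for sample records; each record then
       * carries the u32 ctx_id and u32 pid fields documented in the
       * uapi sample layout. */
      return ioctl(drm_fd, DRM_IOCTL_I915_PERF_OPEN, &param);
  }

Note that the pid captured is current->pid at the point the commands are
inserted into the ring, i.e. the process submitting the execbuffer.
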
Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |  1 +
 drivers/gpu/drm/i915/i915_perf.c | 39 ++++++++++++++++++++++++++++++++++++++-
 include/uapi/drm/i915_drm.h      |  7 +++++++
 3 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 44fcbf4..a8b374f 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1808,6 +1808,7 @@ struct i915_perf_cs_data_node {
 	struct drm_i915_gem_request *request;
 	u32 offset;
 	u32 ctx_id;
+	u32 pid;
 };
 
 struct drm_i915_private {
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 2904745..ea331eb 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -55,6 +55,7 @@ static u32 i915_perf_stream_paranoid = true;
 struct oa_sample_data {
 	u32 source;
 	u32 ctx_id;
+	u32 pid;
 	const u8 *report;
 };
 
@@ -96,6 +97,7 @@ static struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] = {
 #define SAMPLE_OA_REPORT	(1<<0)
 #define SAMPLE_OA_SOURCE_INFO	(1<<1)
 #define SAMPLE_CTX_ID		(1<<2)
+#define SAMPLE_PID		(1<<3)
 
 struct perf_open_properties
 {
@@ -255,6 +257,7 @@ static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req)
 	}
 
 	entry->ctx_id = ctx->global_id;
+	entry->pid = current->pid;
 	i915_gem_request_assign(&entry->request, req);
 
 	insert_perf_entry(dev_priv, entry);
@@ -405,6 +408,12 @@ static bool append_oa_sample(struct i915_perf_stream *stream,
 		read_state->buf += 4;
 	}
 
+	if (sample_flags & SAMPLE_PID) {
+		if (copy_to_user(read_state->buf, &data->pid, 4))
+			return false;
+		read_state->buf += 4;
+	}
+
 	if (sample_flags & SAMPLE_OA_REPORT) {
 		if (copy_to_user(read_state->buf, data->report, report_size))
 			return false;
@@ -446,6 +455,10 @@ static bool append_oa_buffer_sample(struct i915_perf_stream *stream,
 	if (sample_flags & SAMPLE_CTX_ID)
 		data.ctx_id = 0;
 
+#warning "FIXME: append_oa_buffer_sample: deduce pid for periodic samples based on most recent RCS pid for ctx"
+	if (sample_flags & SAMPLE_PID)
+		data.pid = 0;
+
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
 
@@ -685,6 +698,9 @@ static bool append_oa_rcs_sample(struct i915_perf_stream *stream,
 	if (sample_flags & SAMPLE_CTX_ID)
 		data.ctx_id = node->ctx_id;
 
+	if (sample_flags & SAMPLE_PID)
+		data.pid = node->pid;
+
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
 
@@ -1307,6 +1323,7 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	bool require_oa_unit = props->sample_flags & (SAMPLE_OA_REPORT |
 						      SAMPLE_OA_SOURCE_INFO);
-	bool require_cs_mode = false;
+	bool require_cs_mode = props->sample_flags & SAMPLE_PID;
 	int format_size;
 	int ret;
 
@@ -1416,8 +1433,20 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 			require_cs_mode = true;
 	}
 
+	if (require_cs_mode && !props->cs_mode) {
+		DRM_ERROR("PID sampling requires a ring to be specified");
+		ret = -EINVAL;
+		goto cs_error;
+	}
+
 	if (props->cs_mode) {
-		if (!(props->sample_flags & SAMPLE_CTX_ID)) {
+		/*
+		 * The only time we should allow enabling CS mode when it's
+		 * not strictly required is if SAMPLE_CTX_ID has been
+		 * requested, as it's usable with either periodic OA or CS
+		 * sampling.
+		 */
+		if (!require_cs_mode &&
+		    !(props->sample_flags & SAMPLE_CTX_ID)) {
 			DRM_ERROR(
 				"Ring given without requesting any CS data to sample");
 			ret = -EINVAL;
@@ -1426,6 +1455,11 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 
 		stream->cs_mode = true;
 
+		if (props->sample_flags & SAMPLE_PID) {
+			stream->sample_flags |= SAMPLE_PID;
+			stream->sample_size += 4;
+		}
+
 		ret = alloc_command_stream_buf(dev_priv);
 		if (ret)
 			goto cs_error;
@@ -2045,6 +2079,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_SAMPLE_CTX_ID_PROP:
 			props->sample_flags |= SAMPLE_CTX_ID;
 			break;
+		case DRM_I915_PERF_SAMPLE_PID_PROP:
+			props->sample_flags |= SAMPLE_PID;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 55b553b..3353926 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1219,6 +1219,12 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_SAMPLE_CTX_ID_PROP,
 
+	/**
+	 * Setting the value of this property to 1 requests inclusion of the
+	 * pid in the perf sample data.
+	 */
+	DRM_I915_PERF_SAMPLE_PID_PROP,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1268,6 +1274,7 @@ enum drm_i915_perf_record_type {
 	 *
 	 *     { u32 source_info; } && DRM_I915_PERF_SAMPLE_OA_SOURCE_PROP
 	 *     { u32 ctx_id; } && DRM_I915_PERF_SAMPLE_CTX_ID_PROP
+	 *     { u32 pid; } && DRM_I915_PERF_SAMPLE_PID_PROP
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_SAMPLE_OA_PROP
 	 * };
 	 */
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 08/11] drm/i915: Add support to add execbuffer tags to OA counter reports
  2016-02-16  5:27 [PATCH 00/11] Framework to collect gpu metrics using i915 perf infrastructure sourab.gupta
                   ` (6 preceding siblings ...)
  2016-02-16  5:27 ` [PATCH 07/11] drm/i915: Add support for having pid output with OA report sourab.gupta
@ 2016-02-16  5:27 ` sourab.gupta
  2016-02-16  5:27 ` [PATCH 09/11] drm/i915: Extend i915 perf framework for collecting timestamps on all gpu engines sourab.gupta
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: sourab.gupta @ 2016-02-16  5:27 UTC (permalink / raw)
  To: intel-gfx; +Cc: Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch enables userspace to specify per-workload tags via the
execbuffer ioctl, which can be added to OA reports to help associate
reports with their corresponding workloads.

From a userspace perspective, there may be multiple stages within a single
context. The OA reports need to be individually associated with their
corresponding workloads (execbuffers), which may not be possible with
ctx_id or pid information alone. This patch enables such a mechanism.

In this patch, the previously unused upper 32 bits of the rsvd1 field are
used to pass in the tag.

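The userspace side of the tagging might look roughly as below. This is a
sketch only: the helper name is hypothetical, and only the rsvd1 packing
itself (mirroring i915_execbuffer2_get_tag() and
i915_execbuffer2_get_context_id()) is defined by this patch.

  /* Sketch only: place a per-stage tag in the upper 32 bits of rsvd1,
   * keeping the context id in the lower 32 bits. */
  static void set_execbuf_tag(struct drm_i915_gem_execbuffer2 *eb2,
                              uint32_t ctx_id, uint32_t stage_tag)
  {
      eb2->rsvd1 = ((uint64_t)stage_tag << 32) | ctx_id;
  }

Each stage of a workload can then be submitted with a distinct tag, and that
tag comes back in every command stream sample generated for the
corresponding execbuffer.
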
Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |  6 +++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  5 +++--
 drivers/gpu/drm/i915/i915_perf.c           | 36 +++++++++++++++++++++++++-----
 drivers/gpu/drm/i915/intel_lrc.c           |  4 ++--
 include/uapi/drm/i915_drm.h                | 12 ++++++++++
 5 files changed, 52 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a8b374f..cf86228 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1701,6 +1701,7 @@ struct i915_execbuffer_params {
 	struct drm_i915_gem_object      *batch_obj;
 	struct intel_context            *ctx;
 	struct drm_i915_gem_request     *request;
+	uint32_t			tag;
 };
 
 struct i915_oa_format {
@@ -1781,7 +1782,7 @@ struct i915_perf_stream {
 	 * Routine to emit the commands in the command streamer associated
 	 * with the corresponding gpu engine.
 	 */
-	void (*command_stream_hook)(struct drm_i915_gem_request *req);
+	void (*command_stream_hook)(struct drm_i915_gem_request *req, u32 tag);
 };
 
 struct i915_oa_ops {
@@ -1809,6 +1810,7 @@ struct i915_perf_cs_data_node {
 	u32 offset;
 	u32 ctx_id;
 	u32 pid;
+	u32 tag;
 };
 
 struct drm_i915_private {
@@ -3361,7 +3363,7 @@ void i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
 				struct intel_context *context);
 void i915_oa_legacy_ctx_switch_notify(struct drm_i915_gem_request *req);
 void i915_oa_update_reg_state(struct intel_engine_cs *ring, uint32_t *reg_state);
-void i915_perf_command_stream_hook(struct drm_i915_gem_request *req);
+void i915_perf_command_stream_hook(struct drm_i915_gem_request *req, u32 tag);
 
 /* i915_gem_evict.c */
 int __must_check i915_gem_evict_something(struct drm_device *dev,
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 6860fca..5e3ed23 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1258,7 +1258,7 @@ i915_gem_ringbuffer_submission(struct i915_execbuffer_params *params,
 	exec_start = params->batch_obj_vm_offset +
 		     params->args_batch_start_offset;
 
-	i915_perf_command_stream_hook(params->request);
+	i915_perf_command_stream_hook(params->request, params->tag);
 
 	ret = ring->dispatch_execbuffer(params->request,
 					exec_start, exec_len,
@@ -1266,7 +1266,7 @@ i915_gem_ringbuffer_submission(struct i915_execbuffer_params *params,
 	if (ret)
 		return ret;
 
-	i915_perf_command_stream_hook(params->request);
+	i915_perf_command_stream_hook(params->request, params->tag);
 
 	trace_i915_gem_ring_dispatch(params->request, params->dispatch_flags);
 
@@ -1574,6 +1574,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	params->dispatch_flags          = dispatch_flags;
 	params->batch_obj               = batch_obj;
 	params->ctx                     = ctx;
+	params->tag			= i915_execbuffer2_get_tag(*args);
 
 	ret = dev_priv->gt.execbuf_submit(params, args, &eb->vmas);
 
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index ea331eb..141f721 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -56,6 +56,7 @@ struct oa_sample_data {
 	u32 source;
 	u32 ctx_id;
 	u32 pid;
+	u32 tag;
 	const u8 *report;
 };
 
@@ -98,6 +99,7 @@ static struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] = {
 #define SAMPLE_OA_SOURCE_INFO	(1<<1)
 #define SAMPLE_CTX_ID		(1<<2)
 #define SAMPLE_PID		(1<<3)
+#define SAMPLE_TAG		(1<<4)
 
 struct perf_open_properties
 {
@@ -123,7 +125,7 @@ struct perf_open_properties
  * perf mutex lock.
  */
 
-void i915_perf_command_stream_hook(struct drm_i915_gem_request *req)
+void i915_perf_command_stream_hook(struct drm_i915_gem_request *req, u32 tag)
 {
 	struct intel_engine_cs *ring = req->ring;
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
@@ -135,7 +137,7 @@ void i915_perf_command_stream_hook(struct drm_i915_gem_request *req)
 	mutex_lock(&dev_priv->perf.streams_lock);
 	list_for_each_entry(stream, &dev_priv->perf.streams, link) {
 		if (stream->enabled && stream->command_stream_hook)
-			stream->command_stream_hook(req);
+			stream->command_stream_hook(req, tag);
 	}
 	mutex_unlock(&dev_priv->perf.streams_lock);
 }
@@ -227,7 +229,8 @@ static void insert_perf_entry(struct drm_i915_private *dev_priv,
 	spin_unlock(&dev_priv->perf.node_list_lock);
 }
 
-static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req)
+static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req,
+						u32 tag)
 {
 	struct intel_engine_cs *ring = req->ring;
 	struct intel_ringbuffer *ringbuf = req->ringbuf;
@@ -258,6 +261,7 @@ static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req)
 
 	entry->ctx_id = ctx->global_id;
 	entry->pid = current->pid;
+	entry->tag = tag;
 	i915_gem_request_assign(&entry->request, req);
 
 	insert_perf_entry(dev_priv, entry);
@@ -414,6 +418,12 @@ static bool append_oa_sample(struct i915_perf_stream *stream,
 		read_state->buf += 4;
 	}
 
+	if (sample_flags & SAMPLE_TAG) {
+		if (copy_to_user(read_state->buf, &data->tag, 4))
+			return false;
+		read_state->buf += 4;
+	}
+
 	if (sample_flags & SAMPLE_OA_REPORT) {
 		if (copy_to_user(read_state->buf, data->report, report_size))
 			return false;
@@ -459,6 +469,10 @@ static bool append_oa_buffer_sample(struct i915_perf_stream *stream,
 	if (sample_flags & SAMPLE_PID)
 		data.pid = 0;
 
+#warning "FIXME: append_oa_buffer_sample: deduce tag for periodic samples based on most recent RCS tag for ctx"
+	if (sample_flags & SAMPLE_TAG)
+		data.tag = 0;
+
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
 
@@ -701,6 +715,9 @@ static bool append_oa_rcs_sample(struct i915_perf_stream *stream,
 	if (sample_flags & SAMPLE_PID)
 		data.pid = node->pid;
 
+	if (sample_flags & SAMPLE_TAG)
+		data.tag = node->tag;
+
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
 
@@ -1323,7 +1340,8 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	bool require_oa_unit = props->sample_flags & (SAMPLE_OA_REPORT |
 						      SAMPLE_OA_SOURCE_INFO);
-	bool require_cs_mode = props->sample_flags & SAMPLE_PID;
+	bool require_cs_mode = props->sample_flags & (SAMPLE_PID |
+						      SAMPLE_TAG);
 	int format_size;
 	int ret;
 
@@ -1434,7 +1452,7 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	}
 
 	if (require_cs_mode && !props->cs_mode) {
-		DRM_ERROR("PID sampling requires a ring to be specified");
+		DRM_ERROR("PID or TAG sampling requires a ring to be specified");
 		ret = -EINVAL;
 		goto cs_error;
 	}
@@ -1460,6 +1478,11 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 			stream->sample_size += 4;
 		}
 
+		if (props->sample_flags & SAMPLE_TAG) {
+			stream->sample_flags |= SAMPLE_TAG;
+			stream->sample_size += 4;
+		}
+
 		ret = alloc_command_stream_buf(dev_priv);
 		if (ret)
 			goto cs_error;
@@ -2082,6 +2105,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_SAMPLE_PID_PROP:
 			props->sample_flags |= SAMPLE_PID;
 			break;
+		case DRM_I915_PERF_SAMPLE_TAG_PROP:
+			props->sample_flags |= SAMPLE_TAG;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 1de5b68..05dc80e 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -931,13 +931,13 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
 	exec_start = params->batch_obj_vm_offset +
 		     args->batch_start_offset;
 
-	i915_perf_command_stream_hook(params->request);
+	i915_perf_command_stream_hook(params->request, params->tag);
 
 	ret = ring->emit_bb_start(params->request, exec_start, params->dispatch_flags);
 	if (ret)
 		return ret;
 
-	i915_perf_command_stream_hook(params->request);
+	i915_perf_command_stream_hook(params->request, params->tag);
 
 	trace_i915_gem_ring_dispatch(params->request, params->dispatch_flags);
 
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 3353926..5687080 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -788,6 +788,11 @@ struct drm_i915_gem_execbuffer2 {
 #define i915_execbuffer2_get_context_id(eb2) \
 	((eb2).rsvd1 & I915_EXEC_CONTEXT_ID_MASK)
 
+/* upper 32 bits of rsvd1 field contain tag */
+#define I915_EXEC_TAG_MASK		(0xffffffff00000000UL)
+#define i915_execbuffer2_get_tag(eb2) \
+	(((eb2).rsvd1 & I915_EXEC_TAG_MASK) >> 32)
+
 struct drm_i915_gem_pin {
 	/** Handle of the buffer to be pinned. */
 	__u32 handle;
@@ -1225,6 +1230,12 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_SAMPLE_PID_PROP,
 
+	/**
+	 * Setting the value of this property to 1 requests inclusion of the
+	 * tag in the perf sample data.
+	 */
+	DRM_I915_PERF_SAMPLE_TAG_PROP,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1275,6 +1286,7 @@ enum drm_i915_perf_record_type {
 	 *     { u32 source_info; } && DRM_I915_PERF_SAMPLE_OA_SOURCE_PROP
 	 *     { u32 ctx_id; } && DRM_I915_PERF_SAMPLE_CTX_ID_PROP
 	 *     { u32 pid; } && DRM_I915_PERF_SAMPLE_PID_PROP
+	 *     { u32 tag; } && DRM_I915_PERF_SAMPLE_TAG_PROP
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_SAMPLE_OA_PROP
 	 * };
 	 */
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 09/11] drm/i915: Extend i915 perf framework for collecting timestamps on all gpu engines
  2016-02-16  5:27 [PATCH 00/11] Framework to collect gpu metrics using i915 perf infrastructure sourab.gupta
                   ` (7 preceding siblings ...)
  2016-02-16  5:27 ` [PATCH 08/11] drm/i915: Add support to add execbuffer tags to OA counter reports sourab.gupta
@ 2016-02-16  5:27 ` sourab.gupta
  2016-02-16  5:27 ` [PATCH 10/11] drm/i915: Support opening multiple concurrent perf streams sourab.gupta
  2016-02-16  5:27 ` [PATCH 11/11] drm/i915: Support for capturing MMIO register values sourab.gupta
  10 siblings, 0 replies; 15+ messages in thread
From: sourab.gupta @ 2016-02-16  5:27 UTC (permalink / raw)
  To: intel-gfx; +Cc: Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch extends the i915 perf framework to handle perf sample collection
for any given gpu engine. In particular, support for collecting the
timestamp sample type is added, which can be requested for any engine.
Note that only a single stream instance can still be opened at any
particular time, though that stream may now be opened for any gpu engine
for the collection of timestamp samples.

So this patch doesn't yet add support for opening multiple concurrent
streams, though it lays the groundwork for that support to be added
subsequently. Part of this groundwork involves having separate per-engine
command stream buffers for holding the generated samples, and likewise
per-engine instances of a few other state-tracking data structures.

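For illustration, a timestamp stream on a non-render engine might be opened
roughly as below. This is a sketch only: DRM_I915_PERF_SAMPLE_TS_PROP is a
placeholder name for the timestamp sample property this patch adds to the
uapi, and it assumes the RCS-only restriction on DRM_I915_PERF_RING_PROP is
lifted by the per-engine changes here.

  /* Sketch only: request 8 byte timestamp samples (written via
   * MI_FLUSH_DW on non-render rings, or PIPE_CONTROL on RCS) for work
   * submitted on the blitter engine. */
  uint64_t properties[] = {
      DRM_I915_PERF_RING_PROP,      I915_EXEC_BLT,
      DRM_I915_PERF_SAMPLE_TS_PROP, 1, /* placeholder name */
  };
  struct drm_i915_perf_open_param param = {
      .flags = I915_PERF_FLAG_FD_CLOEXEC,
      .num_properties = sizeof(properties) / (2 * sizeof(uint64_t)),
      .properties_ptr = (uintptr_t)properties,
  };
  int stream_fd = ioctl(drm_fd, DRM_IOCTL_I915_PERF_OPEN, &param);
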
Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |  33 ++-
 drivers/gpu/drm/i915/i915_perf.c | 578 +++++++++++++++++++++++++++------------
 drivers/gpu/drm/i915/i915_reg.h  |   2 +
 include/uapi/drm/i915_drm.h      |   7 +
 4 files changed, 431 insertions(+), 189 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index cf86228..b1c952c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1725,6 +1725,7 @@ struct i915_perf_stream {
 
 	struct list_head link;
 
+	enum intel_ring_id ring_id;
 	u32 sample_flags;
 	int sample_size;
 
@@ -1734,6 +1735,9 @@ struct i915_perf_stream {
 	/* Whether command stream based data collection is enabled */
 	bool cs_mode;
 
+	/* Whether the OA unit is in use */
+	bool using_oa;
+
 	/* Enables the collection of HW samples, either in response to
 	 * I915_PERF_IOCTL_ENABLE or implicitly called when stream is
 	 * opened without I915_PERF_FLAG_DISABLED */
@@ -1782,7 +1786,8 @@ struct i915_perf_stream {
 	 * Routine to emit the commands in the command streamer associated
 	 * with the corresponding gpu engine.
 	 */
-	void (*command_stream_hook)(struct drm_i915_gem_request *req, u32 tag);
+	void (*command_stream_hook)(struct i915_perf_stream *stream,
+				struct drm_i915_gem_request *req, u32 tag);
 };
 
 struct i915_oa_ops {
@@ -1807,7 +1812,16 @@ struct i915_oa_ops {
 struct i915_perf_cs_data_node {
 	struct list_head link;
 	struct drm_i915_gem_request *request;
-	u32 offset;
+
+	/* Offsets into the GEM obj holding the data */
+	u32 start_offset;
+	u32 oa_offset;
+	u32 ts_offset;
+
+	/* buffer size corresponding to this entry */
+	u32 size;
+
+	/* Other metadata */
 	u32 ctx_id;
 	u32 pid;
 	u32 tag;
@@ -2071,14 +2085,13 @@ struct drm_i915_private {
 
 		spinlock_t hook_lock;
 
-		struct {
-			struct i915_perf_stream *exclusive_stream;
+		struct hrtimer poll_check_timer;
+		struct i915_perf_stream *exclusive_stream;
+		wait_queue_head_t poll_wq[I915_NUM_RINGS];
 
+		struct {
 			u32 specific_ctx_id;
 
-			struct hrtimer poll_check_timer;
-			wait_queue_head_t poll_wq;
-
 			bool periodic;
 			u32 period_exponent;
 
@@ -2115,10 +2128,10 @@ struct drm_i915_private {
 			struct drm_i915_gem_object *obj;
 			struct i915_vma *vma;
 			u8 *addr;
-		} command_stream_buf;
+		} command_stream_buf[I915_NUM_RINGS];
 
-		struct list_head node_list;
-		spinlock_t node_list_lock;
+		struct list_head node_list[I915_NUM_RINGS];
+		spinlock_t node_list_lock[I915_NUM_RINGS];
 	} perf;
 
 	/* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 141f721..1d2712d 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -51,12 +51,17 @@ static u32 i915_perf_stream_paranoid = true;
 #define GEN8_OAREPORT_REASON_GO_TRANSITION  (1<<23)
 #define GEN9_OAREPORT_REASON_CLK_RATIO      (1<<24)
 
-/* Data common to periodic and RCS based samples */
-struct oa_sample_data {
+#define OA_ADDR_ALIGN 64
+#define TS_ADDR_ALIGN 8
+#define I915_PERF_TS_SAMPLE_SIZE 8
+
+/* Data common to all samples (periodic OA / CS based OA / Timestamps) */
+struct sample_data {
 	u32 source;
 	u32 ctx_id;
 	u32 pid;
 	u32 tag;
+	u64 ts;
 	const u8 *report;
 };
 
@@ -100,6 +105,7 @@ static struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] = {
 #define SAMPLE_CTX_ID		(1<<2)
 #define SAMPLE_PID		(1<<3)
 #define SAMPLE_TAG		(1<<4)
+#define SAMPLE_TS		(1<<5)
 
 struct perf_open_properties
 {
@@ -136,8 +142,9 @@ void i915_perf_command_stream_hook(struct drm_i915_gem_request *req, u32 tag)
 
 	mutex_lock(&dev_priv->perf.streams_lock);
 	list_for_each_entry(stream, &dev_priv->perf.streams, link) {
-		if (stream->enabled && stream->command_stream_hook)
-			stream->command_stream_hook(req, tag);
+		if (stream->enabled && (stream->ring_id == ring->id) &&
+				stream->command_stream_hook)
+			stream->command_stream_hook(stream, req, tag);
 	}
 	mutex_unlock(&dev_priv->perf.streams_lock);
 }
@@ -150,16 +157,15 @@ void i915_perf_command_stream_hook(struct drm_i915_gem_request *req, u32 tag)
  * eventually, when the request associated with new entry completes.
  */
 static void release_some_perf_entries(struct drm_i915_private *dev_priv,
-					u32 target_size)
+				enum intel_ring_id id, u32 target_size)
 {
 	struct i915_perf_cs_data_node *entry, *next;
-	u32 entry_size = dev_priv->perf.oa.oa_buffer.format_size;
 	u32 size = 0;
 
 	list_for_each_entry_safe
-		(entry, next, &dev_priv->perf.node_list, link) {
+		(entry, next, &dev_priv->perf.node_list[id], link) {
 
-		size += entry_size;
+		size += entry->size;
 		i915_gem_request_unreference(entry->request);
 		list_del(&entry->link);
 		kfree(entry);
@@ -175,99 +181,117 @@ static void release_some_perf_entries(struct drm_i915_private *dev_priv,
  * the buffer, it will remove the oldest entries in order to make space.
  */
 static void insert_perf_entry(struct drm_i915_private *dev_priv,
+				struct i915_perf_stream *stream,
 				struct i915_perf_cs_data_node *entry)
 {
 	struct i915_perf_cs_data_node *first_entry, *last_entry;
-	int max_offset = dev_priv->perf.command_stream_buf.obj->base.size;
-	u32 entry_size = dev_priv->perf.oa.oa_buffer.format_size;
-
-	spin_lock(&dev_priv->perf.node_list_lock);
-	if (list_empty(&dev_priv->perf.node_list)) {
-		entry->offset = 0;
-		list_add_tail(&entry->link, &dev_priv->perf.node_list);
-		spin_unlock(&dev_priv->perf.node_list_lock);
-		return;
+	u32 sample_flags = stream->sample_flags;
+	enum intel_ring_id id = stream->ring_id;
+	int max_offset = dev_priv->perf.command_stream_buf[id].obj->base.size;
+	u32 offset, entry_size = 0;
+	bool sample_ts = false;
+
+	if (sample_flags & SAMPLE_OA_REPORT) {
+		entry_size += dev_priv->perf.oa.oa_buffer.format_size;
+	} else if (sample_flags & SAMPLE_TS) {
+		/*
+		 * XXX: TS data can anyway be derived from the OA report, so
+		 * there is no need to capture it separately for the RCS ring
+		 * if OA data is already being captured.
+		 */
+		entry_size += I915_PERF_TS_SAMPLE_SIZE;
+		sample_ts = true;
 	}
 
-	first_entry = list_first_entry(&dev_priv->perf.node_list,
+	spin_lock(&dev_priv->perf.node_list_lock[id]);
+	if (list_empty(&dev_priv->perf.node_list[id])) {
+		offset = 0;
+		goto out;
+	}
+
+	first_entry = list_first_entry(&dev_priv->perf.node_list[id],
 				       typeof(*first_entry), link);
-	last_entry = list_last_entry(&dev_priv->perf.node_list,
-				     typeof(*first_entry), link);
+	last_entry = list_last_entry(&dev_priv->perf.node_list[id],
+				     typeof(*last_entry), link);
 
-	if (last_entry->offset >= first_entry->offset) {
+	if (last_entry->start_offset >= first_entry->start_offset) {
 		/* Sufficient space available at the end of buffer? */
-		if (last_entry->offset + 2*entry_size < max_offset)
-			entry->offset = last_entry->offset + entry_size;
+		if (last_entry->start_offset + last_entry->size + entry_size
+							< max_offset)
+			offset = last_entry->start_offset + last_entry->size;
 		/*
 		 * Wraparound condition. Is sufficient space available at
 		 * beginning of buffer?
 		 */
-		else if (entry_size < first_entry->offset)
-			entry->offset = 0;
+		else if (entry_size < first_entry->start_offset)
+			offset = 0;
 		/* Insufficient space. Overwrite existing old entries */
 		else {
-			u32 target_size = entry_size - first_entry->offset;
+			u32 target_size = entry_size -
+						first_entry->start_offset;
 
-			release_some_perf_entries(dev_priv, target_size);
-			entry->offset = 0;
+			release_some_perf_entries(dev_priv, id, target_size);
+			offset = 0;
 		}
 	} else {
 		/* Sufficient space available? */
-		if (last_entry->offset + 2*entry_size < first_entry->offset)
-			entry->offset = last_entry->offset + entry_size;
+		if (last_entry->start_offset + last_entry->size + entry_size
+						< first_entry->start_offset)
+			offset = last_entry->start_offset + last_entry->size;
 		/* Insufficient space. Overwrite existing old entries */
 		else {
 			u32 target_size = entry_size -
-				(first_entry->offset - last_entry->offset -
-				entry_size);
+				(first_entry->start_offset -
+					last_entry->start_offset -
+					last_entry->size);
 
-			release_some_perf_entries(dev_priv, target_size);
-			entry->offset = last_entry->offset + entry_size;
+			release_some_perf_entries(dev_priv, id, target_size);
+			offset = last_entry->start_offset + last_entry->size;
 		}
 	}
-	list_add_tail(&entry->link, &dev_priv->perf.node_list);
-	spin_unlock(&dev_priv->perf.node_list_lock);
+
+out:
+	entry->start_offset = offset;
+	entry->size = entry_size;
+	if (stream->sample_flags & SAMPLE_OA_REPORT) {
+		entry->oa_offset = offset;
+		/* Ensure 64 byte alignment of oa_offset */
+		entry->oa_offset = ALIGN(entry->oa_offset, OA_ADDR_ALIGN);
+		offset = entry->oa_offset +
+				dev_priv->perf.oa.oa_buffer.format_size;
+	}
+	if (sample_ts) {
+		entry->ts_offset = offset;
+		/* Ensure 8 byte alignment of ts_offset */
+		entry->ts_offset = ALIGN(entry->ts_offset, TS_ADDR_ALIGN);
+		offset = entry->ts_offset + I915_PERF_TS_SAMPLE_SIZE;
+	}
+
+	list_add_tail(&entry->link, &dev_priv->perf.node_list[id]);
+	spin_unlock(&dev_priv->perf.node_list_lock[id]);
 }
 
-static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req,
-						u32 tag)
+static int i915_perf_stream_capture_oa_report(struct drm_i915_gem_request *req,
+				u32 offset)
 {
 	struct intel_engine_cs *ring = req->ring;
 	struct intel_ringbuffer *ringbuf = req->ringbuf;
-	struct intel_context *ctx = req->ctx;
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
-	struct i915_perf_cs_data_node *entry;
 	u32 addr = 0;
 	int ret;
 
 	/* OA counters are only supported on the render ring */
 	BUG_ON(ring->id != RCS);
 
-	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
-	if (entry == NULL) {
-		DRM_ERROR("alloc failed\n");
-		return;
-	}
-
 	if (i915.enable_execlists)
 		ret = intel_logical_ring_begin(req, 4);
 	else
 		ret = intel_ring_begin(req, 4);
 
-	if (ret) {
-		kfree(entry);
-		return;
-	}
-
-	entry->ctx_id = ctx->global_id;
-	entry->pid = current->pid;
-	entry->tag = tag;
-	i915_gem_request_assign(&entry->request, req);
-
-	insert_perf_entry(dev_priv, entry);
+	if (ret)
+		return ret;
 
-	addr = dev_priv->perf.command_stream_buf.vma->node.start +
-		entry->offset;
+	addr = dev_priv->perf.command_stream_buf[RCS].vma->node.start + offset;
 
 	/* addr should be 64 byte aligned */
 	BUG_ON(addr & 0x3f);
@@ -295,10 +319,154 @@ static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req,
 		}
 		intel_ring_advance(ring);
 	}
-	i915_vma_move_to_active(dev_priv->perf.command_stream_buf.vma, req);
+	return 0;
+}
+
+static int i915_perf_stream_capture_ts_data(struct drm_i915_gem_request *req,
+						u32 offset)
+{
+	struct intel_engine_cs *ring = req->ring;
+	struct intel_ringbuffer *ringbuf = req->ringbuf;
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	u32 addr = 0;
+	int ret;
+
+	if (i915.enable_execlists)
+		ret = intel_logical_ring_begin(req, 6);
+	else
+		ret = intel_ring_begin(req, 6);
+
+	if (ret)
+		return ret;
+
+	addr = dev_priv->perf.command_stream_buf[ring->id].vma->node.start +
+		offset;
+
+	if (i915.enable_execlists) {
+		if (ring->id == RCS) {
+			intel_logical_ring_emit(ringbuf,
+						GFX_OP_PIPE_CONTROL(6));
+			intel_logical_ring_emit(ringbuf,
+						PIPE_CONTROL_GLOBAL_GTT_IVB |
+						PIPE_CONTROL_TIMESTAMP_WRITE);
+			intel_logical_ring_emit(ringbuf, addr |
+						PIPE_CONTROL_GLOBAL_GTT);
+			intel_logical_ring_emit(ringbuf, 0);
+			intel_logical_ring_emit(ringbuf, 0);
+			intel_logical_ring_emit(ringbuf, 0);
+		} else {
+			uint32_t cmd;
+
+			cmd = MI_FLUSH_DW + 2; /* Gen8+ */
+
+			cmd |= MI_FLUSH_DW_OP_STAMP;
+
+			intel_logical_ring_emit(ringbuf, cmd);
+			intel_logical_ring_emit(ringbuf, addr |
+						MI_FLUSH_DW_USE_GTT);
+			intel_logical_ring_emit(ringbuf, 0);
+			intel_logical_ring_emit(ringbuf, 0);
+			intel_logical_ring_emit(ringbuf, 0);
+			intel_logical_ring_emit(ringbuf, MI_NOOP);
+		}
+		intel_logical_ring_advance(ringbuf);
+	} else {
+		if (ring->id == RCS) {
+			if (INTEL_INFO(ring->dev)->gen >= 8)
+				intel_ring_emit(ring, GFX_OP_PIPE_CONTROL(6));
+			else
+				intel_ring_emit(ring, GFX_OP_PIPE_CONTROL(5));
+			intel_ring_emit(ring,
+					PIPE_CONTROL_GLOBAL_GTT_IVB |
+					PIPE_CONTROL_TIMESTAMP_WRITE);
+			intel_ring_emit(ring, addr | PIPE_CONTROL_GLOBAL_GTT);
+			intel_ring_emit(ring, 0);
+			if (INTEL_INFO(ring->dev)->gen >= 8) {
+				intel_ring_emit(ring, 0);
+				intel_ring_emit(ring, 0);
+			} else {
+				intel_ring_emit(ring, 0);
+				intel_ring_emit(ring, MI_NOOP);
+			}
+		} else {
+			uint32_t cmd;
+
+			cmd = MI_FLUSH_DW + 1;
+			if (INTEL_INFO(ring->dev)->gen >= 8)
+				cmd += 1;
+
+			cmd |= MI_FLUSH_DW_OP_STAMP;
+
+			intel_ring_emit(ring, cmd);
+			intel_ring_emit(ring, addr | MI_FLUSH_DW_USE_GTT);
+			if (INTEL_INFO(ring->dev)->gen >= 8) {
+				intel_ring_emit(ring, 0);
+				intel_ring_emit(ring, 0);
+				intel_ring_emit(ring, 0);
+			} else {
+				intel_ring_emit(ring, 0);
+				intel_ring_emit(ring, 0);
+				intel_ring_emit(ring, MI_NOOP);
+			}
+			intel_ring_emit(ring, MI_NOOP);
+		}
+		intel_ring_advance(ring);
+	}
+	return 0;
+}
+
+static void i915_perf_stream_cs_hook(struct i915_perf_stream *stream,
+				struct drm_i915_gem_request *req, u32 tag)
+{
+	struct intel_engine_cs *ring = req->ring;
+	struct intel_context *ctx = req->ctx;
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	enum intel_ring_id id = stream->ring_id;
+	u32 sample_flags = stream->sample_flags;
+	struct i915_perf_cs_data_node *entry;
+	int ret = 0;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (entry == NULL) {
+		DRM_ERROR("alloc failed\n");
+		return;
+	}
+
+	entry->ctx_id = ctx->global_id;
+	entry->pid = current->pid;
+	entry->tag = tag;
+	i915_gem_request_assign(&entry->request, req);
+
+	insert_perf_entry(dev_priv, stream, entry);
+
+	if (sample_flags & SAMPLE_OA_REPORT) {
+		ret = i915_perf_stream_capture_oa_report(req, entry->oa_offset);
+		if (ret)
+			goto err;
+	} else if (sample_flags & SAMPLE_TS) {
+		/*
+		 * XXX: Since TS data can anyways be derived from OA report, so
+		 * XXX: TS data can anyway be derived from the OA report, so
+		 * there is no need to capture it separately for the RCS ring
+		 * if OA data is already being captured.
+		ret = i915_perf_stream_capture_ts_data(req, entry->ts_offset);
+		if (ret)
+			goto err;
+	}
+
+	i915_vma_move_to_active(dev_priv->perf.command_stream_buf[id].vma, req);
+	return;
+
+err:
+	i915_gem_request_unreference(entry->request);
+	spin_lock(&dev_priv->perf.node_list_lock[id]);
+	list_del(&entry->link);
+	kfree(entry);
+	spin_unlock(&dev_priv->perf.node_list_lock[id]);
 }
 
-static int i915_oa_rcs_wait_gpu(struct drm_i915_private *dev_priv)
+static int i915_perf_wait_gpu(struct drm_i915_private *dev_priv,
+				enum intel_ring_id id)
 {
 	struct i915_perf_cs_data_node *last_entry = NULL;
 	struct drm_i915_gem_request *req = NULL;
@@ -309,14 +477,14 @@ static int i915_oa_rcs_wait_gpu(struct drm_i915_private *dev_priv)
 	 * implicitly wait for the prior submitted requests. The refcount
 	 * of the requests is not decremented here.
 	 */
-	spin_lock(&dev_priv->perf.node_list_lock);
+	spin_lock(&dev_priv->perf.node_list_lock[id]);
 
-	if (!list_empty(&dev_priv->perf.node_list)) {
-		last_entry = list_last_entry(&dev_priv->perf.node_list,
+	if (!list_empty(&dev_priv->perf.node_list[id])) {
+		last_entry = list_last_entry(&dev_priv->perf.node_list[id],
 			struct i915_perf_cs_data_node, link);
 		req = last_entry->request;
 	}
-	spin_unlock(&dev_priv->perf.node_list_lock);
+	spin_unlock(&dev_priv->perf.node_list_lock[id]);
 
 	if (!req)
 		return 0;
@@ -331,17 +499,18 @@ static int i915_oa_rcs_wait_gpu(struct drm_i915_private *dev_priv)
 	return 0;
 }
 
-static void i915_oa_rcs_free_requests(struct drm_i915_private *dev_priv)
+static void i915_perf_free_requests(struct drm_i915_private *dev_priv,
+				enum intel_ring_id id)
 {
 	struct i915_perf_cs_data_node *entry, *next;
 
 	list_for_each_entry_safe
-		(entry, next, &dev_priv->perf.node_list, link) {
+		(entry, next, &dev_priv->perf.node_list[id], link) {
 		i915_gem_request_unreference__unlocked(entry->request);
 
-		spin_lock(&dev_priv->perf.node_list_lock);
+		spin_lock(&dev_priv->perf.node_list_lock[id]);
 		list_del(&entry->link);
-		spin_unlock(&dev_priv->perf.node_list_lock);
+		spin_unlock(&dev_priv->perf.node_list_lock[id]);
 		kfree(entry);
 	}
 }
@@ -381,9 +550,9 @@ static bool append_oa_status(struct i915_perf_stream *stream,
 	return true;
 }
 
-static bool append_oa_sample(struct i915_perf_stream *stream,
+static bool append_sample(struct i915_perf_stream *stream,
 			     struct i915_perf_read_state *read_state,
-			     struct oa_sample_data *data)
+			     struct sample_data *data)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
@@ -424,6 +593,13 @@ static bool append_oa_sample(struct i915_perf_stream *stream,
 		read_state->buf += 4;
 	}
 
+	if (sample_flags & SAMPLE_TS) {
+		if (copy_to_user(read_state->buf, &data->ts,
+					I915_PERF_TS_SAMPLE_SIZE))
+			return false;
+		read_state->buf += I915_PERF_TS_SAMPLE_SIZE;
+	}
+
 	if (sample_flags & SAMPLE_OA_REPORT) {
 		if (copy_to_user(read_state->buf, data->report, report_size))
 			return false;
@@ -441,7 +617,7 @@ static bool append_oa_buffer_sample(struct i915_perf_stream *stream,
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	u32 sample_flags = stream->sample_flags;
-	struct oa_sample_data data = { 0 };
+	struct sample_data data = { 0 };
 
 	if (sample_flags & SAMPLE_OA_SOURCE_INFO) {
 		enum drm_i915_perf_oa_event_source source;
@@ -473,10 +649,15 @@ static bool append_oa_buffer_sample(struct i915_perf_stream *stream,
 	if (sample_flags & SAMPLE_TAG)
 		data.tag = 0;
 
+	/* Derive timestamp from OA report, after scaling with the ts base */
+#warning "FIXME: append_oa_buffer_sample: derive the timestamp from OA report"
+	if (sample_flags & SAMPLE_TS)
+		data.ts = 0;
+
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
 
-	append_oa_sample(stream, read_state, &data);
+	append_sample(stream, read_state, &data);
 
 	return true;
 }
@@ -528,7 +709,7 @@ static u32 gen8_append_oa_reports(struct i915_perf_stream *stream,
 			ctx_id &= 0xfffff;
 		}
 
-		if (dev_priv->perf.oa.exclusive_stream->enabled) {
+		if (stream->enabled) {
 
 			/* NB: For Gen 8 we handle per-context report filtering
 			 * ourselves instead of programming the OA unit with a
@@ -539,7 +720,7 @@ static u32 gen8_append_oa_reports(struct i915_perf_stream *stream,
 			 * first report belonging to any subsequently
 			 * switched-too context.
 			 */
-			if (!dev_priv->perf.oa.exclusive_stream->ctx ||
+			if (!stream->ctx ||
 			    (dev_priv->perf.oa.specific_ctx_id == ctx_id ||
 			     (dev_priv->perf.oa.specific_ctx_id !=
 			      dev_priv->perf.oa.oa_buffer.last_ctx_id))) {
@@ -630,7 +811,7 @@ static u32 gen7_append_oa_reports(struct i915_perf_stream *stream,
 		if (report_ts > ts)
 			break;
 
-		if (dev_priv->perf.oa.exclusive_stream->enabled) {
+		if (stream->enabled) {
 			if (!append_oa_buffer_sample(stream, read_state,
 							report))
 				break;
@@ -687,24 +868,32 @@ static void gen7_oa_read(struct i915_perf_stream *stream,
 				    OA_MEM_SELECT_GGTT);
 }
 
-static bool append_oa_rcs_sample(struct i915_perf_stream *stream,
+static bool append_one_cs_sample(struct i915_perf_stream *stream,
 				 struct i915_perf_read_state *read_state,
 				 struct i915_perf_cs_data_node *node)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
-	struct oa_sample_data data = { 0 };
-	const u8 *report = dev_priv->perf.command_stream_buf.addr +
-				node->offset;
+	enum intel_ring_id id = stream->ring_id;
+	struct sample_data data = { 0 };
 	u32 sample_flags = stream->sample_flags;
-	u32 report_ts;
 
-	/*
-	 * Forward the periodic OA samples which have the timestamp lower
-	 * than timestamp of this sample, before forwarding this sample.
-	 * This ensures samples read by user are order acc. to their timestamps
-	 */
-	report_ts = *(u32 *)(report + 4);
-	dev_priv->perf.oa.ops.read(stream, read_state, report_ts);
+	if (sample_flags & SAMPLE_OA_REPORT) {
+		const u8 *report = dev_priv->perf.command_stream_buf[id].addr +
+				   node->oa_offset;
+		u32 sample_ts = *(u32 *)(report + 4);
+
+		BUG_ON(id != RCS);
+
+		data.report = report;
+
+		/*
+		 * Forward the periodic OA samples which have a timestamp
+		 * lower than the timestamp of this sample, before forwarding
+		 * this sample. This ensures the samples read by userspace are
+		 * ordered according to their timestamps.
+		 */
+		dev_priv->perf.oa.ops.read(stream, read_state, sample_ts);
+	}
 
 	if (sample_flags & SAMPLE_OA_SOURCE_INFO)
 		data.source = I915_PERF_OA_EVENT_SOURCE_RCS;
@@ -718,38 +907,51 @@ static bool append_oa_rcs_sample(struct i915_perf_stream *stream,
 	if (sample_flags & SAMPLE_TAG)
 		data.tag = node->tag;
 
-	if (sample_flags & SAMPLE_OA_REPORT)
-		data.report = report;
+	if (sample_flags & SAMPLE_TS) {
+		/* For RCS, derive timestamp from OA report, after
+		 * scaling with the timestamp base. For other rings, forward the
+		 * timestamp collected via command stream.
+		 */
+#warning "FIXME: append_one_cs_sample: derive the timestamp from OA report"
+		if (sample_flags & SAMPLE_OA_REPORT)
+			data.ts = 0;
+		else
+			data.ts = *(u64 *)
+				(dev_priv->perf.command_stream_buf[id].addr +
+					node->ts_offset);
+	}
 
-	append_oa_sample(stream, read_state, &data);
+	append_sample(stream, read_state, &data);
 
 	return true;
 }
 
-static void oa_rcs_append_reports(struct i915_perf_stream *stream,
+static void append_command_stream_samples(struct i915_perf_stream *stream,
 				  struct i915_perf_read_state *read_state)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
+	enum intel_ring_id id = stream->ring_id;
 	struct i915_perf_cs_data_node *entry, *next;
 
 	list_for_each_entry_safe(entry, next,
-				 &dev_priv->perf.node_list, link) {
+				 &dev_priv->perf.node_list[id], link) {
 		if (!i915_gem_request_completed(entry->request, true))
 			break;
 
-		if (!append_oa_rcs_sample(stream, read_state, entry))
+		if (!append_one_cs_sample(stream, read_state, entry))
 			break;
 
-		spin_lock(&dev_priv->perf.node_list_lock);
+		spin_lock(&dev_priv->perf.node_list_lock[id]);
 		list_del(&entry->link);
-		spin_unlock(&dev_priv->perf.node_list_lock);
+		spin_unlock(&dev_priv->perf.node_list_lock[id]);
 
 		i915_gem_request_unreference__unlocked(entry->request);
 		kfree(entry);
 	}
 
-	/* Flush any remaining periodic reports */
-	dev_priv->perf.oa.ops.read(stream, read_state, U32_MAX);
+	/* Flush any remaining periodic OA reports in case of RCS */
+	if (stream->sample_flags & SAMPLE_OA_REPORT)
+		dev_priv->perf.oa.ops.read(stream, read_state, U32_MAX);
 }
 
 static bool command_stream_buf_is_empty(struct i915_perf_stream *stream)
@@ -757,7 +959,7 @@ static bool command_stream_buf_is_empty(struct i915_perf_stream *stream)
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
 	if (stream->cs_mode)
-		return list_empty(&dev_priv->perf.node_list);
+		return list_empty(&dev_priv->perf.node_list[stream->ring_id]);
 	else
 		return true;
 }
@@ -772,63 +974,69 @@ static bool stream_have_data__unlocked(struct i915_perf_stream *stream)
 	 * can't be destroyed until completion (such as a read()) that ensures
 	 * the device + OA buffer can't disappear
 	 */
-	return !(dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv) &&
-		 command_stream_buf_is_empty(stream));
+	if (stream->sample_flags & SAMPLE_OA_REPORT)
+		return !(dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv) &&
+			command_stream_buf_is_empty(stream));
+	else
+		return !command_stream_buf_is_empty(stream);
 }
 
-static bool i915_oa_can_read(struct i915_perf_stream *stream)
+static bool i915_perf_stream_can_read(struct i915_perf_stream *stream)
 {
 
 	return stream_have_data__unlocked(stream);
 }
 
-static int i915_oa_wait_unlocked(struct i915_perf_stream *stream)
+static int i915_perf_stream_wait_unlocked(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
+	enum intel_ring_id id = stream->ring_id;
 	int ret;
 
 	if (stream->cs_mode) {
-		ret = i915_oa_rcs_wait_gpu(dev_priv);
+		ret = i915_perf_wait_gpu(dev_priv, id);
 		if (ret)
 			return ret;
 	}
 
-	return wait_event_interruptible(dev_priv->perf.oa.poll_wq,
+	return wait_event_interruptible(dev_priv->perf.poll_wq[id],
 					stream_have_data__unlocked(stream));
 }
 
-static void i915_oa_poll_wait(struct i915_perf_stream *stream,
+static void i915_perf_stream_poll_wait(struct i915_perf_stream *stream,
 			      struct file *file,
 			      poll_table *wait)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	poll_wait(file, &dev_priv->perf.oa.poll_wq, wait);
+	poll_wait(file, &dev_priv->perf.poll_wq[stream->ring_id], wait);
 }
 
-static void i915_oa_read(struct i915_perf_stream *stream,
+static void i915_perf_stream_read(struct i915_perf_stream *stream,
 			 struct i915_perf_read_state *read_state)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
 	if (stream->cs_mode)
-		oa_rcs_append_reports(stream, read_state);
-	else
+		append_command_stream_samples(stream, read_state);
+	else if (stream->ring_id == RCS)
 		dev_priv->perf.oa.ops.read(stream, read_state, U32_MAX);
 }
 
 static void
-free_command_stream_buf(struct drm_i915_private *i915)
+free_command_stream_buf(struct drm_i915_private *i915,
+				enum intel_ring_id id)
 {
 	mutex_lock(&i915->dev->struct_mutex);
 
-	vunmap(i915->perf.command_stream_buf.addr);
-	i915_gem_object_ggtt_unpin(i915->perf.command_stream_buf.obj);
-	drm_gem_object_unreference(&i915->perf.command_stream_buf.obj->base);
+	vunmap(i915->perf.command_stream_buf[id].addr);
+	i915_gem_object_ggtt_unpin(i915->perf.command_stream_buf[id].obj);
+	drm_gem_object_unreference(
+		&i915->perf.command_stream_buf[id].obj->base);
 
-	i915->perf.command_stream_buf.obj = NULL;
-	i915->perf.command_stream_buf.vma = NULL;
-	i915->perf.command_stream_buf.addr = NULL;
+	i915->perf.command_stream_buf[id].obj = NULL;
+	i915->perf.command_stream_buf[id].vma = NULL;
+	i915->perf.command_stream_buf[id].addr = NULL;
 
 	mutex_unlock(&i915->dev->struct_mutex);
 }
@@ -849,16 +1057,13 @@ free_oa_buffer(struct drm_i915_private *i915)
 	mutex_unlock(&i915->dev->struct_mutex);
 }
 
-static void i915_oa_stream_destroy(struct i915_perf_stream *stream)
+static void i915_perf_stream_destroy(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	BUG_ON(stream != dev_priv->perf.oa.exclusive_stream);
+	BUG_ON(stream != dev_priv->perf.exclusive_stream);
 
-	if (stream->cs_mode)
-		free_command_stream_buf(dev_priv);
-
-	if (dev_priv->perf.oa.oa_buffer.obj) {
+	if (stream->using_oa) {
 		dev_priv->perf.oa.ops.disable_metric_set(dev_priv);
 
 		free_oa_buffer(dev_priv);
@@ -867,7 +1072,10 @@ static void i915_oa_stream_destroy(struct i915_perf_stream *stream)
 		intel_runtime_pm_put(dev_priv);
 	}
 
-	dev_priv->perf.oa.exclusive_stream = NULL;
+	if (stream->cs_mode)
+		free_command_stream_buf(dev_priv, stream->ring_id);
+
+	dev_priv->perf.exclusive_stream = NULL;
 }
 
 static void *vmap_oa_buffer(struct drm_i915_gem_object *obj)
@@ -993,27 +1201,28 @@ static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
 	return 0;
 }
 
-static int alloc_command_stream_buf(struct drm_i915_private *dev_priv)
+static int alloc_command_stream_buf(struct drm_i915_private *dev_priv,
+					enum intel_ring_id id)
 {
 	struct drm_i915_gem_object *bo;
 	int ret;
 
-	BUG_ON(dev_priv->perf.command_stream_buf.obj);
+	BUG_ON(dev_priv->perf.command_stream_buf[id].obj);
 
 	ret = alloc_obj(dev_priv, &bo);
 	if (ret)
 		return ret;
 
-	dev_priv->perf.command_stream_buf.obj = bo;
-	dev_priv->perf.command_stream_buf.vma = i915_gem_obj_to_ggtt(bo);
-	dev_priv->perf.command_stream_buf.addr = vmap_oa_buffer(bo);
-	INIT_LIST_HEAD(&dev_priv->perf.node_list);
+	dev_priv->perf.command_stream_buf[id].obj = bo;
+	dev_priv->perf.command_stream_buf[id].vma = i915_gem_obj_to_ggtt(bo);
+	dev_priv->perf.command_stream_buf[id].addr = vmap_oa_buffer(bo);
+	INIT_LIST_HEAD(&dev_priv->perf.node_list[id]);
 
 	DRM_DEBUG_DRIVER(
 		"command stream buf initialized, gtt offset = 0x%x, vaddr = %p",
 		 (unsigned int)
-		 dev_priv->perf.command_stream_buf.vma->node.start,
-		 dev_priv->perf.command_stream_buf.addr);
+		 dev_priv->perf.command_stream_buf[id].vma->node.start,
+		 dev_priv->perf.command_stream_buf[id].addr);
 
 	return 0;
 }
@@ -1225,17 +1434,17 @@ static void gen7_update_oacontrol_locked(struct drm_i915_private *dev_priv)
 {
 	assert_spin_locked(&dev_priv->perf.hook_lock);
 
-	if (dev_priv->perf.oa.exclusive_stream->enabled) {
+	if (dev_priv->perf.exclusive_stream->enabled) {
 		unsigned long ctx_id = 0;
 		bool pinning_ok = false;
 
-		if (dev_priv->perf.oa.exclusive_stream->ctx &&
+		if (dev_priv->perf.exclusive_stream->ctx &&
 		    dev_priv->perf.oa.specific_ctx_id) {
 			ctx_id = dev_priv->perf.oa.specific_ctx_id;
 			pinning_ok = true;
 		}
 
-		if (dev_priv->perf.oa.exclusive_stream->ctx == NULL ||
+		if (dev_priv->perf.exclusive_stream->ctx == NULL ||
 		    pinning_ok) {
 			bool periodic = dev_priv->perf.oa.periodic;
 			u32 period_exponent = dev_priv->perf.oa.period_exponent;
@@ -1292,17 +1501,18 @@ static void gen8_oa_enable(struct drm_i915_private *dev_priv)
 	I915_WRITE(GEN8_OAHEADPTR, tail);
 }
 
-static void i915_oa_stream_enable(struct i915_perf_stream *stream)
+static void i915_perf_stream_enable(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	dev_priv->perf.oa.ops.oa_enable(dev_priv);
+	if (stream->sample_flags & SAMPLE_OA_REPORT)
+		dev_priv->perf.oa.ops.oa_enable(dev_priv);
 
 	if (stream->cs_mode)
-		stream->command_stream_hook = i915_perf_command_stream_hook_oa;
+		stream->command_stream_hook = i915_perf_stream_cs_hook;
 
-	if (dev_priv->perf.oa.periodic)
-		hrtimer_start(&dev_priv->perf.oa.poll_check_timer,
+	if (stream->cs_mode || dev_priv->perf.oa.periodic)
+		hrtimer_start(&dev_priv->perf.poll_check_timer,
 			      ns_to_ktime(POLL_PERIOD),
 			      HRTIMER_MODE_REL_PINNED);
 }
@@ -1317,23 +1527,24 @@ static void gen8_oa_disable(struct drm_i915_private *dev_priv)
 	I915_WRITE(GEN8_OACONTROL, 0);
 }
 
-static void i915_oa_stream_disable(struct i915_perf_stream *stream)
+static void i915_perf_stream_disable(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	if (dev_priv->perf.oa.periodic)
-		hrtimer_cancel(&dev_priv->perf.oa.poll_check_timer);
+	if (stream->cs_mode || dev_priv->perf.oa.periodic)
+		hrtimer_cancel(&dev_priv->perf.poll_check_timer);
 
 	if (stream->cs_mode) {
 		stream->command_stream_hook = NULL;
-		i915_oa_rcs_wait_gpu(dev_priv);
-		i915_oa_rcs_free_requests(dev_priv);
+		i915_perf_wait_gpu(dev_priv, stream->ring_id);
+		i915_perf_free_requests(dev_priv, stream->ring_id);
 	}
 
-	dev_priv->perf.oa.ops.oa_disable(dev_priv);
+	if (stream->sample_flags & SAMPLE_OA_REPORT)
+		dev_priv->perf.oa.ops.oa_disable(dev_priv);
 }
 
-static int i915_oa_stream_init(struct i915_perf_stream *stream,
+static int i915_perf_stream_init(struct i915_perf_stream *stream,
 			       struct drm_i915_perf_open_param *param,
 			       struct perf_open_properties *props)
 {
@@ -1341,15 +1552,15 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	bool require_oa_unit = props->sample_flags & (SAMPLE_OA_REPORT |
 						      SAMPLE_OA_SOURCE_INFO);
 	bool require_cs_mode = props->sample_flags & (SAMPLE_PID |
-						      SAMPLE_TAG);
-	int format_size;
+						      SAMPLE_TAG |
+						      SAMPLE_TS);
 	int ret;
 
 	/* To avoid the complexity of having to accurately filter
 	 * counter reports and marshal to the appropriate client
 	 * we currently only allow exclusive access */
-	if (dev_priv->perf.oa.exclusive_stream) {
-		DRM_ERROR("OA unit already in use\n");
+	if (dev_priv->perf.exclusive_stream) {
+		DRM_ERROR("Stream already in use\n");
 		return -EBUSY;
 	}
 
@@ -1364,6 +1575,7 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	stream->sample_size = sizeof(struct drm_i915_perf_record_header);
 
 	if (require_oa_unit) {
+		int format_size;
 		if (!dev_priv->perf.oa.ops.init_oa_buffer) {
 			DRM_ERROR("OA unit not supported\n");
 			return -ENODEV;
@@ -1386,6 +1598,8 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 			return -EINVAL;
 		}
 
+		stream->using_oa = true;
+
 		format_size =
 			dev_priv->perf.oa.oa_formats[props->oa_format].size;
 
@@ -1452,7 +1666,8 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	}
 
 	if (require_cs_mode && !props->cs_mode) {
-		DRM_ERROR("PID or TAG sampling require a ring to be specified");
+		DRM_ERROR(
+			"PID, TAG or TS sampling requires a ring to be specified");
 		ret = -EINVAL;
 		goto cs_error;
 	}
@@ -1472,6 +1687,7 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 		}
 
 		stream->cs_mode = true;
+		stream->ring_id = props->ring_id;
 
 		if (props->sample_flags & SAMPLE_PID) {
 			stream->sample_flags |= SAMPLE_PID;
@@ -1483,20 +1699,25 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 			stream->sample_size += 4;
 		}
 
-		ret = alloc_command_stream_buf(dev_priv);
+		if (props->sample_flags & SAMPLE_TS) {
+			stream->sample_flags |= SAMPLE_TS;
+			stream->sample_size += I915_PERF_TS_SAMPLE_SIZE;
+		}
+
+		ret = alloc_command_stream_buf(dev_priv, stream->ring_id);
 		if (ret)
 			goto cs_error;
 	}
 
-	dev_priv->perf.oa.exclusive_stream = stream;
+	dev_priv->perf.exclusive_stream = stream;
 
-	stream->destroy = i915_oa_stream_destroy;
-	stream->enable = i915_oa_stream_enable;
-	stream->disable = i915_oa_stream_disable;
-	stream->can_read = i915_oa_can_read;
-	stream->wait_unlocked = i915_oa_wait_unlocked;
-	stream->poll_wait = i915_oa_poll_wait;
-	stream->read = i915_oa_read;
+	stream->destroy = i915_perf_stream_destroy;
+	stream->enable = i915_perf_stream_enable;
+	stream->disable = i915_perf_stream_disable;
+	stream->can_read = i915_perf_stream_can_read;
+	stream->wait_unlocked = i915_perf_stream_wait_unlocked;
+	stream->poll_wait = i915_perf_stream_poll_wait;
+	stream->read = i915_perf_stream_read;
 
 	return 0;
 
@@ -1530,8 +1751,8 @@ static void i915_oa_context_pin_notify_locked(struct drm_i915_private *dev_priv,
 	    dev_priv->perf.oa.ops.update_hw_ctx_id_locked == NULL)
 		return;
 
-	if (dev_priv->perf.oa.exclusive_stream &&
-	    dev_priv->perf.oa.exclusive_stream->ctx == context) {
+	if (dev_priv->perf.exclusive_stream &&
+	    dev_priv->perf.exclusive_stream->ctx == context) {
 		struct drm_i915_gem_object *obj =
 			context->legacy_hw_ctx.rcs_state;
 		u32 ctx_id = i915_gem_obj_ggtt_offset(obj);
@@ -1599,8 +1820,8 @@ void i915_oa_legacy_ctx_switch_notify(struct drm_i915_gem_request *req)
 	if (dev_priv->perf.oa.ops.legacy_ctx_switch_unlocked == NULL)
 		return;
 
-	if (dev_priv->perf.oa.exclusive_stream &&
-	    dev_priv->perf.oa.exclusive_stream->enabled) {
+	if (dev_priv->perf.exclusive_stream &&
+	    dev_priv->perf.exclusive_stream->enabled) {
 
 		/* XXX: We don't take a lock here and this may run
 		 * async with respect to stream methods. Notably we
@@ -1729,7 +1950,7 @@ static enum hrtimer_restart poll_check_timer_cb(struct hrtimer *hrtimer)
 
 	struct drm_i915_private *dev_priv =
 		container_of(hrtimer, typeof(*dev_priv),
-			     perf.oa.poll_check_timer);
+			     perf.poll_check_timer);
 
 	/* No need to protect the streams list here, since the hrtimer is
 	 * disabled before the stream is removed from list, and currently a
@@ -1738,7 +1959,7 @@ static enum hrtimer_restart poll_check_timer_cb(struct hrtimer *hrtimer)
 	 */
 	list_for_each_entry(stream, &dev_priv->perf.streams, link) {
 		if (stream_have_data__unlocked(stream))
-			wake_up(&dev_priv->perf.oa.poll_wq);
+			wake_up(&dev_priv->perf.poll_wq[stream->ring_id]);
 	}
 
 	hrtimer_forward_now(hrtimer, ns_to_ktime(POLL_PERIOD));
@@ -1947,7 +2168,7 @@ int i915_perf_open_ioctl_locked(struct drm_device *dev,
 	stream->dev_priv = dev_priv;
 	stream->ctx = specific_ctx;
 
-	ret = i915_oa_stream_init(stream, param, props);
+	ret = i915_perf_stream_init(stream, param, props);
 	if (ret)
 		goto err_alloc;
 
@@ -2088,13 +2309,6 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 				if (ring_id >= LAST_USER_RING)
 					return -EINVAL;
 
-				/* XXX: Currently only RCS is supported.
-				 * Remove this check when support for other
-				 * rings is added
-				 */
-				if (ring_id != RCS)
-					return -EINVAL;
-
 				props->cs_mode = true;
 				props->ring_id = ring_id;
 			}
@@ -2108,6 +2322,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_SAMPLE_TAG_PROP:
 			props->sample_flags |= SAMPLE_TAG;
 			break;
+		case DRM_I915_PERF_SAMPLE_TS_PROP:
+			props->sample_flags |= SAMPLE_TS;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
@@ -2193,6 +2410,7 @@ static struct ctl_table dev_root[] = {
 void i915_perf_init(struct drm_device *dev)
 {
 	struct drm_i915_private *dev_priv = to_i915(dev);
+	int i;
 
 	if (!(IS_HASWELL(dev) ||
 	      IS_BROADWELL(dev) || IS_CHERRYVIEW(dev) ||
@@ -2204,16 +2422,18 @@ void i915_perf_init(struct drm_device *dev)
 	if (!dev_priv->perf.metrics_kobj)
 		return;
 
-	hrtimer_init(&dev_priv->perf.oa.poll_check_timer,
+	hrtimer_init(&dev_priv->perf.poll_check_timer,
 		     CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	dev_priv->perf.oa.poll_check_timer.function = poll_check_timer_cb;
-	init_waitqueue_head(&dev_priv->perf.oa.poll_wq);
+	dev_priv->perf.poll_check_timer.function = poll_check_timer_cb;
+	for (i = 0; i < I915_NUM_RINGS; i++) {
+		spin_lock_init(&dev_priv->perf.node_list_lock[i]);
+		init_waitqueue_head(&dev_priv->perf.poll_wq[i]);
+	}
 
 	INIT_LIST_HEAD(&dev_priv->perf.streams);
 	mutex_init(&dev_priv->perf.lock);
 	mutex_init(&dev_priv->perf.streams_lock);
 	spin_lock_init(&dev_priv->perf.hook_lock);
-	spin_lock_init(&dev_priv->perf.node_list_lock);
 
 	if (IS_HASWELL(dev)) {
 		dev_priv->perf.oa.ops.init_oa_buffer = gen7_init_oa_buffer;
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index a333038..e244626 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -359,6 +359,7 @@
 #define   MI_FLUSH_DW_STORE_INDEX	(1<<21)
 #define   MI_INVALIDATE_TLB		(1<<18)
 #define   MI_FLUSH_DW_OP_STOREDW	(1<<14)
+#define   MI_FLUSH_DW_OP_STAMP		(3<<14)
 #define   MI_FLUSH_DW_OP_MASK		(3<<14)
 #define   MI_FLUSH_DW_NOTIFY		(1<<8)
 #define   MI_INVALIDATE_BSD		(1<<7)
@@ -438,6 +439,7 @@
 #define   PIPE_CONTROL_TLB_INVALIDATE			(1<<18)
 #define   PIPE_CONTROL_MEDIA_STATE_CLEAR		(1<<16)
 #define   PIPE_CONTROL_QW_WRITE				(1<<14)
+#define   PIPE_CONTROL_TIMESTAMP_WRITE			(3<<14)
 #define   PIPE_CONTROL_POST_SYNC_OP_MASK                (3<<14)
 #define   PIPE_CONTROL_DEPTH_STALL			(1<<13)
 #define   PIPE_CONTROL_WRITE_FLUSH			(1<<12)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 5687080..2570f3ea 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1236,6 +1236,12 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_SAMPLE_TAG_PROP,
 
+	/**
+	 * Setting the value of this property to 1 requests inclusion of a
+	 * timestamp in the perf sample data.
+	 */
+	DRM_I915_PERF_SAMPLE_TS_PROP,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1287,6 +1293,7 @@ enum drm_i915_perf_record_type {
 	 *     { u32 ctx_id; } && DRM_I915_PERF_SAMPLE_CTX_ID_PROP
 	 *     { u32 pid; } && DRM_I915_PERF_SAMPLE_PID_PROP
 	 *     { u32 tag; } && DRM_I915_PERF_SAMPLE_TAG_PROP
+	 *     { u64 timestamp; } && DRM_I915_PERF_SAMPLE_TS_PROP
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_SAMPLE_OA_PROP
 	 * };
 	 */
-- 
1.9.1


* [PATCH 10/11] drm/i915: Support opening multiple concurrent perf streams
  2016-02-16  5:27 [PATCH 00/11] Framework to collect gpu metrics using i915 perf infrastructure sourab.gupta
                   ` (8 preceding siblings ...)
  2016-02-16  5:27 ` [PATCH 09/11] drm/i915: Extend i915 perf framework for collecting timestamps on all gpu engines sourab.gupta
@ 2016-02-16  5:27 ` sourab.gupta
  2016-02-16  5:27 ` [PATCH 11/11] drm/i915: Support for capturing MMIO register values sourab.gupta
  10 siblings, 0 replies; 15+ messages in thread
From: sourab.gupta @ 2016-02-16  5:27 UTC (permalink / raw)
  To: intel-gfx; +Cc: Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds support for opening multiple concurrent perf streams on
different gpu engines, while still restricting each engine to a single
open stream.
This enables a userspace client to open one stream per engine at any
time and capture sample data for all of them concurrently.
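
For illustration, a userspace client might open one stream per engine along
the lines of the sketch below. This is a hypothetical sketch only: the ring
property name (DRM_I915_PERF_RING_PROP) and the exact layout of
drm_i915_perf_open_param are assumptions based on this series, not confirmed
ABI, and error handling is elided.

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int open_stream_for_ring(int drm_fd, uint64_t ring_id)
{
	uint64_t properties[] = {
		DRM_I915_PERF_RING_PROP, ring_id,	/* assumed name */
		DRM_I915_PERF_SAMPLE_TS_PROP, 1,	/* per-sample ts */
	};
	struct drm_i915_perf_open_param param = {
		.num_properties = sizeof(properties) / (2 * sizeof(uint64_t)),
		.properties_ptr = (uintptr_t)properties,
	};

	/* Returns a new stream fd, or -1 with errno set on failure */
	return ioctl(drm_fd, DRM_IOCTL_I915_PERF_OPEN, &param);
}

With this patch, e.g. open_stream_for_ring(fd, 0) for RCS and
open_stream_for_ring(fd, 1) for VCS can both succeed at once, while a second
open on an already-opened ring fails with EBUSY.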

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |  2 +-
 drivers/gpu/drm/i915/i915_perf.c | 65 +++++++++++++++++++++++-----------------
 2 files changed, 38 insertions(+), 29 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index b1c952c..bf65acb 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2086,7 +2086,7 @@ struct drm_i915_private {
 		spinlock_t hook_lock;
 
 		struct hrtimer poll_check_timer;
-		struct i915_perf_stream *exclusive_stream;
+		struct i915_perf_stream *exclusive_stream[I915_NUM_RINGS];
 		wait_queue_head_t poll_wq[I915_NUM_RINGS];
 
 		struct {
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 1d2712d..3eb56d4 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1061,7 +1061,7 @@ static void i915_perf_stream_destroy(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	BUG_ON(stream != dev_priv->perf.exclusive_stream);
+	BUG_ON(stream != dev_priv->perf.exclusive_stream[stream->ring_id]);
 
 	if (stream->using_oa) {
 		dev_priv->perf.oa.ops.disable_metric_set(dev_priv);
@@ -1075,7 +1075,7 @@ static void i915_perf_stream_destroy(struct i915_perf_stream *stream)
 	if (stream->cs_mode)
 		free_command_stream_buf(dev_priv, stream->ring_id);
 
-	dev_priv->perf.exclusive_stream = NULL;
+	dev_priv->perf.exclusive_stream[stream->ring_id] = NULL;
 }
 
 static void *vmap_oa_buffer(struct drm_i915_gem_object *obj)
@@ -1434,17 +1434,17 @@ static void gen7_update_oacontrol_locked(struct drm_i915_private *dev_priv)
 {
 	assert_spin_locked(&dev_priv->perf.hook_lock);
 
-	if (dev_priv->perf.exclusive_stream->enabled) {
+	if (dev_priv->perf.exclusive_stream[RCS]->enabled) {
 		unsigned long ctx_id = 0;
 		bool pinning_ok = false;
 
-		if (dev_priv->perf.exclusive_stream->ctx &&
+		if (dev_priv->perf.exclusive_stream[RCS]->ctx &&
 		    dev_priv->perf.oa.specific_ctx_id) {
 			ctx_id = dev_priv->perf.oa.specific_ctx_id;
 			pinning_ok = true;
 		}
 
-		if (dev_priv->perf.exclusive_stream->ctx == NULL ||
+		if (dev_priv->perf.exclusive_stream[RCS]->ctx == NULL ||
 		    pinning_ok) {
 			bool periodic = dev_priv->perf.oa.periodic;
 			u32 period_exponent = dev_priv->perf.oa.period_exponent;
@@ -1556,14 +1556,6 @@ static int i915_perf_stream_init(struct i915_perf_stream *stream,
 						      SAMPLE_TS);
 	int ret;
 
-	/* To avoid the complexity of having to accurately filter
-	 * counter reports and marshal to the appropriate client
-	 * we currently only allow exclusive access */
-	if (dev_priv->perf.exclusive_stream) {
-		DRM_ERROR("Stream already in use\n");
-		return -EBUSY;
-	}
-
 	/* Ctx Id can be sampled in HSW only through command streamer mode */
 	if (IS_HASWELL(dev_priv->dev) &&
 	    (props->sample_flags & SAMPLE_CTX_ID) && !props->cs_mode) {
@@ -1576,6 +1568,13 @@ static int i915_perf_stream_init(struct i915_perf_stream *stream,
 
 	if (require_oa_unit) {
 		int format_size;
+
+		/* Only allow exclusive access per stream */
+		if (dev_priv->perf.exclusive_stream[RCS]) {
+			DRM_ERROR("Stream:0 already in use\n");
+			return -EBUSY;
+		}
+
 		if (!dev_priv->perf.oa.ops.init_oa_buffer) {
 			DRM_ERROR("OA unit not supported\n");
 			return -ENODEV;
@@ -1673,6 +1672,13 @@ static int i915_perf_stream_init(struct i915_perf_stream *stream,
 	}
 
 	if (props->cs_mode) {
+		/* Only allow exclusive access per stream */
+		if (dev_priv->perf.exclusive_stream[props->ring_id]) {
+			DRM_ERROR("Stream:%d already in use\n", props->ring_id);
+			ret = -EBUSY;
+			goto cs_error;
+		}
+
 		/*
 		 * The only time we should allow enabling CS mode if it's not
 		 * strictly required, is if SAMPLE_CTX_ID has been requested
@@ -1709,7 +1715,7 @@ static int i915_perf_stream_init(struct i915_perf_stream *stream,
 			goto cs_error;
 	}
 
-	dev_priv->perf.exclusive_stream = stream;
+	dev_priv->perf.exclusive_stream[stream->ring_id] = stream;
 
 	stream->destroy = i915_perf_stream_destroy;
 	stream->enable = i915_perf_stream_enable;
@@ -1751,8 +1757,8 @@ static void i915_oa_context_pin_notify_locked(struct drm_i915_private *dev_priv,
 	    dev_priv->perf.oa.ops.update_hw_ctx_id_locked == NULL)
 		return;
 
-	if (dev_priv->perf.exclusive_stream &&
-	    dev_priv->perf.exclusive_stream->ctx == context) {
+	if (dev_priv->perf.exclusive_stream[RCS] &&
+	    dev_priv->perf.exclusive_stream[RCS]->ctx == context) {
 		struct drm_i915_gem_object *obj =
 			context->legacy_hw_ctx.rcs_state;
 		u32 ctx_id = i915_gem_obj_ggtt_offset(obj);
@@ -1820,8 +1826,8 @@ void i915_oa_legacy_ctx_switch_notify(struct drm_i915_gem_request *req)
 	if (dev_priv->perf.oa.ops.legacy_ctx_switch_unlocked == NULL)
 		return;
 
-	if (dev_priv->perf.exclusive_stream &&
-	    dev_priv->perf.exclusive_stream->enabled) {
+	if (dev_priv->perf.exclusive_stream[RCS] &&
+	    dev_priv->perf.exclusive_stream[RCS]->enabled) {
 
 		/* XXX: We don't take a lock here and this may run
 		 * async with respect to stream methods. Notably we
@@ -1944,23 +1950,26 @@ static ssize_t i915_perf_read(struct file *file,
 	return ret;
 }
 
-static enum hrtimer_restart poll_check_timer_cb(struct hrtimer *hrtimer)
+static void wake_up_perf_streams(void *data, async_cookie_t cookie)
 {
+	struct drm_i915_private *dev_priv = data;
 	struct i915_perf_stream *stream;
 
-	struct drm_i915_private *dev_priv =
-		container_of(hrtimer, typeof(*dev_priv),
-			     perf.poll_check_timer);
-
-	/* No need to protect the streams list here, since the hrtimer is
-	 * disabled before the stream is removed from list, and currently a
-	 * single exclusive_stream is supported.
-	 * XXX: revisit this when multiple concurrent streams are supported.
-	 */
+	mutex_lock(&dev_priv->perf.streams_lock);
 	list_for_each_entry(stream, &dev_priv->perf.streams, link) {
 		if (stream_have_data__unlocked(stream))
 			wake_up(&dev_priv->perf.poll_wq[stream->ring_id]);
 	}
+	mutex_unlock(&dev_priv->perf.streams_lock);
+}
+
+static enum hrtimer_restart poll_check_timer_cb(struct hrtimer *hrtimer)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(hrtimer, typeof(*dev_priv),
+			     perf.poll_check_timer);
+
+	async_schedule(wake_up_perf_streams, dev_priv);
 
 	hrtimer_forward_now(hrtimer, ns_to_ktime(POLL_PERIOD));
 
-- 
1.9.1


* [PATCH 11/11] drm/i915: Support for capturing MMIO register values
  2016-02-16  5:27 [PATCH 00/11] Framework to collect gpu metrics using i915 perf infrastructure sourab.gupta
                   ` (9 preceding siblings ...)
  2016-02-16  5:27 ` [PATCH 10/11] drm/i915: Support opening multiple concurrent perf streams sourab.gupta
@ 2016-02-16  5:27 ` sourab.gupta
  10 siblings, 0 replies; 15+ messages in thread
From: sourab.gupta @ 2016-02-16  5:27 UTC (permalink / raw)
  To: intel-gfx; +Cc: Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds support for capturing MMIO register values through the
i915 perf interface.
Userspace can request up to 8 MMIO register values to be dumped per
sample. The addresses of these registers are passed through the
corresponding property 'value' field while opening the stream.
The commands to dump the values of these MMIO registers are then
inserted into the ring along with the other commands.
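
A minimal userspace sketch of the request side, assuming the uapi added by
this patch (the register offsets shown correspond to GEN6_GT_GFX_RC6 and
GEN6_GT_GFX_RC6p from the whitelist below; the remaining stream properties
and error handling are elided):

struct drm_i915_perf_mmio_list mmio = {
	.num_mmio  = 2,
	.mmio_list = {
		0x138108,	/* GEN6_GT_GFX_RC6 */
		0x13810c,	/* GEN6_GT_GFX_RC6p */
	},
};

uint64_t properties[] = {
	/* ring selection and other sample properties go here */
	DRM_I915_PERF_SAMPLE_MMIO_PROP, (uintptr_t)&mmio,
};

Each sample record read back then carries num_mmio u32 values in its
mmio[] member, in the order the registers were listed.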

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |   4 +
 drivers/gpu/drm/i915/i915_perf.c | 177 ++++++++++++++++++++++++++++++++++++++-
 include/uapi/drm/i915_drm.h      |  14 ++++
 3 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index bf65acb..fcaee75 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1817,6 +1817,7 @@ struct i915_perf_cs_data_node {
 	u32 start_offset;
 	u32 oa_offset;
 	u32 ts_offset;
+	u32 mmio_offset;
 
 	/* buffer size corresponding to this entry */
 	u32 size;
@@ -2089,6 +2090,9 @@ struct drm_i915_private {
 		struct i915_perf_stream *exclusive_stream[I915_NUM_RINGS];
 		wait_queue_head_t poll_wq[I915_NUM_RINGS];
 
+		u32 num_mmio;
+		u32 mmio_list[I915_PERF_MMIO_NUM_MAX];
+
 		struct {
 			u32 specific_ctx_id;
 
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 3eb56d4..45a7f22 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -63,6 +63,7 @@ struct sample_data {
 	u32 tag;
 	u64 ts;
 	const u8 *report;
+	const u8 *mmio;
 };
 
 /* for sysctl proc_dointvec_minmax of i915_oa_min_timer_exponent */
@@ -106,6 +107,7 @@ static struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] = {
 #define SAMPLE_PID		(1<<3)
 #define SAMPLE_TAG		(1<<4)
 #define SAMPLE_TS		(1<<5)
+#define SAMPLE_MMIO		(1<<6)
 
 struct perf_open_properties
 {
@@ -203,6 +205,9 @@ static void insert_perf_entry(struct drm_i915_private *dev_priv,
 		sample_ts = true;
 	}
 
+	if (sample_flags & SAMPLE_MMIO)
+		entry_size += 4*dev_priv->perf.num_mmio;
+
 	spin_lock(&dev_priv->perf.node_list_lock[id]);
 	if (list_empty(&dev_priv->perf.node_list[id])) {
 		offset = 0;
@@ -266,6 +271,10 @@ out:
 		entry->ts_offset = ALIGN(entry->ts_offset, TS_ADDR_ALIGN);
 		offset = entry->ts_offset + I915_PERF_TS_SAMPLE_SIZE;
 	}
+	if (sample_flags & SAMPLE_MMIO) {
+		entry->mmio_offset = offset;
+		offset = entry->mmio_offset + 4*dev_priv->perf.num_mmio;
+	}
 
 	list_add_tail(&entry->link, &dev_priv->perf.node_list[id]);
 	spin_unlock(&dev_priv->perf.node_list_lock[id]);
@@ -415,6 +424,72 @@ static int i915_perf_stream_capture_ts_data(struct drm_i915_gem_request *req,
 	return 0;
 }
 
+static int i915_perf_stream_capture_mmio_data(struct drm_i915_gem_request *req,
+						u32 offset)
+{
+	struct intel_engine_cs *ring = req->ring;
+	struct intel_ringbuffer *ringbuf = req->ringbuf;
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	int num_mmio = dev_priv->perf.num_mmio;
+	u32 mmio_addr, addr = 0;
+	int ret, i;
+
+	if (i915.enable_execlists)
+		ret = intel_logical_ring_begin(req, 4*num_mmio);
+	else
+		ret = intel_ring_begin(req, 4*num_mmio);
+
+	if (ret)
+		return ret;
+
+	mmio_addr =
+		dev_priv->perf.command_stream_buf[ring->id].vma->node.start +
+			offset;
+
+	if (i915.enable_execlists) {
+		for (i = 0; i < num_mmio; i++) {
+			uint32_t cmd;
+
+			addr = mmio_addr +
+				i * sizeof(dev_priv->perf.mmio_list[i]);
+
+			cmd = MI_STORE_REGISTER_MEM_GEN8 |
+						MI_SRM_LRM_GLOBAL_GTT;
+
+			intel_logical_ring_emit(ringbuf, cmd);
+			intel_logical_ring_emit(ringbuf,
+					dev_priv->perf.mmio_list[i]);
+			intel_logical_ring_emit(ringbuf, addr);
+			intel_logical_ring_emit(ringbuf, 0);
+		}
+		intel_logical_ring_advance(ringbuf);
+	} else {
+		for (i = 0; i < num_mmio; i++) {
+			uint32_t cmd;
+
+			addr = mmio_addr +
+				i * sizeof(dev_priv->perf.mmio_list[i]);
+
+			if (INTEL_INFO(ring->dev)->gen >= 8)
+				cmd = MI_STORE_REGISTER_MEM_GEN8 |
+					MI_SRM_LRM_GLOBAL_GTT;
+			else
+				cmd = MI_STORE_REGISTER_MEM |
+					MI_SRM_LRM_GLOBAL_GTT;
+
+			intel_ring_emit(ring, cmd);
+			intel_ring_emit(ring, dev_priv->perf.mmio_list[i]);
+			intel_ring_emit(ring, addr);
+			if (INTEL_INFO(ring->dev)->gen >= 8)
+				intel_ring_emit(ring, 0);
+			else
+				intel_ring_emit(ring, MI_NOOP);
+		}
+		intel_ring_advance(ring);
+	}
+	return 0;
+}
+
 static void i915_perf_stream_cs_hook(struct i915_perf_stream *stream,
 				struct drm_i915_gem_request *req, u32 tag)
 {
@@ -454,6 +529,13 @@ static void i915_perf_stream_cs_hook(struct i915_perf_stream *stream,
 			goto err;
 	}
 
+	if (sample_flags & SAMPLE_MMIO) {
+		ret = i915_perf_stream_capture_mmio_data(req,
+				entry->mmio_offset);
+		if (ret)
+			goto err;
+	}
+
 	i915_vma_move_to_active(dev_priv->perf.command_stream_buf[id].vma, req);
 	return;
 
@@ -600,6 +682,13 @@ static bool append_sample(struct i915_perf_stream *stream,
 		read_state->buf += I915_PERF_TS_SAMPLE_SIZE;
 	}
 
+	if (sample_flags & SAMPLE_MMIO) {
+		if (copy_to_user(read_state->buf, data->mmio,
+					4*dev_priv->perf.num_mmio))
+			return false;
+		read_state->buf += 4*dev_priv->perf.num_mmio;
+	}
+
 	if (sample_flags & SAMPLE_OA_REPORT) {
 		if (copy_to_user(read_state->buf, data->report, report_size))
 			return false;
@@ -618,6 +707,7 @@ static bool append_oa_buffer_sample(struct i915_perf_stream *stream,
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	u32 sample_flags = stream->sample_flags;
 	struct sample_data data = { 0 };
+	u32 mmio_list_dummy[I915_PERF_MMIO_NUM_MAX] = { 0 };
 
 	if (sample_flags & SAMPLE_OA_SOURCE_INFO) {
 		enum drm_i915_perf_oa_event_source source;
@@ -654,6 +744,10 @@ static bool append_oa_buffer_sample(struct i915_perf_stream *stream,
 	if (sample_flags & SAMPLE_TS)
 		data.ts = 0;
 
+	/* Periodic OA samples don't have mmio associated with them */
+	if (sample_flags & SAMPLE_MMIO)
+		data.mmio = (u8 *)mmio_list_dummy;
+
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
 
@@ -921,6 +1015,10 @@ static bool append_one_cs_sample(struct i915_perf_stream *stream,
 					node->ts_offset);
 	}
 
+	if (sample_flags & SAMPLE_MMIO)
+		data.mmio = dev_priv->perf.command_stream_buf[id].addr +
+				node->mmio_offset;
+
 	append_sample(stream, read_state, &data);
 
 	return true;
@@ -1553,7 +1651,8 @@ static int i915_perf_stream_init(struct i915_perf_stream *stream,
 						      SAMPLE_OA_SOURCE_INFO);
 	bool require_cs_mode = props->sample_flags & (SAMPLE_PID |
 						      SAMPLE_TAG |
-						      SAMPLE_TS);
+						      SAMPLE_TS |
+						      SAMPLE_MMIO);
 	int ret;
 
 	/* Ctx Id can be sampled in HSW only through command streamer mode */
@@ -1666,7 +1765,7 @@ static int i915_perf_stream_init(struct i915_perf_stream *stream,
 
 	if (require_cs_mode && !props->cs_mode) {
 		DRM_ERROR(
-			"PID, TAG or TS sampling requires a ring to be specified");
+			"PID, TAG, TS or MMIO sampling requires a ring to be specified");
 		ret = -EINVAL;
 		goto cs_error;
 	}
@@ -1710,6 +1809,11 @@ static int i915_perf_stream_init(struct i915_perf_stream *stream,
 			stream->sample_size += I915_PERF_TS_SAMPLE_SIZE;
 		}
 
+		if (props->sample_flags & SAMPLE_MMIO) {
+			stream->sample_flags |= SAMPLE_MMIO;
+			stream->sample_size += 4 * dev_priv->perf.num_mmio;
+		}
+
 		ret = alloc_command_stream_buf(dev_priv, stream->ring_id);
 		if (ret)
 			goto cs_error;
@@ -2222,6 +2326,69 @@ err:
 	return ret;
 }
 
+static int check_mmio_whitelist(struct drm_i915_private *dev_priv, u32 num_mmio)
+{
+#define GEN_RANGE(l, h) GENMASK(h, l)
+	static const struct register_whitelist {
+		uint64_t offset;
+		uint32_t size;
+		/* supported gens, 0x10 for 4, 0x30 for 4 and 5, etc. */
+		uint32_t gen_bitmask;
+	} whitelist[] = {
+		{ GEN6_GT_GFX_RC6, 4, GEN_RANGE(7, 9) },
+		{ GEN6_GT_GFX_RC6p, 4, GEN_RANGE(7, 9) },
+	};
+	int i, count;
+
+	for (count = 0; count < num_mmio; count++) {
+		/* Coarse check on mmio reg addresses being non-zero */
+		if (!dev_priv->perf.mmio_list[count])
+			return -EINVAL;
+
+		for (i = 0; i < ARRAY_SIZE(whitelist); i++) {
+			if ((whitelist[i].offset ==
+				dev_priv->perf.mmio_list[count]) &&
+			    (1 << INTEL_INFO(dev_priv->dev)->gen &
+					whitelist[i].gen_bitmask))
+				break;
+		}
+
+		if (i == ARRAY_SIZE(whitelist))
+			return -EINVAL;
+	}
+	return 0;
+}
+
+static int copy_mmio_list(struct drm_i915_private *dev_priv,
+				void __user *mmio)
+{
+	void __user *mmio_list = ((u8 __user *)mmio + 4);
+	u32 num_mmio;
+	int ret;
+
+	if (!mmio)
+		return -EINVAL;
+
+	ret = get_user(num_mmio, (u32 __user *)mmio);
+	if (ret)
+		return ret;
+
+	if (num_mmio > I915_PERF_MMIO_NUM_MAX)
+		return -EINVAL;
+
+	memset(dev_priv->perf.mmio_list, 0, sizeof(dev_priv->perf.mmio_list));
+	if (copy_from_user(dev_priv->perf.mmio_list, mmio_list, 4*num_mmio))
+		return -EINVAL;
+
+	ret = check_mmio_whitelist(dev_priv, num_mmio);
+	if (ret)
+		return ret;
+
+	dev_priv->perf.num_mmio = num_mmio;
+
+	return 0;
+}
+
 /* Note we copy the properties from userspace outside of the i915 perf
  * mutex to avoid an awkward lockdep with mmap_sem.
  *
@@ -2334,6 +2501,12 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_SAMPLE_TS_PROP:
 			props->sample_flags |= SAMPLE_TS;
 			break;
+		case DRM_I915_PERF_SAMPLE_MMIO_PROP:
+			ret = copy_mmio_list(dev_priv, (u64 __user *)value);
+			if (ret)
+				return ret;
+			props->sample_flags |= SAMPLE_MMIO;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 2570f3ea..38cf385 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1170,6 +1170,12 @@ enum drm_i915_perf_oa_event_source {
 	I915_PERF_OA_EVENT_SOURCE_MAX	/* non-ABI */
 };
 
+#define I915_PERF_MMIO_NUM_MAX	8
+struct drm_i915_perf_mmio_list {
+	__u32 num_mmio;
+	__u32 mmio_list[I915_PERF_MMIO_NUM_MAX];
+};
+
 enum drm_i915_perf_property_id {
 	/**
 	 * Open the stream for a specific context handle (as used with
@@ -1242,6 +1248,13 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_SAMPLE_TS_PROP,
 
+	/**
+	 * This property requests inclusion of mmio register values in the perf
+	 * sample data. The value of this property specifies the address of a
+	 * user struct (drm_i915_perf_mmio_list) holding the register addresses.
+	 */
+	DRM_I915_PERF_SAMPLE_MMIO_PROP,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1294,6 +1307,7 @@ enum drm_i915_perf_record_type {
 	 *     { u32 pid; } && DRM_I915_PERF_SAMPLE_PID_PROP
 	 *     { u32 tag; } && DRM_I915_PERF_SAMPLE_TAG_PROP
 	 *     { u64 timestamp; } && DRM_I915_PERF_SAMPLE_TS_PROP
+	 *     { u32 mmio[]; } && DRM_I915_PERF_SAMPLE_MMIO_PROP
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_SAMPLE_OA_PROP
 	 * };
 	 */
-- 
1.9.1


* Re: [PATCH 03/11] drm/i915: return ctx->global_id from intel_execlists_ctx_id()
  2016-02-16  5:27 ` [PATCH 03/11] drm/i915: return ctx->global_id from intel_execlists_ctx_id() sourab.gupta
@ 2016-02-16  9:34   ` Dave Gordon
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Gordon @ 2016-02-16  9:34 UTC (permalink / raw)
  To: sourab.gupta, Dai, Yu, O'Rourke, Tom
  Cc: Intel Graphics Development, jabin.wu

On 16/02/16 05:27, sourab.gupta@intel.com wrote:
> From: Robert Bragg <robert@sixbynine.org>
>
> The newly added intel_context::global_id is suitable (a globally unique
> 20 bit ID) for giving to the hardware as a unique context identifier.
>
> Compared to using the pinned address of a logical ring context these IDs
> are constant for the lifetime of a context whereas a context could be
> repinned at different addresses during its lifetime.
>
> Having a stable ID is useful when we need to buffer information
> associated with a context based on this ID so the association can't be
> lost. For example the OA unit writes out counter reports to a circular
> buffer tagged with this ID and we want to be able to accurately filter
> reports for a specific context, ideally without the added complexity of
> tracking context re-pinning while the OA buffer may contain reports with
> older IDs.

Unfortunately, I suspect that the GuC firmware has conflated the context 
ID (which ought to be a purely software-defined tag) with the (hardware) 
GTT page address of the LRC. So this will probably break GuC submission.

But it looks like a good idea otherwise; maybe we can get the GuC 
firmware updated so it doesn't make this assumption ...

.Dave.


* Re: [PATCH 06/11] drm/i915: Framework for capturing command stream based OA reports
  2016-02-16  5:27 ` [PATCH 06/11] drm/i915: Framework for capturing command stream based OA reports sourab.gupta
@ 2016-02-17 17:30   ` Robert Bragg
  2016-02-19  6:51     ` sourab gupta
  0 siblings, 1 reply; 15+ messages in thread
From: Robert Bragg @ 2016-02-17 17:30 UTC (permalink / raw)
  To: Gupta, Sourab; +Cc: Jabin Wu, intel-gfx


Hi Sourab,

As Sergio Martinez has started experimenting with this in gputop and
reported seeing lots of ENOSPC errors when reading, I had a look into
this and saw a few issues with how we check that there's data available
to read in command stream mode, and I think there's a possibility of
incorrectly sorting the samples sometimes...

On Tue, Feb 16, 2016 at 5:27 AM, <sourab.gupta@intel.com> wrote:

> From: Sourab Gupta <sourab.gupta@intel.com>
>
>
> -static bool i915_oa_can_read(struct i915_perf_stream *stream)
> +static bool append_oa_rcs_sample(struct i915_perf_stream *stream,
> +                                struct i915_perf_read_state *read_state,
> +                                struct i915_perf_cs_data_node *node)
> +{
> +       struct drm_i915_private *dev_priv = stream->dev_priv;
> +       struct oa_sample_data data = { 0 };
> +       const u8 *report = dev_priv->perf.command_stream_buf.addr +
> +                               node->offset;
> +       u32 sample_flags = stream->sample_flags;
> +       u32 report_ts;
> +
> +       /*
> +        * Forward the periodic OA samples which have the timestamp lower
> +        * than timestamp of this sample, before forwarding this sample.
> +        * This ensures samples read by user are order acc. to their
> timestamps
> +        */
> +       report_ts = *(u32 *)(report + 4);
> +       dev_priv->perf.oa.ops.read(stream, read_state, report_ts);
> +
> +       if (sample_flags & SAMPLE_OA_SOURCE_INFO)
> +               data.source = I915_PERF_OA_EVENT_SOURCE_RCS;
> +
> +       if (sample_flags & SAMPLE_CTX_ID)
> +               data.ctx_id = node->ctx_id;
> +
> +       if (sample_flags & SAMPLE_OA_REPORT)
> +               data.report = report;
> +
> +       append_oa_sample(stream, read_state, &data);
> +
> +       return true;
> +}
> +
> +static void oa_rcs_append_reports(struct i915_perf_stream *stream,
> +                                 struct i915_perf_read_state *read_state)
> +{
> +       struct drm_i915_private *dev_priv = stream->dev_priv;
> +       struct i915_perf_cs_data_node *entry, *next;
> +
> +       list_for_each_entry_safe(entry, next,
> +                                &dev_priv->perf.node_list, link) {
> +               if (!i915_gem_request_completed(entry->request, true))
> +                       break;
> +
> +               if (!append_oa_rcs_sample(stream, read_state, entry))
> +                       break;
> +
> +               spin_lock(&dev_priv->perf.node_list_lock);
> +               list_del(&entry->link);
> +               spin_unlock(&dev_priv->perf.node_list_lock);
> +
> +               i915_gem_request_unreference__unlocked(entry->request);
> +               kfree(entry);
> +       }
> +
> +       /* Flush any remaining periodic reports */
> +       dev_priv->perf.oa.ops.read(stream, read_state, U32_MAX);
>

I don't think we can flush all remaining periodic reports here - at least
not if we have any in-flight MI_RPC commands - since the next request to
complete might have reports with earlier timestamps than some of these
periodic reports.

Even if we have periodic reports available I think we need to throttle
forwarding them based on the command stream requests completing.

This is something that userspace should understand when it explicitly
decides to use command stream mode in conjunction with periodic sampling.


> +}
> +
> +static bool command_stream_buf_is_empty(struct i915_perf_stream *stream)
>  {
>         struct drm_i915_private *dev_priv = stream->dev_priv;
>
> -       return !dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv);
> +       if (stream->cs_mode)
> +               return list_empty(&dev_priv->perf.node_list);
> +       else
> +               return true;
>  }
>

I think this list_empty() check needs a lock around it, as it's called from
stream_have_data__unlocked().

Currently only checking list_empty() can lead to some false positives that
may confuse the current i915_perf_read() (the false positives come from not
checking that at least one entry->request has also completed).

False positives here will affect the error reporting in
i915_perf_read_locked(). The first thing i915_perf_read_locked() does is
check for available data so it can return -EAGAIN for non-blocking file
descriptors or block, waiting for data, for blocking fds. Once we believe
there is data then if stream->read() returns zero then that's translated to
a -ENOSPC error because we 'know' there's data but weren't able to copy
anything to the user's buffer because it wasn't large enough to hold a
complete record.


>
> -static int i915_oa_wait_unlocked(struct i915_perf_stream *stream)
> +static bool stream_have_data__unlocked(struct i915_perf_stream *stream)
>  {
>         struct drm_i915_private *dev_priv = stream->dev_priv;
>
> @@ -397,8 +739,29 @@ static int i915_oa_wait_unlocked(struct
> i915_perf_stream *stream)
>          * can't be destroyed until completion (such as a read()) that
> ensures
>          * the device + OA buffer can't disappear
>          */
> +       return !(dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv) &&
> +                command_stream_buf_is_empty(stream));
> +}
>

In command stream mode, due to how we need to merge-sort reports, I think we
need a more involved set of checks. We need to check that at least one
entry->request in the node_list has completed, or else if the node_list is
empty then we can check if !oa_buffer_is_empty().
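
For example, something along these lines (a rough, untested sketch against
this patch's naming):

static bool command_stream_have_data(struct i915_perf_stream *stream)
{
	struct drm_i915_private *dev_priv = stream->dev_priv;
	struct i915_perf_cs_data_node *entry;
	bool have_data;

	spin_lock(&dev_priv->perf.node_list_lock);
	if (list_empty(&dev_priv->perf.node_list)) {
		/* No CS samples pending, so any periodic reports can be
		 * forwarded immediately. */
		have_data = !dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv);
	} else {
		/* Only a *completed* CS sample counts as readable data,
		 * since periodic reports must be held back behind any
		 * in-flight MI_RPC sample. */
		entry = list_first_entry(&dev_priv->perf.node_list,
					 struct i915_perf_cs_data_node, link);
		have_data = i915_gem_request_completed(entry->request, true);
	}
	spin_unlock(&dev_priv->perf.node_list_lock);

	return have_data;
}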

There's going to be a thorny race to consider though: this code may report
that there is data to read because the node_list is empty and there are
periodic samples, but by the time we call into stream->read() there may be
a new node_list entry that's not yet completed, so the read ends up
returning 0, which could again result in an incorrect ENOSPC error.

We should consider changing stream->read() so it can return -ENOSPC itself
if appropriate and when it returns 0 we would instead either return -EAGAIN
or loop back to waiting for blocking fds.

B.t.w for reference the stream->read() interface is a little different in
the series I sent out recently for Haswell vs the gen8+ oa-next branch
(which I haven't rebased yet) because I had to make some last minute
changes to consider -EFAULT errors if there was a copy_to_user() error
while reading. If you look at this issue, it could be good to also compare
with these related changes I made recently.


> +
> +static bool i915_oa_can_read(struct i915_perf_stream *stream)
> +{
> +
> +       return stream_have_data__unlocked(stream);
> +}
>
>
Kind regards,
- Robert


* Re: [PATCH 06/11] drm/i915: Framework for capturing command stream based OA reports
  2016-02-17 17:30   ` Robert Bragg
@ 2016-02-19  6:51     ` sourab gupta
  0 siblings, 0 replies; 15+ messages in thread
From: sourab gupta @ 2016-02-19  6:51 UTC (permalink / raw)
  To: Robert Bragg; +Cc: Wu, Jabin, intel-gfx@lists.freedesktop.org

On Wed, 2016-02-17 at 23:00 +0530, Robert Bragg wrote:
> Hi Sourab,
> 
> 
> As Sergio Martinez has started experimenting with this in gputop and
> reported seeing lots of ENOSPC errors being reported when reading I
> had a look into this and saw a few issues with how we check that
> there's data available to read in command stream mode, and a I think
> there's a possibility of incorrectly sorting the samples sometimes...

Hi Robert,
Thanks for spotting this anomaly. I'll have this fixed in the next
version of the patch set.
> 
> On Tue, Feb 16, 2016 at 5:27 AM, <sourab.gupta@intel.com> wrote:
>         From: Sourab Gupta <sourab.gupta@intel.com>
>         
>         
>         -static bool i915_oa_can_read(struct i915_perf_stream *stream)
>         +static bool append_oa_rcs_sample(struct i915_perf_stream
>         *stream,
>         +                                struct i915_perf_read_state
>         *read_state,
>         +                                struct i915_perf_cs_data_node
>         *node)
>         +{
>         +       struct drm_i915_private *dev_priv = stream->dev_priv;
>         +       struct oa_sample_data data = { 0 };
>         +       const u8 *report =
>         dev_priv->perf.command_stream_buf.addr +
>         +                               node->offset;
>         +       u32 sample_flags = stream->sample_flags;
>         +       u32 report_ts;
>         +
>         +       /*
>         +        * Forward the periodic OA samples which have the
>         timestamp lower
>         +        * than timestamp of this sample, before forwarding
>         this sample.
>         +        * This ensures samples read by user are order acc. to
>         their timestamps
>         +        */
>         +       report_ts = *(u32 *)(report + 4);
>         +       dev_priv->perf.oa.ops.read(stream, read_state,
>         report_ts);
>         +
>         +       if (sample_flags & SAMPLE_OA_SOURCE_INFO)
>         +               data.source = I915_PERF_OA_EVENT_SOURCE_RCS;
>         +
>         +       if (sample_flags & SAMPLE_CTX_ID)
>         +               data.ctx_id = node->ctx_id;
>         +
>         +       if (sample_flags & SAMPLE_OA_REPORT)
>         +               data.report = report;
>         +
>         +       append_oa_sample(stream, read_state, &data);
>         +
>         +       return true;
>         +}
>         +
>         +static void oa_rcs_append_reports(struct i915_perf_stream
>         *stream,
>         +                                 struct i915_perf_read_state
>         *read_state)
>         +{
>         +       struct drm_i915_private *dev_priv = stream->dev_priv;
>         +       struct i915_perf_cs_data_node *entry, *next;
>         +
>         +       list_for_each_entry_safe(entry, next,
>         +                                &dev_priv->perf.node_list,
>         link) {
>         +               if (!
>         i915_gem_request_completed(entry->request, true))
>         +                       break;
>         +
>         +               if (!append_oa_rcs_sample(stream, read_state,
>         entry))
>         +                       break;
>         +
>         +               spin_lock(&dev_priv->perf.node_list_lock);
>         +               list_del(&entry->link);
>         +               spin_unlock(&dev_priv->perf.node_list_lock);
>         +
>         +
>          i915_gem_request_unreference__unlocked(entry->request);
>         +               kfree(entry);
>         +       }
>         +
>         +       /* Flush any remaining periodic reports */
>         +       dev_priv->perf.oa.ops.read(stream, read_state,
>         U32_MAX);
>  
> I don't think we can flush all remaining periodic reports here - at
> least not if we have any in-flight MI_RPC commands - in case the next
> request to complete might have reports with earlier timestamps than
> some of these periodic reports.
> 
> 
> Even if we have periodic reports available I think we need to throttle
> forwarding them based on the command stream requests completing.
> 
> 
> This is something that userspace should understand when it explicitly
> decides to use command stream mode in conjunction with periodic
> sampling.
> 
I agree, there shouldn't be any flushing of remaining periodic reports
here; instead, any periodic reports remaining should be taken care of
during the next processing of command stream samples.
>  
>         +}
>         +
>         +static bool command_stream_buf_is_empty(struct
>         i915_perf_stream *stream)
>          {
>                 struct drm_i915_private *dev_priv = stream->dev_priv;
>         
>         -       return !
>         dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv);
>         +       if (stream->cs_mode)
>         +               return list_empty(&dev_priv->perf.node_list);
>         +       else
>         +               return true;
>          }
> 
> 
> I think this list_empty() check needs a lock around it, as it's called
> from stream_have_data__unlocked().
> 
> 
> Currently only checking list_empty() can lead to some false positives
> that may confuse the current i915_perf_read() (the false positives
> come from not checking that at least one entry->request has also
> completed).
> 
> False positives here will affect the error reporting in
> i915_perf_read_locked(). The first thing i915_perf_read_locked() does
> is check for available data so it can return -EAGAIN for non-blocking
> file descriptors or block, waiting for data, for blocking fds. Once we
> believe there is data then if stream->read() returns zero then that's
> translated to a -ENOSPC error because we 'know' there's data but
> weren't able to copy anything to the user's buffer because it wasn't
> large enough to hold a complete record.
> 
Thanks for pointing this out. This function should check that at least
one request has completed when the list is not empty. I'll aim to make
this change.
>  
>         
>         -static int i915_oa_wait_unlocked(struct i915_perf_stream
>         *stream)
>         +static bool stream_have_data__unlocked(struct
>         i915_perf_stream *stream)
>          {
>                 struct drm_i915_private *dev_priv = stream->dev_priv;
>         
>         @@ -397,8 +739,29 @@ static int i915_oa_wait_unlocked(struct
>         i915_perf_stream *stream)
>                  * can't be destroyed until completion (such as a
>         read()) that ensures
>                  * the device + OA buffer can't disappear
>                  */
>         +
>          return !(dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv)
>         &&
>         +                command_stream_buf_is_empty(stream));
>         +}
> 
> 
> In command stream mode, due to how we need to merge sort reports I
> think we need a more involved set of checks. We need to check that at
> least one entry->request in the node_list has completed, or else if
> the node_list is empty then we can check if !oa_buffer_is_empty().
> 
> 
> There's going to be a thorny race to consider though: if this code
> ends up reporting that there is data to read because the node_list is
> empty and there are periodic samples, but by the time we call into
> stream->read() there's a new node_list entry that's not completed that
> will end up returning 0 which could again result in an incorrect
> ENOSPC error.
> 
> We should consider changing stream->read() so it can return -ENOSPC
> itself if appropriate and when it returns 0 we would instead either
> return -EAGAIN or loop back to waiting for blocking fds.

Yeah, this race needs to be handled; ideally the best way would be through
changes to stream->read() itself, as you suggested. I'll aim to make this
change (while rebasing on your recent copy_to_user changes).
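
As a rough illustration of the retry semantics being discussed (a sketch
only: stream_read() stands in for a helper that invokes stream->read() and
returns a byte count or a negative error, which is an assumption about the
reworked interface rather than the current one):

for (;;) {
	ret = stream_read(stream, buf, count);	/* may now be -ENOSPC */
	if (ret != 0)
		return ret;	/* bytes copied, or a real error */
	if (file->f_flags & O_NONBLOCK)
		return -EAGAIN;	/* raced: no completed data yet */
	ret = stream->wait_unlocked(stream);	/* block for new data */
	if (ret)
		return ret;
}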

> B.t.w for reference the stream->read() interface is a little different
> in the series I sent out recently for Haswell vs the gen8+ oa-next
> branch (which I haven't rebased yet) because I had to make some last
> minute changes to consider -EFAULT errors if there was a
> copy_to_user() error while reading. If you look at this issue, it
> could be good to also compare with these related changes I made
> recently.
>  
>         +
>         +static bool i915_oa_can_read(struct i915_perf_stream *stream)
>         +{
>         +
>         +       return stream_have_data__unlocked(stream);
>         +}
>         
> 
> 
> Kind regards,
> 
> - Robert
> 


