intel-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/4] drm/i915: Rename I915_CACHE_MLC_LLC to L3_LLC for Ivybridge
@ 2013-08-06 12:17 Chris Wilson
  2013-08-06 12:17 ` [PATCH 2/4] drm/i915: Update rules for reading cache lines through the LLC Chris Wilson
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Chris Wilson @ 2013-08-06 12:17 UTC (permalink / raw)
  To: intel-gfx

MLC_LLC was never validated for Sandybridge and was superseded by a new
level of cacheing for the GPU in Ivybridge. Update our names to be
consistent with usage, and in the process stop setting the unwanted bit
on Sandybridge.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.h         |  7 +++++--
 drivers/gpu/drm/i915/i915_gem_context.c |  2 +-
 drivers/gpu/drm/i915/i915_gem_gtt.c     | 35 +++++++++++++++++++++++++++------
 drivers/gpu/drm/i915/i915_gpu_error.c   |  4 ++--
 4 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index cb2c59d..aa11731 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -450,8 +450,11 @@ struct intel_device_info {
 
 enum i915_cache_level {
 	I915_CACHE_NONE = 0,
-	I915_CACHE_LLC,
-	I915_CACHE_LLC_MLC, /* gen6+, in docs at least! */
+	I915_CACHE_LLC, /* also used for snoopable memory on non-LLC */
+	I915_CACHE_L3_LLC, /* gen7+, L3 sits between the domain specifc
+			      caches, eg sampler/render caches, and the
+			      large Last-Level-Cache. LLC is coherent with
+			      the CPU, but L3 is only visible to the GPU. */
 };
 
 typedef uint32_t gen6_gtt_pte_t;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 30bb01a..db94aca 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -158,7 +158,7 @@ create_hw_context(struct drm_device *dev,
 
 	if (INTEL_INFO(dev)->gen >= 7) {
 		ret = i915_gem_object_set_cache_level(ctx->obj,
-						      I915_CACHE_LLC_MLC);
+						      I915_CACHE_L3_LLC);
 		/* Failure shouldn't ever happen this early */
 		if (WARN_ON(ret))
 			goto err_out;
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 3d779cd..06f4eb3 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -43,7 +43,7 @@
 #define GEN6_PTE_UNCACHED		(1 << 1)
 #define HSW_PTE_UNCACHED		(0)
 #define GEN6_PTE_CACHE_LLC		(2 << 1)
-#define GEN6_PTE_CACHE_LLC_MLC		(3 << 1)
+#define GEN7_PTE_CACHE_L3_LLC		(3 << 1)
 #define GEN6_PTE_ADDR_ENCODE(addr)	GEN6_GTT_ADDR_ENCODE(addr)
 #define HSW_PTE_ADDR_ENCODE(addr)	HSW_GTT_ADDR_ENCODE(addr)
 
@@ -56,15 +56,36 @@
 #define HSW_WB_LLC_AGE0			HSW_CACHEABILITY_CONTROL(0x3)
 #define HSW_WB_ELLC_LLC_AGE0		HSW_CACHEABILITY_CONTROL(0xb)
 
-static gen6_gtt_pte_t gen6_pte_encode(dma_addr_t addr,
-				      enum i915_cache_level level)
+static gen6_gtt_pte_t snb_pte_encode(dma_addr_t addr,
+				     enum i915_cache_level level)
+{
+	gen6_gtt_pte_t pte = GEN6_PTE_VALID;
+	pte |= GEN6_PTE_ADDR_ENCODE(addr);
+
+	switch (level) {
+	case I915_CACHE_L3_LLC:
+	case I915_CACHE_LLC:
+		pte |= GEN6_PTE_CACHE_LLC;
+		break;
+	case I915_CACHE_NONE:
+		pte |= GEN6_PTE_UNCACHED;
+		break;
+	default:
+		BUG();
+	}
+
+	return pte;
+}
+
+static gen6_gtt_pte_t ivb_pte_encode(dma_addr_t addr,
+				     enum i915_cache_level level)
 {
 	gen6_gtt_pte_t pte = GEN6_PTE_VALID;
 	pte |= GEN6_PTE_ADDR_ENCODE(addr);
 
 	switch (level) {
-	case I915_CACHE_LLC_MLC:
-		pte |= GEN6_PTE_CACHE_LLC_MLC;
+	case I915_CACHE_L3_LLC:
+		pte |= GEN7_PTE_CACHE_L3_LLC;
 		break;
 	case I915_CACHE_LLC:
 		pte |= GEN6_PTE_CACHE_LLC;
@@ -880,8 +901,10 @@ int i915_gem_gtt_init(struct drm_device *dev)
 			gtt->base.pte_encode = hsw_pte_encode;
 		else if (IS_VALLEYVIEW(dev))
 			gtt->base.pte_encode = byt_pte_encode;
+		else if (INTEL_INFO(dev)->gen >= 7)
+			gtt->base.pte_encode = ivb_pte_encode;
 		else
-			gtt->base.pte_encode = gen6_pte_encode;
+			gtt->base.pte_encode = snb_pte_encode;
 	}
 
 	ret = gtt->gtt_probe(dev, &gtt->base.total,
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index d970d84..8091485 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -938,8 +938,8 @@ const char *i915_cache_level_str(int type)
 {
 	switch (type) {
 	case I915_CACHE_NONE: return " uncached";
-	case I915_CACHE_LLC: return " snooped (LLC)";
-	case I915_CACHE_LLC_MLC: return " snooped (LLC+MLC)";
+	case I915_CACHE_LLC: return " snooped or LLC";
+	case I915_CACHE_L3_LLC: return " L3+LLC";
 	default: return "";
 	}
 }
-- 
1.8.4.rc1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/4] drm/i915: Update rules for reading cache lines through the LLC
  2013-08-06 12:17 [PATCH 1/4] drm/i915: Rename I915_CACHE_MLC_LLC to L3_LLC for Ivybridge Chris Wilson
@ 2013-08-06 12:17 ` Chris Wilson
  2013-08-06 13:31   ` Daniel Vetter
  2013-08-06 12:17 ` [PATCH 3/4] drm/i915: Update rules for writing through the LLC with the cpu Chris Wilson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Chris Wilson @ 2013-08-06 12:17 UTC (permalink / raw)
  To: intel-gfx

The LLC is a fun device. The cache is a distinct functional block within
the SA that arbitrates access from both the CPU and GPU cores. As such
all writes to memory land first in the LLC before further action is
taken. For example, an uncached write from either the CPU or GPU will
then proceed to memory and evict the cacheline from the LLC. This means that
a read from the LLC always returns the correct information even if the PTE
bit in the GPU differs from the PAT bit in the CPU. For the older
snooping architecture on non-LLC, the fundamental principle still holds
except that some coordination is required between the CPU and GPU to
explicitly perform the snooping (which is handled by our request
tracking).

The upshot of this is that we know that we can issue a read from either
LLC devices or snoopable memory and trust the contents of the cache -
i.e. we can forgo a clflush before a read in these circumstances.
Writing to memory from the CPU is a little more tricky as we have to
consider that the scanout does not read from the CPU cache at all, but
from main memory. So we have to currently treat all requests to write to
uncached memory as having to be flushed to main memory for coherency
with all consumers.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index eec6fcb..5671dab 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -60,6 +60,12 @@ static long i915_gem_purge(struct drm_i915_private *dev_priv, long target);
 static void i915_gem_shrink_all(struct drm_i915_private *dev_priv);
 static void i915_gem_object_truncate(struct drm_i915_gem_object *obj);
 
+static bool cpu_cache_is_coherent(struct drm_device *dev,
+				  enum i915_cache_level level)
+{
+	return HAS_LLC(dev) || level != I915_CACHE_NONE;
+}
+
 static inline void i915_gem_object_fence_lost(struct drm_i915_gem_object *obj)
 {
 	if (obj->tiling_mode)
@@ -510,8 +516,7 @@ i915_gem_shmem_pread(struct drm_device *dev,
 		 * read domain and manually flush cachelines (if required). This
 		 * optimizes for the case when the gpu will dirty the data
 		 * anyway again before the next pread happens. */
-		if (obj->cache_level == I915_CACHE_NONE)
-			needs_clflush = 1;
+		needs_clflush = !cpu_cache_is_coherent(dev, obj->cache_level);
 		if (i915_gem_obj_ggtt_bound(obj)) {
 			ret = i915_gem_object_set_to_gtt_domain(obj, false);
 			if (ret)
@@ -835,11 +840,11 @@ i915_gem_shmem_pwrite(struct drm_device *dev,
 				return ret;
 		}
 	}
-	/* Same trick applies for invalidate partially written cachelines before
-	 * writing.  */
-	if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU)
-	    && obj->cache_level == I915_CACHE_NONE)
-		needs_clflush_before = 1;
+	/* Same trick applies to invalidate partially written cachelines read
+	 * before writing. */
+	if ((obj->base.read_domains & I915_GEM_DOMAIN_CPU) == 0)
+		needs_clflush_before =
+			!cpu_cache_is_coherent(dev, obj->cache_level);
 
 	ret = i915_gem_object_get_pages(obj);
 	if (ret)
@@ -3650,7 +3655,8 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
 
 	/* Flush the CPU cache if it's still invalid. */
 	if ((obj->base.read_domains & I915_GEM_DOMAIN_CPU) == 0) {
-		i915_gem_clflush_object(obj);
+		if (!cpu_cache_is_coherent(obj->base.dev, obj->cache_level))
+			i915_gem_clflush_object(obj);
 
 		obj->base.read_domains |= I915_GEM_DOMAIN_CPU;
 	}
-- 
1.8.4.rc1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/4] drm/i915: Update rules for writing through the LLC with the cpu
  2013-08-06 12:17 [PATCH 1/4] drm/i915: Rename I915_CACHE_MLC_LLC to L3_LLC for Ivybridge Chris Wilson
  2013-08-06 12:17 ` [PATCH 2/4] drm/i915: Update rules for reading cache lines through the LLC Chris Wilson
@ 2013-08-06 12:17 ` Chris Wilson
  2013-08-06 14:24   ` Ville Syrjälä
  2013-08-06 12:17 ` [PATCH 4/4] drm/i915: Only do a chipset flush after a clflush Chris Wilson
  2013-08-06 14:25 ` [PATCH 1/4] drm/i915: Rename I915_CACHE_MLC_LLC to L3_LLC for Ivybridge Ville Syrjälä
  3 siblings, 1 reply; 14+ messages in thread
From: Chris Wilson @ 2013-08-06 12:17 UTC (permalink / raw)
  To: intel-gfx

As mentioned in the previous commit, reads and writes from both the CPU
and GPU go through the LLC. This gives us coherency between the CPU and
GPU irrespective of the attribute settings either device sets. We can
use to avoid having to clflush even uncached memory.

Except for the scanout.

The scanout resides within another functional block that does not use
the LLC but reads directly from main memory. So in order to maintain
coherency with the scanout, writes to uncached memory must be flushed.
In order to optimize writes elsewhere, we start tracking whether an
framebuffer is attached to an object.

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.h      |  3 +++
 drivers/gpu/drm/i915/i915_gem.c      | 27 +++++++++++++++++----------
 drivers/gpu/drm/i915/intel_display.c |  3 +++
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index aa11731..93c4789 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1407,6 +1407,9 @@ struct drm_i915_gem_object {
 	/** for phy allocated objects */
 	struct drm_i915_gem_phys_object *phys_obj;
 
+	/** Track framebuffers associated with this object */
+	atomic_t fb_count;
+
 	union {
 		struct i915_gem_userptr {
 			uintptr_t ptr;
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 5671dab..9805693 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -66,6 +66,14 @@ static bool cpu_cache_is_coherent(struct drm_device *dev,
 	return HAS_LLC(dev) || level != I915_CACHE_NONE;
 }
 
+static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
+{
+	if (!cpu_cache_is_coherent(obj->base.dev, obj->cache_level))
+		return true;
+
+	return atomic_read(&obj->fb_count) > 0;
+}
+
 static inline void i915_gem_object_fence_lost(struct drm_i915_gem_object *obj)
 {
 	if (obj->tiling_mode)
@@ -832,8 +840,7 @@ i915_gem_shmem_pwrite(struct drm_device *dev,
 		 * write domain and manually flush cachelines (if required). This
 		 * optimizes for the case when the gpu will use the data
 		 * right away and we therefore have to clflush anyway. */
-		if (obj->cache_level == I915_CACHE_NONE)
-			needs_clflush_after = 1;
+		needs_clflush_after = cpu_write_needs_clflush(obj);
 		if (i915_gem_obj_ggtt_bound(obj)) {
 			ret = i915_gem_object_set_to_gtt_domain(obj, true);
 			if (ret)
@@ -1001,9 +1008,8 @@ i915_gem_pwrite_ioctl(struct drm_device *dev, void *data,
 		goto out;
 	}
 
-	if (obj->cache_level == I915_CACHE_NONE &&
-	    obj->tiling_mode == I915_TILING_NONE &&
-	    obj->base.write_domain != I915_GEM_DOMAIN_CPU) {
+	if (obj->tiling_mode == I915_TILING_NONE &&
+	    !cpu_write_needs_clflush(obj)) {
 		ret = i915_gem_gtt_pwrite_fast(dev, obj, args, file);
 		/* Note that the gtt paths might fail with non-page-backed user
 		 * pointers (e.g. gtt mappings when moving data between
@@ -3299,7 +3305,7 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj)
 	 * snooping behaviour occurs naturally as the result of our domain
 	 * tracking.
 	 */
-	if (obj->cache_level != I915_CACHE_NONE)
+	if (!cpu_write_needs_clflush(obj))
 		return;
 
 	trace_i915_gem_object_clflush(obj);
@@ -3462,7 +3468,11 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
 					       obj, cache_level);
 	}
 
-	if (cache_level == I915_CACHE_NONE) {
+	list_for_each_entry(vma, &obj->vma_list, vma_link)
+		vma->node.color = cache_level;
+	obj->cache_level = cache_level;
+
+	if (cpu_write_needs_clflush(obj)) {
 		u32 old_read_domains, old_write_domain;
 
 		/* If we're coming from LLC cached, then we haven't
@@ -3485,9 +3495,6 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
 						    old_write_domain);
 	}
 
-	list_for_each_entry(vma, &obj->vma_list, vma_link)
-		vma->node.color = cache_level;
-	obj->cache_level = cache_level;
 	i915_gem_verify_gtt(dev);
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 0584480..cc5eaba 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -9443,6 +9443,8 @@ static void intel_user_framebuffer_destroy(struct drm_framebuffer *fb)
 	struct intel_framebuffer *intel_fb = to_intel_framebuffer(fb);
 
 	drm_framebuffer_cleanup(fb);
+
+	atomic_dec(&intel_fb->obj->fb_count);
 	drm_gem_object_unreference_unlocked(&intel_fb->obj->base);
 
 	kfree(intel_fb);
@@ -9568,6 +9570,7 @@ int intel_framebuffer_init(struct drm_device *dev,
 		return ret;
 	}
 
+	atomic_inc(&obj->fb_count);
 	return 0;
 }
 
-- 
1.8.4.rc1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/4] drm/i915: Only do a chipset flush after a clflush
  2013-08-06 12:17 [PATCH 1/4] drm/i915: Rename I915_CACHE_MLC_LLC to L3_LLC for Ivybridge Chris Wilson
  2013-08-06 12:17 ` [PATCH 2/4] drm/i915: Update rules for reading cache lines through the LLC Chris Wilson
  2013-08-06 12:17 ` [PATCH 3/4] drm/i915: Update rules for writing through the LLC with the cpu Chris Wilson
@ 2013-08-06 12:17 ` Chris Wilson
  2013-08-06 14:25 ` [PATCH 1/4] drm/i915: Rename I915_CACHE_MLC_LLC to L3_LLC for Ivybridge Ville Syrjälä
  3 siblings, 0 replies; 14+ messages in thread
From: Chris Wilson @ 2013-08-06 12:17 UTC (permalink / raw)
  To: intel-gfx

Now that we skip clflushes more often, return a boolean indicating
whether the clflush was actually performed, and only if it was do the
chipset flush. (Though on most of the architectures where the clflush will
be skipped, the chipset flush is a no-op!)

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |  2 +-
 drivers/gpu/drm/i915/i915_gem.c            | 19 ++++++++++---------
 drivers/gpu/drm/i915/i915_gem_exec.c       |  4 ++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  5 +++--
 4 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 93c4789..c07dfb9 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1845,7 +1845,7 @@ static inline bool i915_terminally_wedged(struct i915_gpu_error *error)
 }
 
 void i915_gem_reset(struct drm_device *dev);
-void i915_gem_clflush_object(struct drm_i915_gem_object *obj);
+bool i915_gem_clflush_object(struct drm_i915_gem_object *obj);
 int __must_check i915_gem_object_set_domain(struct drm_i915_gem_object *obj,
 					    uint32_t read_domains,
 					    uint32_t write_domain);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9805693..7b4c59c 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -930,8 +930,8 @@ out:
 		 */
 		if (!needs_clflush_after &&
 		    obj->base.write_domain != I915_GEM_DOMAIN_CPU) {
-			i915_gem_clflush_object(obj);
-			i915_gem_chipset_flush(dev);
+			if (i915_gem_clflush_object(obj))
+				i915_gem_chipset_flush(dev);
 		}
 	}
 
@@ -3280,7 +3280,7 @@ err_unpin:
 	return ret;
 }
 
-void
+bool
 i915_gem_clflush_object(struct drm_i915_gem_object *obj)
 {
 	/* If we don't have a page list set up, then we're not pinned
@@ -3288,14 +3288,14 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj)
 	 * again at bind time.
 	 */
 	if (obj->pages == NULL)
-		return;
+		return false;
 
 	/*
 	 * Stolen memory is always coherent with the GPU as it is explicitly
 	 * marked as wc by the system, or the system is cache-coherent.
 	 */
 	if (obj->stolen)
-		return;
+		return false;
 
 	/* If the GPU is snooping the contents of the CPU cache,
 	 * we do not need to manually clear the CPU cache lines.  However,
@@ -3306,11 +3306,12 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj)
 	 * tracking.
 	 */
 	if (!cpu_write_needs_clflush(obj))
-		return;
+		return false;
 
 	trace_i915_gem_object_clflush(obj);
-
 	drm_clflush_sg(obj->pages);
+
+	return true;
 }
 
 /** Flushes the GTT write domain for the object if it's dirty. */
@@ -3349,8 +3350,8 @@ i915_gem_object_flush_cpu_write_domain(struct drm_i915_gem_object *obj)
 	if (obj->base.write_domain != I915_GEM_DOMAIN_CPU)
 		return;
 
-	i915_gem_clflush_object(obj);
-	i915_gem_chipset_flush(obj->base.dev);
+	if (i915_gem_clflush_object(obj))
+		i915_gem_chipset_flush(obj->base.dev);
 	old_write_domain = obj->base.write_domain;
 	obj->base.write_domain = 0;
 
diff --git a/drivers/gpu/drm/i915/i915_gem_exec.c b/drivers/gpu/drm/i915/i915_gem_exec.c
index d2ac077..2396b72 100644
--- a/drivers/gpu/drm/i915/i915_gem_exec.c
+++ b/drivers/gpu/drm/i915/i915_gem_exec.c
@@ -50,8 +50,8 @@ static int i915_gem_exec_flush_object(struct drm_i915_gem_object *obj,
 		return ret;
 
 	if (obj->base.write_domain & I915_GEM_DOMAIN_CPU) {
-		i915_gem_clflush_object(obj);
-		i915_gem_chipset_flush(obj->base.dev);
+		if (i915_gem_clflush_object(obj))
+			i915_gem_chipset_flush(obj->base.dev);
 		obj->base.write_domain &= ~I915_GEM_DOMAIN_CPU;
 	}
 	if (obj->base.write_domain & I915_GEM_DOMAIN_GTT) {
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 9939d2e..4d2fff5 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -708,6 +708,7 @@ i915_gem_execbuffer_move_to_gpu(struct intel_ring_buffer *ring,
 {
 	struct drm_i915_gem_object *obj;
 	uint32_t flush_domains = 0;
+	bool flush_chipset = false;
 	int ret;
 
 	list_for_each_entry(obj, objects, exec_list) {
@@ -716,12 +717,12 @@ i915_gem_execbuffer_move_to_gpu(struct intel_ring_buffer *ring,
 			return ret;
 
 		if (obj->base.write_domain & I915_GEM_DOMAIN_CPU)
-			i915_gem_clflush_object(obj);
+			flush_chipset |= i915_gem_clflush_object(obj);
 
 		flush_domains |= obj->base.write_domain;
 	}
 
-	if (flush_domains & I915_GEM_DOMAIN_CPU)
+	if (flush_chipset)
 		i915_gem_chipset_flush(ring->dev);
 
 	if (flush_domains & I915_GEM_DOMAIN_GTT)
-- 
1.8.4.rc1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/4] drm/i915: Update rules for reading cache lines through the LLC
  2013-08-06 12:17 ` [PATCH 2/4] drm/i915: Update rules for reading cache lines through the LLC Chris Wilson
@ 2013-08-06 13:31   ` Daniel Vetter
  2013-08-06 16:20     ` Chris Wilson
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Vetter @ 2013-08-06 13:31 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

On Tue, Aug 06, 2013 at 01:17:03PM +0100, Chris Wilson wrote:
> The LLC is a fun device. The cache is a distinct functional block within
> the SA that arbitrates access from both the CPU and GPU cores. As such
> all writes to memory land first in the LLC before further action is
> taken. For example, an uncached write from either the CPU or GPU will
> then proceed to memory and evict the cacheline from the LLC. This means that
> a read from the LLC always returns the correct information even if the PTE
> bit in the GPU differs from the PAT bit in the CPU. For the older
> snooping architecture on non-LLC, the fundamental principle still holds
> except that some coordination is required between the CPU and GPU to
> explicitly perform the snooping (which is handled by our request
> tracking).
> 
> The upshot of this is that we know that we can issue a read from either
> LLC devices or snoopable memory and trust the contents of the cache -
> i.e. we can forgo a clflush before a read in these circumstances.
> Writing to memory from the CPU is a little more tricky as we have to
> consider that the scanout does not read from the CPU cache at all, but
> from main memory. So we have to currently treat all requests to write to
> uncached memory as having to be flushed to main memory for coherency
> with all consumers.

Since this is a behavioural change wrt cache coherency can we please have
an igt testcase to exercise pwrite/pread coherency on uncached buffers on
LLC platforms? With the set_caching ioctl it should be fairly easy to add
another subtest to the relevant existing igts.

/me is simply too paranoid about this stuff

Cheers, Daniel

> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index eec6fcb..5671dab 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -60,6 +60,12 @@ static long i915_gem_purge(struct drm_i915_private *dev_priv, long target);
>  static void i915_gem_shrink_all(struct drm_i915_private *dev_priv);
>  static void i915_gem_object_truncate(struct drm_i915_gem_object *obj);
>  
> +static bool cpu_cache_is_coherent(struct drm_device *dev,
> +				  enum i915_cache_level level)
> +{
> +	return HAS_LLC(dev) || level != I915_CACHE_NONE;
> +}
> +
>  static inline void i915_gem_object_fence_lost(struct drm_i915_gem_object *obj)
>  {
>  	if (obj->tiling_mode)
> @@ -510,8 +516,7 @@ i915_gem_shmem_pread(struct drm_device *dev,
>  		 * read domain and manually flush cachelines (if required). This
>  		 * optimizes for the case when the gpu will dirty the data
>  		 * anyway again before the next pread happens. */
> -		if (obj->cache_level == I915_CACHE_NONE)
> -			needs_clflush = 1;
> +		needs_clflush = !cpu_cache_is_coherent(dev, obj->cache_level);
>  		if (i915_gem_obj_ggtt_bound(obj)) {
>  			ret = i915_gem_object_set_to_gtt_domain(obj, false);
>  			if (ret)
> @@ -835,11 +840,11 @@ i915_gem_shmem_pwrite(struct drm_device *dev,
>  				return ret;
>  		}
>  	}
> -	/* Same trick applies for invalidate partially written cachelines before
> -	 * writing.  */
> -	if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU)
> -	    && obj->cache_level == I915_CACHE_NONE)
> -		needs_clflush_before = 1;
> +	/* Same trick applies to invalidate partially written cachelines read
> +	 * before writing. */
> +	if ((obj->base.read_domains & I915_GEM_DOMAIN_CPU) == 0)
> +		needs_clflush_before =
> +			!cpu_cache_is_coherent(dev, obj->cache_level);
>  
>  	ret = i915_gem_object_get_pages(obj);
>  	if (ret)
> @@ -3650,7 +3655,8 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
>  
>  	/* Flush the CPU cache if it's still invalid. */
>  	if ((obj->base.read_domains & I915_GEM_DOMAIN_CPU) == 0) {
> -		i915_gem_clflush_object(obj);
> +		if (!cpu_cache_is_coherent(obj->base.dev, obj->cache_level))
> +			i915_gem_clflush_object(obj);
>  
>  		obj->base.read_domains |= I915_GEM_DOMAIN_CPU;
>  	}
> -- 
> 1.8.4.rc1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] drm/i915: Update rules for writing through the LLC with the cpu
  2013-08-06 12:17 ` [PATCH 3/4] drm/i915: Update rules for writing through the LLC with the cpu Chris Wilson
@ 2013-08-06 14:24   ` Ville Syrjälä
  2013-08-06 16:16     ` Chris Wilson
  0 siblings, 1 reply; 14+ messages in thread
From: Ville Syrjälä @ 2013-08-06 14:24 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

On Tue, Aug 06, 2013 at 01:17:04PM +0100, Chris Wilson wrote:
> As mentioned in the previous commit, reads and writes from both the CPU
> and GPU go through the LLC. This gives us coherency between the CPU and
> GPU irrespective of the attribute settings either device sets. We can
> use to avoid having to clflush even uncached memory.
> 
> Except for the scanout.
> 
> The scanout resides within another functional block that does not use
> the LLC but reads directly from main memory. So in order to maintain
> coherency with the scanout, writes to uncached memory must be flushed.
> In order to optimize writes elsewhere, we start tracking whether an
> framebuffer is attached to an object.
> 
> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_drv.h      |  3 +++
>  drivers/gpu/drm/i915/i915_gem.c      | 27 +++++++++++++++++----------
>  drivers/gpu/drm/i915/intel_display.c |  3 +++
>  3 files changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index aa11731..93c4789 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1407,6 +1407,9 @@ struct drm_i915_gem_object {
>  	/** for phy allocated objects */
>  	struct drm_i915_gem_phys_object *phys_obj;
>  
> +	/** Track framebuffers associated with this object */
> +	atomic_t fb_count;
> +
>  	union {
>  		struct i915_gem_userptr {
>  			uintptr_t ptr;
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 5671dab..9805693 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -66,6 +66,14 @@ static bool cpu_cache_is_coherent(struct drm_device *dev,
>  	return HAS_LLC(dev) || level != I915_CACHE_NONE;
>  }
>  
> +static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
> +{
> +	if (!cpu_cache_is_coherent(obj->base.dev, obj->cache_level))
> +		return true;
> +
> +	return atomic_read(&obj->fb_count) > 0;
> +}

I was thinking a bit about this fb_count thing. The other option would
be to track actual scanout activity, which wouldn't be hard to do, but
I guess that could shift a bunch of the clflush cost to page flip time.
So maybe we don't want that?

Also i915_gem_sw_finish_ioctl() still looks at pin_count as an
indication whether scanout is happening. We should add an fb_count
check there as well. Maybe we should leave the pin_count check there,
so that framebuffers that aren't being scanned out currently can be
massaged a bit more freely before a clflush needs to be issued?

> +
>  static inline void i915_gem_object_fence_lost(struct drm_i915_gem_object *obj)
>  {
>  	if (obj->tiling_mode)
> @@ -832,8 +840,7 @@ i915_gem_shmem_pwrite(struct drm_device *dev,
>  		 * write domain and manually flush cachelines (if required). This
>  		 * optimizes for the case when the gpu will use the data
>  		 * right away and we therefore have to clflush anyway. */
> -		if (obj->cache_level == I915_CACHE_NONE)
> -			needs_clflush_after = 1;
> +		needs_clflush_after = cpu_write_needs_clflush(obj);
>  		if (i915_gem_obj_ggtt_bound(obj)) {
>  			ret = i915_gem_object_set_to_gtt_domain(obj, true);
>  			if (ret)
> @@ -1001,9 +1008,8 @@ i915_gem_pwrite_ioctl(struct drm_device *dev, void *data,
>  		goto out;
>  	}
>  
> -	if (obj->cache_level == I915_CACHE_NONE &&
> -	    obj->tiling_mode == I915_TILING_NONE &&
> -	    obj->base.write_domain != I915_GEM_DOMAIN_CPU) {
> +	if (obj->tiling_mode == I915_TILING_NONE &&
> +	    !cpu_write_needs_clflush(obj)) {

This would break non-LLC platforms, I think. Before, GTT writes were
used for non-snooped buffers w/ clean CPU caches (write domain != cpu),
which makes sense since access via GTT doesn't snoop ever according
to the docs. Now I think it'll do GTT writes for snooped non-scanout
buffers.

>  		ret = i915_gem_gtt_pwrite_fast(dev, obj, args, file);
>  		/* Note that the gtt paths might fail with non-page-backed user
>  		 * pointers (e.g. gtt mappings when moving data between
> @@ -3299,7 +3305,7 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj)
>  	 * snooping behaviour occurs naturally as the result of our domain
>  	 * tracking.
>  	 */
> -	if (obj->cache_level != I915_CACHE_NONE)
> +	if (!cpu_write_needs_clflush(obj))
>  		return;

I would actually prefer to kill this check completely, and move the
decision whether to do a clflush to the callers.

There are three places that I spotted which lack such a check;
i915_gem_execbuffer_move_to_gpu() and i915_gem_object_set_to_cpu_domain()
need to flush only when !cpu_cache_is_coherent(), whereas
i915_gem_object_flush_cpu_write_domain() needs to flush for scanout as
well.

>  
>  	trace_i915_gem_object_clflush(obj);
> @@ -3462,7 +3468,11 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
>  					       obj, cache_level);
>  	}
>  
> -	if (cache_level == I915_CACHE_NONE) {
> +	list_for_each_entry(vma, &obj->vma_list, vma_link)
> +		vma->node.color = cache_level;
> +	obj->cache_level = cache_level;
> +
> +	if (cpu_write_needs_clflush(obj)) {
>  		u32 old_read_domains, old_write_domain;
>  
>  		/* If we're coming from LLC cached, then we haven't
> @@ -3485,9 +3495,6 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
>  						    old_write_domain);
>  	}
>  
> -	list_for_each_entry(vma, &obj->vma_list, vma_link)
> -		vma->node.color = cache_level;
> -	obj->cache_level = cache_level;
>  	i915_gem_verify_gtt(dev);
>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index 0584480..cc5eaba 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -9443,6 +9443,8 @@ static void intel_user_framebuffer_destroy(struct drm_framebuffer *fb)
>  	struct intel_framebuffer *intel_fb = to_intel_framebuffer(fb);
>  
>  	drm_framebuffer_cleanup(fb);
> +
> +	atomic_dec(&intel_fb->obj->fb_count);

Where did the the other atomic_dec() go (was in intel_fbdev_destroy())?

>  	drm_gem_object_unreference_unlocked(&intel_fb->obj->base);
>  
>  	kfree(intel_fb);
> @@ -9568,6 +9570,7 @@ int intel_framebuffer_init(struct drm_device *dev,
>  		return ret;
>  	}
>  
> +	atomic_inc(&obj->fb_count);
>  	return 0;
>  }
>  
> -- 
> 1.8.4.rc1

-- 
Ville Syrjälä
Intel OTC

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] drm/i915: Rename I915_CACHE_MLC_LLC to L3_LLC for Ivybridge
  2013-08-06 12:17 [PATCH 1/4] drm/i915: Rename I915_CACHE_MLC_LLC to L3_LLC for Ivybridge Chris Wilson
                   ` (2 preceding siblings ...)
  2013-08-06 12:17 ` [PATCH 4/4] drm/i915: Only do a chipset flush after a clflush Chris Wilson
@ 2013-08-06 14:25 ` Ville Syrjälä
  2013-08-06 14:36   ` Daniel Vetter
  3 siblings, 1 reply; 14+ messages in thread
From: Ville Syrjälä @ 2013-08-06 14:25 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

On Tue, Aug 06, 2013 at 01:17:02PM +0100, Chris Wilson wrote:
> MLC_LLC was never validated for Sandybridge and was superseded by a new
> level of cacheing for the GPU in Ivybridge. Update our names to be
> consistent with usage, and in the process stop setting the unwanted bit
> on Sandybridge.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         |  7 +++++--
>  drivers/gpu/drm/i915/i915_gem_context.c |  2 +-
>  drivers/gpu/drm/i915/i915_gem_gtt.c     | 35 +++++++++++++++++++++++++++------
>  drivers/gpu/drm/i915/i915_gpu_error.c   |  4 ++--
>  4 files changed, 37 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index cb2c59d..aa11731 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -450,8 +450,11 @@ struct intel_device_info {
>  
>  enum i915_cache_level {
>  	I915_CACHE_NONE = 0,
> -	I915_CACHE_LLC,
> -	I915_CACHE_LLC_MLC, /* gen6+, in docs at least! */
> +	I915_CACHE_LLC, /* also used for snoopable memory on non-LLC */
> +	I915_CACHE_L3_LLC, /* gen7+, L3 sits between the domain specifc
> +			      caches, eg sampler/render caches, and the
> +			      large Last-Level-Cache. LLC is coherent with
> +			      the CPU, but L3 is only visible to the GPU. */
>  };
>  
>  typedef uint32_t gen6_gtt_pte_t;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 30bb01a..db94aca 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -158,7 +158,7 @@ create_hw_context(struct drm_device *dev,
>  
>  	if (INTEL_INFO(dev)->gen >= 7) {
>  		ret = i915_gem_object_set_cache_level(ctx->obj,
> -						      I915_CACHE_LLC_MLC);
> +						      I915_CACHE_L3_LLC);
>  		/* Failure shouldn't ever happen this early */
>  		if (WARN_ON(ret))
>  			goto err_out;
> diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
> index 3d779cd..06f4eb3 100644
> --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> @@ -43,7 +43,7 @@
>  #define GEN6_PTE_UNCACHED		(1 << 1)
>  #define HSW_PTE_UNCACHED		(0)
>  #define GEN6_PTE_CACHE_LLC		(2 << 1)
> -#define GEN6_PTE_CACHE_LLC_MLC		(3 << 1)
> +#define GEN7_PTE_CACHE_L3_LLC		(3 << 1)

BSpec is quite confused about these bits (PPGTT text says one thing,
GGTT text seems to have been copied from MOCS text, MOCS matches PTE
for SNB, not for IVB, etc.). But unless we find someone who can tell
us more, or we come up with some test to really verify this stuff I
think it's as good as it gets.

So,
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>

>  #define GEN6_PTE_ADDR_ENCODE(addr)	GEN6_GTT_ADDR_ENCODE(addr)
>  #define HSW_PTE_ADDR_ENCODE(addr)	HSW_GTT_ADDR_ENCODE(addr)
>  
> @@ -56,15 +56,36 @@
>  #define HSW_WB_LLC_AGE0			HSW_CACHEABILITY_CONTROL(0x3)
>  #define HSW_WB_ELLC_LLC_AGE0		HSW_CACHEABILITY_CONTROL(0xb)
>  
> -static gen6_gtt_pte_t gen6_pte_encode(dma_addr_t addr,
> -				      enum i915_cache_level level)
> +static gen6_gtt_pte_t snb_pte_encode(dma_addr_t addr,
> +				     enum i915_cache_level level)
> +{
> +	gen6_gtt_pte_t pte = GEN6_PTE_VALID;
> +	pte |= GEN6_PTE_ADDR_ENCODE(addr);
> +
> +	switch (level) {
> +	case I915_CACHE_L3_LLC:
> +	case I915_CACHE_LLC:
> +		pte |= GEN6_PTE_CACHE_LLC;
> +		break;
> +	case I915_CACHE_NONE:
> +		pte |= GEN6_PTE_UNCACHED;
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	return pte;
> +}
> +
> +static gen6_gtt_pte_t ivb_pte_encode(dma_addr_t addr,
> +				     enum i915_cache_level level)
>  {
>  	gen6_gtt_pte_t pte = GEN6_PTE_VALID;
>  	pte |= GEN6_PTE_ADDR_ENCODE(addr);
>  
>  	switch (level) {
> -	case I915_CACHE_LLC_MLC:
> -		pte |= GEN6_PTE_CACHE_LLC_MLC;
> +	case I915_CACHE_L3_LLC:
> +		pte |= GEN7_PTE_CACHE_L3_LLC;
>  		break;
>  	case I915_CACHE_LLC:
>  		pte |= GEN6_PTE_CACHE_LLC;
> @@ -880,8 +901,10 @@ int i915_gem_gtt_init(struct drm_device *dev)
>  			gtt->base.pte_encode = hsw_pte_encode;
>  		else if (IS_VALLEYVIEW(dev))
>  			gtt->base.pte_encode = byt_pte_encode;
> +		else if (INTEL_INFO(dev)->gen >= 7)
> +			gtt->base.pte_encode = ivb_pte_encode;
>  		else
> -			gtt->base.pte_encode = gen6_pte_encode;
> +			gtt->base.pte_encode = snb_pte_encode;
>  	}
>  
>  	ret = gtt->gtt_probe(dev, &gtt->base.total,
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index d970d84..8091485 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -938,8 +938,8 @@ const char *i915_cache_level_str(int type)
>  {
>  	switch (type) {
>  	case I915_CACHE_NONE: return " uncached";
> -	case I915_CACHE_LLC: return " snooped (LLC)";
> -	case I915_CACHE_LLC_MLC: return " snooped (LLC+MLC)";
> +	case I915_CACHE_LLC: return " snooped or LLC";
> +	case I915_CACHE_L3_LLC: return " L3+LLC";
>  	default: return "";
>  	}
>  }
> -- 
> 1.8.4.rc1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Ville Syrjälä
Intel OTC

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] drm/i915: Rename I915_CACHE_MLC_LLC to L3_LLC for Ivybridge
  2013-08-06 14:25 ` [PATCH 1/4] drm/i915: Rename I915_CACHE_MLC_LLC to L3_LLC for Ivybridge Ville Syrjälä
@ 2013-08-06 14:36   ` Daniel Vetter
  0 siblings, 0 replies; 14+ messages in thread
From: Daniel Vetter @ 2013-08-06 14:36 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: intel-gfx

On Tue, Aug 06, 2013 at 05:25:22PM +0300, Ville Syrjälä wrote:
> On Tue, Aug 06, 2013 at 01:17:02PM +0100, Chris Wilson wrote:
> > MLC_LLC was never validated for Sandybridge and was superseded by a new
> > level of cacheing for the GPU in Ivybridge. Update our names to be
> > consistent with usage, and in the process stop setting the unwanted bit
> > on Sandybridge.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >  drivers/gpu/drm/i915/i915_drv.h         |  7 +++++--
> >  drivers/gpu/drm/i915/i915_gem_context.c |  2 +-
> >  drivers/gpu/drm/i915/i915_gem_gtt.c     | 35 +++++++++++++++++++++++++++------
> >  drivers/gpu/drm/i915/i915_gpu_error.c   |  4 ++--
> >  4 files changed, 37 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > index cb2c59d..aa11731 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -450,8 +450,11 @@ struct intel_device_info {
> >  
> >  enum i915_cache_level {
> >  	I915_CACHE_NONE = 0,
> > -	I915_CACHE_LLC,
> > -	I915_CACHE_LLC_MLC, /* gen6+, in docs at least! */
> > +	I915_CACHE_LLC, /* also used for snoopable memory on non-LLC */
> > +	I915_CACHE_L3_LLC, /* gen7+, L3 sits between the domain specifc
> > +			      caches, eg sampler/render caches, and the
> > +			      large Last-Level-Cache. LLC is coherent with
> > +			      the CPU, but L3 is only visible to the GPU. */
> >  };
> >  
> >  typedef uint32_t gen6_gtt_pte_t;
> > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> > index 30bb01a..db94aca 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > @@ -158,7 +158,7 @@ create_hw_context(struct drm_device *dev,
> >  
> >  	if (INTEL_INFO(dev)->gen >= 7) {
> >  		ret = i915_gem_object_set_cache_level(ctx->obj,
> > -						      I915_CACHE_LLC_MLC);
> > +						      I915_CACHE_L3_LLC);
> >  		/* Failure shouldn't ever happen this early */
> >  		if (WARN_ON(ret))
> >  			goto err_out;
> > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > index 3d779cd..06f4eb3 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > @@ -43,7 +43,7 @@
> >  #define GEN6_PTE_UNCACHED		(1 << 1)
> >  #define HSW_PTE_UNCACHED		(0)
> >  #define GEN6_PTE_CACHE_LLC		(2 << 1)
> > -#define GEN6_PTE_CACHE_LLC_MLC		(3 << 1)
> > +#define GEN7_PTE_CACHE_L3_LLC		(3 << 1)
> 
> BSpec is quite confused about these bits (PPGTT text says one thing,
> GGTT text seems to have been copied from MOCS text, MOCS matches PTE
> for SNB, not for IVB, etc.). But unless we find someone who can tell
> us more, or we come up with some test to really verify this stuff I
> think it's as good as it gets.
> 
> So,
> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>

Queued for -next (with a tiny s/BUG/WARN_ON/ bikeshed applied), thanks for
the patch.
-Daniel
> 
> >  #define GEN6_PTE_ADDR_ENCODE(addr)	GEN6_GTT_ADDR_ENCODE(addr)
> >  #define HSW_PTE_ADDR_ENCODE(addr)	HSW_GTT_ADDR_ENCODE(addr)
> >  
> > @@ -56,15 +56,36 @@
> >  #define HSW_WB_LLC_AGE0			HSW_CACHEABILITY_CONTROL(0x3)
> >  #define HSW_WB_ELLC_LLC_AGE0		HSW_CACHEABILITY_CONTROL(0xb)
> >  
> > -static gen6_gtt_pte_t gen6_pte_encode(dma_addr_t addr,
> > -				      enum i915_cache_level level)
> > +static gen6_gtt_pte_t snb_pte_encode(dma_addr_t addr,
> > +				     enum i915_cache_level level)
> > +{
> > +	gen6_gtt_pte_t pte = GEN6_PTE_VALID;
> > +	pte |= GEN6_PTE_ADDR_ENCODE(addr);
> > +
> > +	switch (level) {
> > +	case I915_CACHE_L3_LLC:
> > +	case I915_CACHE_LLC:
> > +		pte |= GEN6_PTE_CACHE_LLC;
> > +		break;
> > +	case I915_CACHE_NONE:
> > +		pte |= GEN6_PTE_UNCACHED;
> > +		break;
> > +	default:
> > +		BUG();
> > +	}
> > +
> > +	return pte;
> > +}
> > +
> > +static gen6_gtt_pte_t ivb_pte_encode(dma_addr_t addr,
> > +				     enum i915_cache_level level)
> >  {
> >  	gen6_gtt_pte_t pte = GEN6_PTE_VALID;
> >  	pte |= GEN6_PTE_ADDR_ENCODE(addr);
> >  
> >  	switch (level) {
> > -	case I915_CACHE_LLC_MLC:
> > -		pte |= GEN6_PTE_CACHE_LLC_MLC;
> > +	case I915_CACHE_L3_LLC:
> > +		pte |= GEN7_PTE_CACHE_L3_LLC;
> >  		break;
> >  	case I915_CACHE_LLC:
> >  		pte |= GEN6_PTE_CACHE_LLC;
> > @@ -880,8 +901,10 @@ int i915_gem_gtt_init(struct drm_device *dev)
> >  			gtt->base.pte_encode = hsw_pte_encode;
> >  		else if (IS_VALLEYVIEW(dev))
> >  			gtt->base.pte_encode = byt_pte_encode;
> > +		else if (INTEL_INFO(dev)->gen >= 7)
> > +			gtt->base.pte_encode = ivb_pte_encode;
> >  		else
> > -			gtt->base.pte_encode = gen6_pte_encode;
> > +			gtt->base.pte_encode = snb_pte_encode;
> >  	}
> >  
> >  	ret = gtt->gtt_probe(dev, &gtt->base.total,
> > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> > index d970d84..8091485 100644
> > --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > @@ -938,8 +938,8 @@ const char *i915_cache_level_str(int type)
> >  {
> >  	switch (type) {
> >  	case I915_CACHE_NONE: return " uncached";
> > -	case I915_CACHE_LLC: return " snooped (LLC)";
> > -	case I915_CACHE_LLC_MLC: return " snooped (LLC+MLC)";
> > +	case I915_CACHE_LLC: return " snooped or LLC";
> > +	case I915_CACHE_L3_LLC: return " L3+LLC";
> >  	default: return "";
> >  	}
> >  }
> > -- 
> > 1.8.4.rc1
> > 
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx@lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
> -- 
> Ville Syrjälä
> Intel OTC
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] drm/i915: Update rules for writing through the LLC with the cpu
  2013-08-06 14:24   ` Ville Syrjälä
@ 2013-08-06 16:16     ` Chris Wilson
  2013-08-06 18:03       ` Ville Syrjälä
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Wilson @ 2013-08-06 16:16 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: intel-gfx

On Tue, Aug 06, 2013 at 05:24:25PM +0300, Ville Syrjälä wrote:
> On Tue, Aug 06, 2013 at 01:17:04PM +0100, Chris Wilson wrote:
> > +static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
> > +{
> > +	if (!cpu_cache_is_coherent(obj->base.dev, obj->cache_level))
> > +		return true;
> > +
> > +	return atomic_read(&obj->fb_count) > 0;
> > +}
> 
> I was thinking a bit about this fb_count thing. The other option would
> be to track actual scanout activity, which wouldn't be hard to do, but
> I guess that could shift a bunch of the clflush cost to page flip time.
> So maybe we don't want that?

Different series, see the frontbuffer write tracking. As for page flip,
that is currently the only time we clflush...
 
> Also i915_gem_sw_finish_ioctl() still looks at pin_count as an
> indication whether scanout is happening. We should add an fb_count
> check there as well. Maybe we should leave the pin_count check there,
> so that framebuffers that aren't being scanned out currently can be
> massaged a bit more freely before a clflush needs to be issued?

It does do a fb_count check, the pin_count + fb_count is a really strong
indicator of active scanout.

But sw_finish is truly legacy, so it just has to work and be as simple
as possible.

> > -	if (obj->cache_level == I915_CACHE_NONE &&
> > -	    obj->tiling_mode == I915_TILING_NONE &&
> > -	    obj->base.write_domain != I915_GEM_DOMAIN_CPU) {
> > +	if (obj->tiling_mode == I915_TILING_NONE &&
> > +	    !cpu_write_needs_clflush(obj)) {
> 
> This would break non-LLC platforms, I think. Before, GTT writes were
> used for non-snooped buffers w/ clean CPU caches (write domain != cpu),
> which makes sense since access via GTT doesn't snoop ever according
> to the docs. Now I think it'll do GTT writes for snooped non-scanout
> buffers.

D'oh. What I was aiming for was to always use shmem writes on snooped
bo, irrespective of whether it is GPU hot.

s/!cpu_write_needs_clflush/cpu_write_needs_clflush/

> 
> >  		ret = i915_gem_gtt_pwrite_fast(dev, obj, args, file);
> >  		/* Note that the gtt paths might fail with non-page-backed user
> >  		 * pointers (e.g. gtt mappings when moving data between
> > @@ -3299,7 +3305,7 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj)
> >  	 * snooping behaviour occurs naturally as the result of our domain
> >  	 * tracking.
> >  	 */
> > -	if (obj->cache_level != I915_CACHE_NONE)
> > +	if (!cpu_write_needs_clflush(obj))
> >  		return;
> 
> I would actually prefer to kill this check completely, and move the
> decision whether to do a clflush to the callers.

I saw, I differ in opinion.
 
> There are three places that I spotted which lack such a check;
> i915_gem_execbuffer_move_to_gpu() and i915_gem_object_set_to_cpu_domain()
> need to flush only when !cpu_cache_is_coherent(), whereas
> i915_gem_object_flush_cpu_write_domain() needs to flush for scanout as
> well.

I like the approach of trying to keep as much of the complexity inside
i915_gem.c and exporting simple rules to the callers.

> > diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> > index 0584480..cc5eaba 100644
> > --- a/drivers/gpu/drm/i915/intel_display.c
> > +++ b/drivers/gpu/drm/i915/intel_display.c
> > @@ -9443,6 +9443,8 @@ static void intel_user_framebuffer_destroy(struct drm_framebuffer *fb)
> >  	struct intel_framebuffer *intel_fb = to_intel_framebuffer(fb);
> >  
> >  	drm_framebuffer_cleanup(fb);
> > +
> > +	atomic_dec(&intel_fb->obj->fb_count);
> 
> Where did the the other atomic_dec() go (was in intel_fbdev_destroy())?

Missed. I blame the lack of intel_framebuffer_fini(), that would be a
better patch first.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/4] drm/i915: Update rules for reading cache lines through the LLC
  2013-08-06 13:31   ` Daniel Vetter
@ 2013-08-06 16:20     ` Chris Wilson
  0 siblings, 0 replies; 14+ messages in thread
From: Chris Wilson @ 2013-08-06 16:20 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On Tue, Aug 06, 2013 at 03:31:02PM +0200, Daniel Vetter wrote:
> Since this is a behavioural change wrt cache coherency can we please have
> an igt testcase to exercise pwrite/pread coherency on uncached buffers on
> LLC platforms? With the set_caching ioctl it should be fairly easy to add
> another subtest to the relevant existing igts.

Right, easy enough. Next step is play with scanout as well because we're
proposing to special case it.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] drm/i915: Update rules for writing through the LLC with the cpu
  2013-08-06 16:16     ` Chris Wilson
@ 2013-08-06 18:03       ` Ville Syrjälä
  2013-08-06 18:45         ` Chris Wilson
  0 siblings, 1 reply; 14+ messages in thread
From: Ville Syrjälä @ 2013-08-06 18:03 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On Tue, Aug 06, 2013 at 05:16:14PM +0100, Chris Wilson wrote:
> On Tue, Aug 06, 2013 at 05:24:25PM +0300, Ville Syrjälä wrote:
> > On Tue, Aug 06, 2013 at 01:17:04PM +0100, Chris Wilson wrote:
> > > +static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
> > > +{
> > > +	if (!cpu_cache_is_coherent(obj->base.dev, obj->cache_level))
> > > +		return true;
> > > +
> > > +	return atomic_read(&obj->fb_count) > 0;
> > > +}
> > 
> > I was thinking a bit about this fb_count thing. The other option would
> > be to track actual scanout activity, which wouldn't be hard to do, but
> > I guess that could shift a bunch of the clflush cost to page flip time.
> > So maybe we don't want that?
> 
> Different series, see the frontbuffer write tracking. As for page flip,
> that is currently the only time we clflush...
>  
> > Also i915_gem_sw_finish_ioctl() still looks at pin_count as an
> > indication whether scanout is happening. We should add an fb_count
> > check there as well. Maybe we should leave the pin_count check there,
> > so that framebuffers that aren't being scanned out currently can be
> > massaged a bit more freely before a clflush needs to be issued?
> 
> It does do a fb_count check, the pin_count + fb_count is a really strong
> indicator of active scanout.

The fb_count checks is done only through the eventual check in
i915_gem_clflush_object() which is why I didn't think of it.

On non-LLC platforms it also ends up doing the flush for all
non-snooped non-scanout buffers, which would get optimized away if the
fb_count check was where the pin_count check is.

> 
> But sw_finish is truly legacy, so it just has to work and be as simple
> as possible.

I see.

> 
> > > -	if (obj->cache_level == I915_CACHE_NONE &&
> > > -	    obj->tiling_mode == I915_TILING_NONE &&
> > > -	    obj->base.write_domain != I915_GEM_DOMAIN_CPU) {
> > > +	if (obj->tiling_mode == I915_TILING_NONE &&
> > > +	    !cpu_write_needs_clflush(obj)) {
> > 
> > This would break non-LLC platforms, I think. Before, GTT writes were
> > used for non-snooped buffers w/ clean CPU caches (write domain != cpu),
> > which makes sense since access via GTT doesn't snoop ever according
> > to the docs. Now I think it'll do GTT writes for snooped non-scanout
> > buffers.
> 
> D'oh. What I was aiming for was to always use shmem writes on snooped
> bo, irrespective of whether it is GPU hot.
> 
> s/!cpu_write_needs_clflush/cpu_write_needs_clflush/

So that means a GTT write for anything non-snooped or scanout. If it's
snooped scanout (I guess we may not really care about such a weird
combination), or non-snooped but in CPU write domain, that would
still have issues with possibly dirty data in the CPU caches.

Might be intersting to try on real hardware what the GTT access
snooping behaviour really is, especially for some modern non-LLC
system like VLV.

> 
> > 
> > >  		ret = i915_gem_gtt_pwrite_fast(dev, obj, args, file);
> > >  		/* Note that the gtt paths might fail with non-page-backed user
> > >  		 * pointers (e.g. gtt mappings when moving data between
> > > @@ -3299,7 +3305,7 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj)
> > >  	 * snooping behaviour occurs naturally as the result of our domain
> > >  	 * tracking.
> > >  	 */
> > > -	if (obj->cache_level != I915_CACHE_NONE)
> > > +	if (!cpu_write_needs_clflush(obj))
> > >  		return;
> > 
> > I would actually prefer to kill this check completely, and move the
> > decision whether to do a clflush to the callers.
> 
> I saw, I differ in opinion.

Opinion noted.

>  
> > There are three places that I spotted which lack such a check;
> > i915_gem_execbuffer_move_to_gpu() and i915_gem_object_set_to_cpu_domain()
> > need to flush only when !cpu_cache_is_coherent(), whereas
> > i915_gem_object_flush_cpu_write_domain() needs to flush for scanout as
> > well.
> 
> I like the approach of trying to keep as much of the complexity inside
> i915_gem.c and exporting simple rules to the callers.

OK. Looks like you already took care of i915_gem_object_set_to_cpu_domain()
in patch 2/4, and i915_gem_object_flush_cpu_write_domain() works OK with
the check in i915_gem_clflush_object().

So i915_gem_execbuffer_move_to_gpu() is the only one that may end up
doing some needless flushes with your code.

> 
> > > diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> > > index 0584480..cc5eaba 100644
> > > --- a/drivers/gpu/drm/i915/intel_display.c
> > > +++ b/drivers/gpu/drm/i915/intel_display.c
> > > @@ -9443,6 +9443,8 @@ static void intel_user_framebuffer_destroy(struct drm_framebuffer *fb)
> > >  	struct intel_framebuffer *intel_fb = to_intel_framebuffer(fb);
> > >  
> > >  	drm_framebuffer_cleanup(fb);
> > > +
> > > +	atomic_dec(&intel_fb->obj->fb_count);
> > 
> > Where did the the other atomic_dec() go (was in intel_fbdev_destroy())?
> 
> Missed. I blame the lack of intel_framebuffer_fini(), that would be a
> better patch first.
> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre

-- 
Ville Syrjälä
Intel OTC

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] drm/i915: Update rules for writing through the LLC with the cpu
  2013-08-06 18:03       ` Ville Syrjälä
@ 2013-08-06 18:45         ` Chris Wilson
  2013-08-06 19:08           ` Ville Syrjälä
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Wilson @ 2013-08-06 18:45 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: intel-gfx

On Tue, Aug 06, 2013 at 09:03:01PM +0300, Ville Syrjälä wrote:
> On Tue, Aug 06, 2013 at 05:16:14PM +0100, Chris Wilson wrote:
> > On Tue, Aug 06, 2013 at 05:24:25PM +0300, Ville Syrjälä wrote:
> > > On Tue, Aug 06, 2013 at 01:17:04PM +0100, Chris Wilson wrote:
> > > > +static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
> > > > +{
> > > > +	if (!cpu_cache_is_coherent(obj->base.dev, obj->cache_level))
> > > > +		return true;
> > > > +
> > > > +	return atomic_read(&obj->fb_count) > 0;
> > > > +}
> > > 
> > > I was thinking a bit about this fb_count thing. The other option would
> > > be to track actual scanout activity, which wouldn't be hard to do, but
> > > I guess that could shift a bunch of the clflush cost to page flip time.
> > > So maybe we don't want that?
> > 
> > Different series, see the frontbuffer write tracking. As for page flip,
> > that is currently the only time we clflush...
> >  
> > > Also i915_gem_sw_finish_ioctl() still looks at pin_count as an
> > > indication whether scanout is happening. We should add an fb_count
> > > check there as well. Maybe we should leave the pin_count check there,
> > > so that framebuffers that aren't being scanned out currently can be
> > > massaged a bit more freely before a clflush needs to be issued?
> > 
> > It does do a fb_count check, the pin_count + fb_count is a really strong
> > indicator of active scanout.
> 
> The fb_count checks is done only through the eventual check in
> i915_gem_clflush_object() which is why I didn't think of it.
> 
> On non-LLC platforms it also ends up doing the flush for all
> non-snooped non-scanout buffers, which would get optimized away if the
> fb_count check was where the pin_count check is.

But it does still do the fb_count check before the flush? What am I
missing here?

> > > > -	    obj->tiling_mode == I915_TILING_NONE &&
> > > > -	    obj->base.write_domain != I915_GEM_DOMAIN_CPU) {
> > > > +	if (obj->tiling_mode == I915_TILING_NONE &&
> > > > +	    !cpu_write_needs_clflush(obj)) {
> > > 
> > > This would break non-LLC platforms, I think. Before, GTT writes were
> > > used for non-snooped buffers w/ clean CPU caches (write domain != cpu),
> > > which makes sense since access via GTT doesn't snoop ever according
> > > to the docs. Now I think it'll do GTT writes for snooped non-scanout
> > > buffers.
> > 
> > D'oh. What I was aiming for was to always use shmem writes on snooped
> > bo, irrespective of whether it is GPU hot.
> > 
> > s/!cpu_write_needs_clflush/cpu_write_needs_clflush/
> 
> So that means a GTT write for anything non-snooped or scanout. If it's
> snooped scanout (I guess we may not really care about such a weird
> combination), or non-snooped but in CPU write domain, that would
> still have issues with possibly dirty data in the CPU caches.

Hmm, that should be fixed up by i915_gem_gtt_pwrite_fast() as we flush
all the caches before we start overwriting. I've thoroughly confused
myself over which paths do the fancy tricks. :(
 
> Might be intersting to try on real hardware what the GTT access
> snooping behaviour really is, especially for some modern non-LLC
> system like VLV.

I remember why I have the !DOMAIN_CPU test now. If we have already paid
the price to pull it back into the CPU domain, let it rest there until
it is used again.

So as it stands, I think we want:

if (obj->tiling_mode == I915_TILING_NONE &&
    obj->base.write_domain != I915_GEM_DOMAIN_CPU &&
    cpu_write_needs_clflush(obj))
	i915_gem_gtt_pwrite_fast();

> 
> > 
> > > 
> > > >  		ret = i915_gem_gtt_pwrite_fast(dev, obj, args, file);
> > > >  		/* Note that the gtt paths might fail with non-page-backed user
> > > >  		 * pointers (e.g. gtt mappings when moving data between
> > > > @@ -3299,7 +3305,7 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj)
> > > >  	 * snooping behaviour occurs naturally as the result of our domain
> > > >  	 * tracking.
> > > >  	 */
> > > > -	if (obj->cache_level != I915_CACHE_NONE)
> > > > +	if (!cpu_write_needs_clflush(obj))
> > > >  		return;
> > > 
> > > I would actually prefer to kill this check completely, and move the
> > > decision whether to do a clflush to the callers.
> > 
> > I saw, I differ in opinion.
> 
> Opinion noted.
> 
> >  
> > > There are three places that I spotted which lack such a check;
> > > i915_gem_execbuffer_move_to_gpu() and i915_gem_object_set_to_cpu_domain()
> > > need to flush only when !cpu_cache_is_coherent(), whereas
> > > i915_gem_object_flush_cpu_write_domain() needs to flush for scanout as
> > > well.
> > 
> > I like the approach of trying to keep as much of the complexity inside
> > i915_gem.c and exporting simple rules to the callers.
> 
> OK. Looks like you already took care of i915_gem_object_set_to_cpu_domain()
> in patch 2/4, and i915_gem_object_flush_cpu_write_domain() works OK with
> the check in i915_gem_clflush_object().
> 
> So i915_gem_execbuffer_move_to_gpu() is the only one that may end up
> doing some needless flushes with your code.

Gotcha. In execbuffer, we will do a clflush on any object with fb_count
and write_domain=CPU. I'm not sure that is a flush worth eliminating,
but that might change with GFDT. But for GFDT, we can just eliminate the
clflush altogether and rely on userspace notifying the kernel it needs to
flush via dirty_fb. However, even with GFDT I don't think it will be a
good idea for frontbuffer writes as the cache eviction is still likely
to cause blotches as the screen updates at different times.

Hmm, on the other hand, a CPU upload into the backbuffer is possible, as
userspace may be blissfully unaware that the buffer is actually an
intel_fb.  Rendering after that would indeed cause needless clflushes.

I'll take another look.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] drm/i915: Update rules for writing through the LLC with the cpu
  2013-08-06 18:45         ` Chris Wilson
@ 2013-08-06 19:08           ` Ville Syrjälä
  2013-08-06 19:26             ` Chris Wilson
  0 siblings, 1 reply; 14+ messages in thread
From: Ville Syrjälä @ 2013-08-06 19:08 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On Tue, Aug 06, 2013 at 07:45:40PM +0100, Chris Wilson wrote:
> On Tue, Aug 06, 2013 at 09:03:01PM +0300, Ville Syrjälä wrote:
> > On Tue, Aug 06, 2013 at 05:16:14PM +0100, Chris Wilson wrote:
> > > On Tue, Aug 06, 2013 at 05:24:25PM +0300, Ville Syrjälä wrote:
> > > > On Tue, Aug 06, 2013 at 01:17:04PM +0100, Chris Wilson wrote:
> > > > > +static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
> > > > > +{
> > > > > +	if (!cpu_cache_is_coherent(obj->base.dev, obj->cache_level))
> > > > > +		return true;
> > > > > +
> > > > > +	return atomic_read(&obj->fb_count) > 0;
> > > > > +}
> > > > 
> > > > I was thinking a bit about this fb_count thing. The other option would
> > > > be to track actual scanout activity, which wouldn't be hard to do, but
> > > > I guess that could shift a bunch of the clflush cost to page flip time.
> > > > So maybe we don't want that?
> > > 
> > > Different series, see the frontbuffer write tracking. As for page flip,
> > > that is currently the only time we clflush...
> > >  
> > > > Also i915_gem_sw_finish_ioctl() still looks at pin_count as an
> > > > indication whether scanout is happening. We should add an fb_count
> > > > check there as well. Maybe we should leave the pin_count check there,
> > > > so that framebuffers that aren't being scanned out currently can be
> > > > massaged a bit more freely before a clflush needs to be issued?
> > > 
> > > It does do a fb_count check, the pin_count + fb_count is a really strong
> > > indicator of active scanout.
> > 
> > The fb_count checks is done only through the eventual check in
> > i915_gem_clflush_object() which is why I didn't think of it.
> > 
> > On non-LLC platforms it also ends up doing the flush for all
> > non-snooped non-scanout buffers, which would get optimized away if the
> > fb_count check was where the pin_count check is.
> 
> But it does still do the fb_count check before the flush? What am I
> missing here?

I believe the code now ends up doing this (on non-LLC):
 if (pin_count && (cache_level == UC || fb_count > 0) && write_domain == CPU)
   clflush();

So it'll flush also all pinned non-snooped buffers whether they be
scanout or not. We could delay such flushes until the bo gets
moved away from the CPU write domain. But maybe it's just a nonsense
scenario. Why would someone write to a pinned bo unless the pinning is
due to scanout?

> > > > > -	    obj->tiling_mode == I915_TILING_NONE &&
> > > > > -	    obj->base.write_domain != I915_GEM_DOMAIN_CPU) {
> > > > > +	if (obj->tiling_mode == I915_TILING_NONE &&
> > > > > +	    !cpu_write_needs_clflush(obj)) {
> > > > 
> > > > This would break non-LLC platforms, I think. Before, GTT writes were
> > > > used for non-snooped buffers w/ clean CPU caches (write domain != cpu),
> > > > which makes sense since access via GTT doesn't snoop ever according
> > > > to the docs. Now I think it'll do GTT writes for snooped non-scanout
> > > > buffers.
> > > 
> > > D'oh. What I was aiming for was to always use shmem writes on snooped
> > > bo, irrespective of whether it is GPU hot.
> > > 
> > > s/!cpu_write_needs_clflush/cpu_write_needs_clflush/
> > 
> > So that means a GTT write for anything non-snooped or scanout. If it's
> > snooped scanout (I guess we may not really care about such a weird
> > combination), or non-snooped but in CPU write domain, that would
> > still have issues with possibly dirty data in the CPU caches.
> 
> Hmm, that should be fixed up by i915_gem_gtt_pwrite_fast() as we flush
> all the caches before we start overwriting. I've thoroughly confused
> myself over which paths do the fancy tricks. :(
>  
> > Might be intersting to try on real hardware what the GTT access
> > snooping behaviour really is, especially for some modern non-LLC
> > system like VLV.
> 
> I remember why I have the !DOMAIN_CPU test now. If we have already paid
> the price to pull it back into the CPU domain, let it rest there until
> it is used again.
> 
> So as it stands, I think we want:
> 
> if (obj->tiling_mode == I915_TILING_NONE &&
>     obj->base.write_domain != I915_GEM_DOMAIN_CPU &&
>     cpu_write_needs_clflush(obj))
> 	i915_gem_gtt_pwrite_fast();

Yeah that looks better to me at least.

-- 
Ville Syrjälä
Intel OTC

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] drm/i915: Update rules for writing through the LLC with the cpu
  2013-08-06 19:08           ` Ville Syrjälä
@ 2013-08-06 19:26             ` Chris Wilson
  0 siblings, 0 replies; 14+ messages in thread
From: Chris Wilson @ 2013-08-06 19:26 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: intel-gfx

On Tue, Aug 06, 2013 at 10:08:38PM +0300, Ville Syrjälä wrote:
> On Tue, Aug 06, 2013 at 07:45:40PM +0100, Chris Wilson wrote:
> > On Tue, Aug 06, 2013 at 09:03:01PM +0300, Ville Syrjälä wrote:
> > > On Tue, Aug 06, 2013 at 05:16:14PM +0100, Chris Wilson wrote:
> > > > On Tue, Aug 06, 2013 at 05:24:25PM +0300, Ville Syrjälä wrote:
> > > > > Also i915_gem_sw_finish_ioctl() still looks at pin_count as an
> > > > > indication whether scanout is happening. We should add an fb_count
> > > > > check there as well. Maybe we should leave the pin_count check there,
> > > > > so that framebuffers that aren't being scanned out currently can be
> > > > > massaged a bit more freely before a clflush needs to be issued?
> > > > 
> > > > It does do a fb_count check, the pin_count + fb_count is a really strong
> > > > indicator of active scanout.
> > > 
> > > The fb_count checks is done only through the eventual check in
> > > i915_gem_clflush_object() which is why I didn't think of it.
> > > 
> > > On non-LLC platforms it also ends up doing the flush for all
> > > non-snooped non-scanout buffers, which would get optimized away if the
> > > fb_count check was where the pin_count check is.
> > 
> > But it does still do the fb_count check before the flush? What am I
> > missing here?
> 
> I believe the code now ends up doing this (on non-LLC):
>  if (pin_count && (cache_level == UC || fb_count > 0) && write_domain == CPU)
>    clflush();
> 
> So it'll flush also all pinned non-snooped buffers whether they be
> scanout or not. We could delay such flushes until the bo gets
> moved away from the CPU write domain. But maybe it's just a nonsense
> scenario. Why would someone write to a pinned bo unless the pinning is
> due to scanout?

Right. So long as this patch doesn't make it any worse than clflushing
all pinned objects, I'm inclined to consider it an improvement. The
final step to only clflushing when attached to an active CRTC is an
optimisation made in the frontbuffer tracking series.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2013-08-06 19:26 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-08-06 12:17 [PATCH 1/4] drm/i915: Rename I915_CACHE_MLC_LLC to L3_LLC for Ivybridge Chris Wilson
2013-08-06 12:17 ` [PATCH 2/4] drm/i915: Update rules for reading cache lines through the LLC Chris Wilson
2013-08-06 13:31   ` Daniel Vetter
2013-08-06 16:20     ` Chris Wilson
2013-08-06 12:17 ` [PATCH 3/4] drm/i915: Update rules for writing through the LLC with the cpu Chris Wilson
2013-08-06 14:24   ` Ville Syrjälä
2013-08-06 16:16     ` Chris Wilson
2013-08-06 18:03       ` Ville Syrjälä
2013-08-06 18:45         ` Chris Wilson
2013-08-06 19:08           ` Ville Syrjälä
2013-08-06 19:26             ` Chris Wilson
2013-08-06 12:17 ` [PATCH 4/4] drm/i915: Only do a chipset flush after a clflush Chris Wilson
2013-08-06 14:25 ` [PATCH 1/4] drm/i915: Rename I915_CACHE_MLC_LLC to L3_LLC for Ivybridge Ville Syrjälä
2013-08-06 14:36   ` Daniel Vetter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).