* [PATCH 1/4] drm/cache: Use wbinvd helpers
[not found] <1418526504-26316-1-git-send-email-benjamin.widawsky@intel.com>
@ 2014-12-14 3:08 ` Ben Widawsky
2014-12-14 12:43 ` Chris Wilson
2014-12-15 20:26 ` [PATCH] [v2] " Ben Widawsky
2014-12-14 3:08 ` [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86 Ben Widawsky
` (2 subsequent siblings)
3 siblings, 2 replies; 24+ messages in thread
From: Ben Widawsky @ 2014-12-14 3:08 UTC (permalink / raw)
To: DRI Development; +Cc: Intel GFX, Ben Widawsky, Ben Widawsky
When the original drm code was written there were no centralized helpers for
doing a coordinated wbinvd across all CPUs. Since 2010 there are, so use
them instead of open-coding our own.
Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
drivers/gpu/drm/drm_cache.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index a6b6906..d7797e8 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -64,12 +64,6 @@ static void drm_cache_flush_clflush(struct page *pages[],
drm_clflush_page(*pages++);
mb();
}
-
-static void
-drm_clflush_ipi_handler(void *null)
-{
- wbinvd();
-}
#endif
void
@@ -82,7 +76,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
return;
}
- if (on_each_cpu(drm_clflush_ipi_handler, NULL, 1) != 0)
+ if (wbinvd_on_all_cpus())
printk(KERN_ERR "Timed out waiting for cache flush.\n");
#elif defined(__powerpc__)
@@ -121,7 +115,7 @@ drm_clflush_sg(struct sg_table *st)
return;
}
- if (on_each_cpu(drm_clflush_ipi_handler, NULL, 1) != 0)
+ if (wbinvd_on_all_cpus())
printk(KERN_ERR "Timed out waiting for cache flush.\n");
#else
printk(KERN_ERR "Architecture has no drm_cache.c support\n");
@@ -144,7 +138,7 @@ drm_clflush_virt_range(void *addr, unsigned long length)
return;
}
- if (on_each_cpu(drm_clflush_ipi_handler, NULL, 1) != 0)
+ if (wbinvd_on_all_cpus())
printk(KERN_ERR "Timed out waiting for cache flush.\n");
#else
printk(KERN_ERR "Architecture has no drm_cache.c support\n");
--
2.1.3
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
* [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86
[not found] <1418526504-26316-1-git-send-email-benjamin.widawsky@intel.com>
2014-12-14 3:08 ` [PATCH 1/4] drm/cache: Use wbinvd helpers Ben Widawsky
@ 2014-12-14 3:08 ` Ben Widawsky
2014-12-14 4:15 ` Matt Turner
2014-12-14 12:59 ` [Intel-gfx] " Chris Wilson
2014-12-14 3:08 ` [PATCH 3/4] drm/cache: Return what type of cache flush occurred Ben Widawsky
2014-12-14 3:08 ` [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf Ben Widawsky
3 siblings, 2 replies; 24+ messages in thread
From: Ben Widawsky @ 2014-12-14 3:08 UTC (permalink / raw)
To: DRI Development; +Cc: Intel GFX, Ben Widawsky, Ben Widawsky
Any GEM driver which has very large objects and a slow CPU is subject to very
long waits simply for clflushing incoherent objects. Generally, each individual
object is not a problem, but if you have very large objects, or very many
objects, the flushing begins to show up in profiles. Because on x86 we know the
cache size, we can easily determine when an object would use the entire cache,
and forgo iterating over each cacheline.
We need to be careful when using wbinvd. wbinvd() is itself potentially slow
because it requires synchronizing the flush across all CPUs so they have a
coherent view of memory. This can result in either stalling work being done on
other CPUs, or this call itself stalling while waiting for a CPU to accept the
interrupt. wbinvd() also has the downside of invalidating all cachelines, so we
don't want to use it unless we're sure we already own most of the cachelines.
The current algorithm is very naive. I think it can be tweaked more, and it
would be good if someone else gave it some thought. I am pretty confident that
in i915 we can even skip the IPI in the execbuf path with minimal code change
(or perhaps just some verification of the existing code). It would be nice to
hear what other developers who depend on this code think.
Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index d7797e8..6009c2d 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -64,6 +64,20 @@ static void drm_cache_flush_clflush(struct page *pages[],
drm_clflush_page(*pages++);
mb();
}
+
+static bool
+drm_cache_should_clflush(unsigned long num_pages)
+{
+ const int cache_size = boot_cpu_data.x86_cache_size;
+
+ /* For now the algorithm simply checks if the number of pages to be
+ * flushed is greater than the entire system cache. One could make the
+ * function more aware of the actual system (ie. if SMP, how large is
+ * the cache, CPU freq. etc. All those help to determine when to
+ * wbinvd() */
+ WARN_ON_ONCE(!cache_size);
+ return !cache_size || num_pages < (cache_size >> 2);
+}
#endif
void
@@ -71,7 +85,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
{
#if defined(CONFIG_X86)
- if (cpu_has_clflush) {
+ if (cpu_has_clflush && drm_cache_should_clflush(num_pages)) {
drm_cache_flush_clflush(pages, num_pages);
return;
}
@@ -104,7 +118,7 @@ void
drm_clflush_sg(struct sg_table *st)
{
#if defined(CONFIG_X86)
- if (cpu_has_clflush) {
+ if (cpu_has_clflush && drm_cache_should_clflush(st->nents)) {
struct sg_page_iter sg_iter;
mb();
@@ -128,7 +142,7 @@ void
drm_clflush_virt_range(void *addr, unsigned long length)
{
#if defined(CONFIG_X86)
- if (cpu_has_clflush) {
+ if (cpu_has_clflush && drm_cache_should_clflush(length / PAGE_SIZE)) {
void *end = addr + length;
mb();
for (; addr < end; addr += boot_cpu_data.x86_clflush_size)
--
2.1.3
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel
* [PATCH 3/4] drm/cache: Return what type of cache flush occurred
[not found] <1418526504-26316-1-git-send-email-benjamin.widawsky@intel.com>
2014-12-14 3:08 ` [PATCH 1/4] drm/cache: Use wbinvd helpers Ben Widawsky
2014-12-14 3:08 ` [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86 Ben Widawsky
@ 2014-12-14 3:08 ` Ben Widawsky
2014-12-14 3:08 ` [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf Ben Widawsky
3 siblings, 0 replies; 24+ messages in thread
From: Ben Widawsky @ 2014-12-14 3:08 UTC (permalink / raw)
To: DRI Development; +Cc: Intel GFX, Ben Widawsky, Ben Widawsky
The caller of the cache flush APIs can sometimes do useful things with the
information of how the cache was flushed. For instance, when giving buffers to
the GPU to read, we need to make sure all of them have properly invalidated the
caches (when not using LLC). If we can determine a wbinvd() occurred though, we
can skip trying to clflush all remaining objects.
There is a further optimization to be made here in the driver-specific code,
where it can try to flush the largest object first in the hope that it triggers
a wbinvd(), letting the remaining flushes be skipped. I haven't implemented
that yet.
The enum values were only minimally considered; they exist mainly to get the
data needed for profiling.
Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
drivers/gpu/drm/drm_cache.c | 34 +++++++++++++++++++++++++---------
include/drm/drmP.h | 13 ++++++++++---
2 files changed, 35 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index 6009c2d..433b15d 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -80,21 +80,25 @@ drm_cache_should_clflush(unsigned long num_pages)
}
#endif
-void
+int
drm_clflush_pages(struct page *pages[], unsigned long num_pages)
{
#if defined(CONFIG_X86)
if (cpu_has_clflush && drm_cache_should_clflush(num_pages)) {
drm_cache_flush_clflush(pages, num_pages);
- return;
+ return DRM_CACHE_FLUSH_CL;
}
- if (wbinvd_on_all_cpus())
+ if (wbinvd_on_all_cpus()) {
printk(KERN_ERR "Timed out waiting for cache flush.\n");
+ return DRM_CACHE_FLUSH_ERROR;
+ } else
+ return DRM_CACHE_FLUSH_WBINVD;
#elif defined(__powerpc__)
unsigned long i;
+ int ret = DRM_CACHE_FLUSH_NONE;
for (i = 0; i < num_pages; i++) {
struct page *page = pages[i];
void *page_virtual;
@@ -106,15 +110,19 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
flush_dcache_range((unsigned long)page_virtual,
(unsigned long)page_virtual + PAGE_SIZE);
kunmap_atomic(page_virtual);
+ ret = DRM_CACHE_FLUSH_DCACHE;
}
+ WARN_ON(ret == DRM_CACHE_FLUSH_NONE);
+ return ret;
#else
printk(KERN_ERR "Architecture has no drm_cache.c support\n");
WARN_ON_ONCE(1);
+ return DRM_CACHE_FLUSH_NONE;
#endif
}
EXPORT_SYMBOL(drm_clflush_pages);
-void
+int
drm_clflush_sg(struct sg_table *st)
{
#if defined(CONFIG_X86)
@@ -126,19 +134,23 @@ drm_clflush_sg(struct sg_table *st)
drm_clflush_page(sg_page_iter_page(&sg_iter));
mb();
- return;
+ return DRM_CACHE_FLUSH_CL;
}
- if (wbinvd_on_all_cpus())
+ if (wbinvd_on_all_cpus()) {
printk(KERN_ERR "Timed out waiting for cache flush.\n");
+ return DRM_CACHE_FLUSH_ERROR;
+ } else
+ return DRM_CACHE_FLUSH_WBINVD;
#else
printk(KERN_ERR "Architecture has no drm_cache.c support\n");
WARN_ON_ONCE(1);
+ return DRM_CACHE_FLUSH_NONE;
#endif
}
EXPORT_SYMBOL(drm_clflush_sg);
-void
+int
drm_clflush_virt_range(void *addr, unsigned long length)
{
#if defined(CONFIG_X86)
@@ -149,14 +161,18 @@ drm_clflush_virt_range(void *addr, unsigned long length)
clflushopt(addr);
clflushopt(end - 1);
mb();
- return;
+ return DRM_CACHE_FLUSH_CL;
}
- if (wbinvd_on_all_cpus())
+ if (wbinvd_on_all_cpus()) {
printk(KERN_ERR "Timed out waiting for cache flush.\n");
+ return DRM_CACHE_FLUSH_ERROR;
+ } else
+ return DRM_CACHE_FLUSH_WBINVD;
#else
printk(KERN_ERR "Architecture has no drm_cache.c support\n");
WARN_ON_ONCE(1);
+ return DRM_CACHE_FLUSH_NONE;
#endif
}
EXPORT_SYMBOL(drm_clflush_virt_range);
diff --git a/include/drm/drmP.h b/include/drm/drmP.h
index 8ba35c6..09ebe46 100644
--- a/include/drm/drmP.h
+++ b/include/drm/drmP.h
@@ -884,9 +884,16 @@ int drm_noop(struct drm_device *dev, void *data,
struct drm_file *file_priv);
/* Cache management (drm_cache.c) */
-void drm_clflush_pages(struct page *pages[], unsigned long num_pages);
-void drm_clflush_sg(struct sg_table *st);
-void drm_clflush_virt_range(void *addr, unsigned long length);
+enum drm_cache_flush {
+ DRM_CACHE_FLUSH_NONE=0,
+ DRM_CACHE_FLUSH_ERROR,
+ DRM_CACHE_FLUSH_CL,
+ DRM_CACHE_FLUSH_WBINVD,
+ DRM_CACHE_FLUSH_DCACHE,
+};
+int drm_clflush_pages(struct page *pages[], unsigned long num_pages);
+int drm_clflush_sg(struct sg_table *st);
+int drm_clflush_virt_range(void *addr, unsigned long length);
/*
* These are exported to drivers so that they can implement fencing using
--
2.1.3
* [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf
[not found] <1418526504-26316-1-git-send-email-benjamin.widawsky@intel.com>
` (2 preceding siblings ...)
2014-12-14 3:08 ` [PATCH 3/4] drm/cache: Return what type of cache flush occurred Ben Widawsky
@ 2014-12-14 3:08 ` Ben Widawsky
2014-12-14 9:35 ` shuang.he
2014-12-14 13:12 ` [Intel-gfx] " Ville Syrjälä
3 siblings, 2 replies; 24+ messages in thread
From: Ben Widawsky @ 2014-12-14 3:08 UTC (permalink / raw)
To: DRI Development, Intel GFX; +Cc: Ben Widawsky, Ben Widawsky
If we're moving a bunch of buffers from the CPU domain to the GPU domain, and
we've already blown out the entire cache via a wbinvd, there is nothing more to
do.
With this and the previous patches, I am seeing a 3x FPS increase on a certain
benchmark which uses a giant 2d array texture. Unless I missed something in the
code, it should only affect non-LLC i915 platforms.
I haven't yet run any numbers for other benchmarks, nor have I attempted to
check if various conformance tests still pass.
NOTE: As mentioned in the previous patch, if one can easily obtain the largest
buffer and attempt to flush it first, the results would be even more desirable.
Cc: DRI Development <dri-devel@lists.freedesktop.org>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
drivers/gpu/drm/i915/i915_drv.h | 3 ++-
drivers/gpu/drm/i915/i915_gem.c | 12 +++++-------
drivers/gpu/drm/i915/i915_gem_execbuffer.c | 8 +++++---
drivers/gpu/drm/i915/intel_lrc.c | 8 +++++---
4 files changed, 17 insertions(+), 14 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d68c75f..fdb92a3 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2642,7 +2642,8 @@ static inline bool i915_stop_ring_allow_warn(struct drm_i915_private *dev_priv)
}
void i915_gem_reset(struct drm_device *dev);
-bool i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
+enum drm_cache_flush
+i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
int __must_check i915_gem_object_finish_gpu(struct drm_i915_gem_object *obj);
int __must_check i915_gem_init(struct drm_device *dev);
int i915_gem_init_rings(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index de241eb..3746738 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3608,7 +3608,7 @@ err_unpin:
return vma;
}
-bool
+enum drm_cache_flush
i915_gem_clflush_object(struct drm_i915_gem_object *obj,
bool force)
{
@@ -3617,14 +3617,14 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj,
* again at bind time.
*/
if (obj->pages == NULL)
- return false;
+ return DRM_CACHE_FLUSH_NONE;
/*
* Stolen memory is always coherent with the GPU as it is explicitly
* marked as wc by the system, or the system is cache-coherent.
*/
if (obj->stolen || obj->phys_handle)
- return false;
+ return DRM_CACHE_FLUSH_NONE;
/* If the GPU is snooping the contents of the CPU cache,
* we do not need to manually clear the CPU cache lines. However,
@@ -3635,12 +3635,10 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj,
* tracking.
*/
if (!force && cpu_cache_is_coherent(obj->base.dev, obj->cache_level))
- return false;
+ return DRM_CACHE_FLUSH_NONE;
trace_i915_gem_object_clflush(obj);
- drm_clflush_sg(obj->pages);
-
- return true;
+ return drm_clflush_sg(obj->pages);
}
/** Flushes the GTT write domain for the object if it's dirty. */
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 0c25f62..e8eb9e9 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -827,7 +827,7 @@ i915_gem_execbuffer_move_to_gpu(struct intel_engine_cs *ring,
{
struct i915_vma *vma;
uint32_t flush_domains = 0;
- bool flush_chipset = false;
+ enum drm_cache_flush flush_chipset = DRM_CACHE_FLUSH_NONE;
int ret;
list_for_each_entry(vma, vmas, exec_list) {
@@ -836,8 +836,10 @@ i915_gem_execbuffer_move_to_gpu(struct intel_engine_cs *ring,
if (ret)
return ret;
- if (obj->base.write_domain & I915_GEM_DOMAIN_CPU)
- flush_chipset |= i915_gem_clflush_object(obj, false);
+ if (obj->base.write_domain & I915_GEM_DOMAIN_CPU &&
+ flush_chipset != DRM_CACHE_FLUSH_WBINVD) {
+ flush_chipset = i915_gem_clflush_object(obj, false);
+ }
flush_domains |= obj->base.write_domain;
}
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 89b5577..a6c6ebd 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -611,7 +611,7 @@ static int execlists_move_to_gpu(struct intel_ringbuffer *ringbuf,
struct intel_engine_cs *ring = ringbuf->ring;
struct i915_vma *vma;
uint32_t flush_domains = 0;
- bool flush_chipset = false;
+ enum drm_cache_flush flush_chipset = DRM_CACHE_FLUSH_NONE;
int ret;
list_for_each_entry(vma, vmas, exec_list) {
@@ -621,8 +621,10 @@ static int execlists_move_to_gpu(struct intel_ringbuffer *ringbuf,
if (ret)
return ret;
- if (obj->base.write_domain & I915_GEM_DOMAIN_CPU)
- flush_chipset |= i915_gem_clflush_object(obj, false);
+ if (obj->base.write_domain & I915_GEM_DOMAIN_CPU &&
+ flush_chipset != DRM_CACHE_FLUSH_WBINVD) {
+ flush_chipset = i915_gem_clflush_object(obj, false);
+ }
flush_domains |= obj->base.write_domain;
}
--
2.1.3
* Re: [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86
2014-12-14 3:08 ` [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86 Ben Widawsky
@ 2014-12-14 4:15 ` Matt Turner
2014-12-15 22:07 ` Ben Widawsky
2014-12-14 12:59 ` [Intel-gfx] " Chris Wilson
1 sibling, 1 reply; 24+ messages in thread
From: Matt Turner @ 2014-12-14 4:15 UTC (permalink / raw)
To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky, DRI Development
On Sat, Dec 13, 2014 at 7:08 PM, Ben Widawsky
<benjamin.widawsky@intel.com> wrote:
> Any GEM driver which has very large objects and a slow CPU is subject to very
> long waits simply for clflushing incoherent objects. Generally, each individual
> object is not a problem, but if you have very large objects, or very many
> objects, the flushing begins to show up in profiles. Because on x86 we know the
> cache size, we can easily determine when an object will use all the cache, and
> forego iterating over each cacheline.
>
> We need to be careful when using wbinvd. wbinvd() is itself potentially slow
> because it requires synchronizing the flush across all CPUs so they have a
> coherent view of memory. This can result in either stalling work being done on
> other CPUs, or this call itself stalling while waiting for a CPU to accept the
> interrupt. Also, wbinvd() also has the downside of invalidating all cachelines,
> so we don't want to use it unless we're sure we already own most of the
> cachelines.
>
> The current algorithm is very naive. I think it can be tweaked more, and it
> would be good if someone else gave it some thought. I am pretty confident in
> i915, we can even skip the IPI in the execbuf path with minimal code change (or
> perhaps just some verifying of the existing code). It would be nice to hear what
> other developers who depend on this code think.
>
> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> ---
> drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
> 1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
> index d7797e8..6009c2d 100644
> --- a/drivers/gpu/drm/drm_cache.c
> +++ b/drivers/gpu/drm/drm_cache.c
> @@ -64,6 +64,20 @@ static void drm_cache_flush_clflush(struct page *pages[],
> drm_clflush_page(*pages++);
> mb();
> }
> +
> +static bool
> +drm_cache_should_clflush(unsigned long num_pages)
> +{
> + const int cache_size = boot_cpu_data.x86_cache_size;
> +
> + /* For now the algorithm simply checks if the number of pages to be
> + * flushed is greater than the entire system cache. One could make the
> + * function more aware of the actual system (ie. if SMP, how large is
> + * the cache, CPU freq. etc. All those help to determine when to
> + * wbinvd() */
> + WARN_ON_ONCE(!cache_size);
> + return !cache_size || num_pages < (cache_size >> 2);
> +}
> #endif
>
> void
> @@ -71,7 +85,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
> {
>
> #if defined(CONFIG_X86)
> - if (cpu_has_clflush) {
> + if (cpu_has_clflush && drm_cache_should_clflush(num_pages)) {
> drm_cache_flush_clflush(pages, num_pages);
> return;
> }
> @@ -104,7 +118,7 @@ void
> drm_clflush_sg(struct sg_table *st)
> {
> #if defined(CONFIG_X86)
> - if (cpu_has_clflush) {
> + if (cpu_has_clflush && drm_cache_should_clflush(st->nents)) {
> struct sg_page_iter sg_iter;
>
> mb();
> @@ -128,7 +142,7 @@ void
> drm_clflush_virt_range(void *addr, unsigned long length)
> {
> #if defined(CONFIG_X86)
> - if (cpu_has_clflush) {
> + if (cpu_has_clflush && drm_cache_should_clflush(length / PAGE_SIZE)) {
If length isn't a multiple of page size, isn't this ignoring the
remainder? Should it be rounding length up to the next multiple of
PAGE_SIZE, like ROUND_UP_TO?
* Re: [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf
2014-12-14 3:08 ` [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf Ben Widawsky
@ 2014-12-14 9:35 ` shuang.he
2014-12-14 13:12 ` [Intel-gfx] " Ville Syrjälä
1 sibling, 0 replies; 24+ messages in thread
From: shuang.he @ 2014-12-14 9:35 UTC (permalink / raw)
To: shuang.he, intel-gfx, benjamin.widawsky
Tested-By: PRC QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
-------------------------------------Summary-------------------------------------
Platform Delta drm-intel-nightly Series Applied
PNV 364/364 364/364
ILK +3 362/366 365/366
SNB 448/450 448/450
IVB 497/498 497/498
BYT 289/289 289/289
HSW 563/564 563/564
BDW 417/417 417/417
-------------------------------------Detailed-------------------------------------
Platform Test drm-intel-nightly Series Applied
ILK igt_kms_flip_nonexisting-fb DMESG_WARN(1, M26)PASS(5, M37M26) PASS(1, M37)
ILK igt_kms_flip_rcs-flip-vs-panning-interruptible DMESG_WARN(2, M26)PASS(4, M37M26) PASS(1, M37)
ILK igt_kms_flip_rcs-wf_vblank-vs-dpms-interruptible DMESG_WARN(1, M26)PASS(4, M26M37) PASS(1, M37)
Note: You need to pay more attention to line start with '*'
* Re: [PATCH 1/4] drm/cache: Use wbinvd helpers
2014-12-14 3:08 ` [PATCH 1/4] drm/cache: Use wbinvd helpers Ben Widawsky
@ 2014-12-14 12:43 ` Chris Wilson
2014-12-15 7:49 ` Daniel Vetter
2014-12-15 20:26 ` [PATCH] [v2] " Ben Widawsky
1 sibling, 1 reply; 24+ messages in thread
From: Chris Wilson @ 2014-12-14 12:43 UTC (permalink / raw)
To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky, DRI Development
On Sat, Dec 13, 2014 at 07:08:21PM -0800, Ben Widawsky wrote:
> When the original drm code was written there were no centralized functions for
> doing a coordinated wbinvd across all CPUs. Now (since 2010) there are, so use
> them instead of rolling a new one.
>
> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [Intel-gfx] [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86
2014-12-14 3:08 ` [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86 Ben Widawsky
2014-12-14 4:15 ` Matt Turner
@ 2014-12-14 12:59 ` Chris Wilson
2014-12-15 4:06 ` Jesse Barnes
1 sibling, 1 reply; 24+ messages in thread
From: Chris Wilson @ 2014-12-14 12:59 UTC (permalink / raw)
To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky, DRI Development
On Sat, Dec 13, 2014 at 07:08:22PM -0800, Ben Widawsky wrote:
> Any GEM driver which has very large objects and a slow CPU is subject to very
> long waits simply for clflushing incoherent objects. Generally, each individual
> object is not a problem, but if you have very large objects, or very many
> objects, the flushing begins to show up in profiles. Because on x86 we know the
> cache size, we can easily determine when an object will use all the cache, and
> forego iterating over each cacheline.
>
> We need to be careful when using wbinvd. wbinvd() is itself potentially slow
> because it requires synchronizing the flush across all CPUs so they have a
> coherent view of memory. This can result in either stalling work being done on
> other CPUs, or this call itself stalling while waiting for a CPU to accept the
> interrupt. Also, wbinvd() also has the downside of invalidating all cachelines,
> so we don't want to use it unless we're sure we already own most of the
> cachelines.
>
> The current algorithm is very naive. I think it can be tweaked more, and it
> would be good if someone else gave it some thought. I am pretty confident in
> i915, we can even skip the IPI in the execbuf path with minimal code change (or
> perhaps just some verifying of the existing code). It would be nice to hear what
> other developers who depend on this code think.
One of the things wbinvd is considered evil for is that it blocks the
CPU for an indeterminate amount of time - upsetting latency-critical
aspects of the OS. For example, x86/mm has similar code to use
wbinvd for large clflushes that caused a bit of controversy with RT:
http://linux-kernel.2935.n7.nabble.com/PATCH-x86-Use-clflush-instead-of-wbinvd-whenever-possible-when-changing-mapping-td493751.html
and the use of wbinvd in the nvidia driver has likewise been noted as
evil by RT folks.
However, as wbinvd still exists, it can't be all that bad...
> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> ---
> drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
> 1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
> index d7797e8..6009c2d 100644
> --- a/drivers/gpu/drm/drm_cache.c
> +++ b/drivers/gpu/drm/drm_cache.c
> @@ -64,6 +64,20 @@ static void drm_cache_flush_clflush(struct page *pages[],
> drm_clflush_page(*pages++);
> mb();
> }
> +
> +static bool
> +drm_cache_should_clflush(unsigned long num_pages)
> +{
> + const int cache_size = boot_cpu_data.x86_cache_size;
Maybe const int cpu_cache_size = boot_cpu_data.x86_cache_size >> 2; /* in pages */
Just to make it clear where the factor of 4 is required.
How standalone is this patch? What happens if you just apply this by
itself? I presume it wasn't all that effective since you needed the
additional patches to prevent superfluous flushes. But it should have an
effect to reduce the time it takes to bind framebuffers, etc. I expect
it to overall worsen performance as we do the repeated wbinvd in
execbuffer.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [Intel-gfx] [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf
2014-12-14 3:08 ` [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf Ben Widawsky
2014-12-14 9:35 ` shuang.he
@ 2014-12-14 13:12 ` Ville Syrjälä
2014-12-14 23:37 ` Ben Widawsky
1 sibling, 1 reply; 24+ messages in thread
From: Ville Syrjälä @ 2014-12-14 13:12 UTC (permalink / raw)
To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky, DRI Development
On Sat, Dec 13, 2014 at 07:08:24PM -0800, Ben Widawsky wrote:
> If we're moving a bunch of buffers from the CPU domain to the GPU domain, and
> we've already blown out the entire cache via a wbinvd, there is nothing more to
> do.
>
> With this and the previous patches, I am seeing a 3x FPS increase on a certain
> benchmark which uses a giant 2d array texture. Unless I missed something in the
> code, it should only effect non-LLC i915 platforms.
>
> I haven't yet run any numbers for other benchmarks, nor have I attempted to
> check if various conformance tests still pass.
>
> NOTE: As mentioned in the previous patch, if one can easily obtain the largest
> buffer and attempt to flush it first, the results would be even more desirable.
So even with that optimization, if you only have tons of small buffers
that need to be flushed, you'd still take the clflush path for every
single one.
How difficult would it be to calculate the total size to be flushed first,
and then make the clflush vs. wbinvd decision based on that?
>
> Cc: DRI Development <dri-devel@lists.freedesktop.org>
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> ---
> drivers/gpu/drm/i915/i915_drv.h | 3 ++-
> drivers/gpu/drm/i915/i915_gem.c | 12 +++++-------
> drivers/gpu/drm/i915/i915_gem_execbuffer.c | 8 +++++---
> drivers/gpu/drm/i915/intel_lrc.c | 8 +++++---
> 4 files changed, 17 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index d68c75f..fdb92a3 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2642,7 +2642,8 @@ static inline bool i915_stop_ring_allow_warn(struct drm_i915_private *dev_priv)
> }
>
> void i915_gem_reset(struct drm_device *dev);
> -bool i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
> +enum drm_cache_flush
> +i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
> int __must_check i915_gem_object_finish_gpu(struct drm_i915_gem_object *obj);
> int __must_check i915_gem_init(struct drm_device *dev);
> int i915_gem_init_rings(struct drm_device *dev);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index de241eb..3746738 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3608,7 +3608,7 @@ err_unpin:
> return vma;
> }
>
> -bool
> +enum drm_cache_flush
> i915_gem_clflush_object(struct drm_i915_gem_object *obj,
> bool force)
> {
> @@ -3617,14 +3617,14 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj,
> * again at bind time.
> */
> if (obj->pages == NULL)
> - return false;
> + return DRM_CACHE_FLUSH_NONE;
>
> /*
> * Stolen memory is always coherent with the GPU as it is explicitly
> * marked as wc by the system, or the system is cache-coherent.
> */
> if (obj->stolen || obj->phys_handle)
> - return false;
> + return DRM_CACHE_FLUSH_NONE;
>
> /* If the GPU is snooping the contents of the CPU cache,
> * we do not need to manually clear the CPU cache lines. However,
> @@ -3635,12 +3635,10 @@ i915_gem_clflush_object(struct drm_i915_gem_object *obj,
> * tracking.
> */
> if (!force && cpu_cache_is_coherent(obj->base.dev, obj->cache_level))
> - return false;
> + return DRM_CACHE_FLUSH_NONE;
>
> trace_i915_gem_object_clflush(obj);
> - drm_clflush_sg(obj->pages);
> -
> - return true;
> + return drm_clflush_sg(obj->pages);
> }
>
> /** Flushes the GTT write domain for the object if it's dirty. */
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index 0c25f62..e8eb9e9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -827,7 +827,7 @@ i915_gem_execbuffer_move_to_gpu(struct intel_engine_cs *ring,
> {
> struct i915_vma *vma;
> uint32_t flush_domains = 0;
> - bool flush_chipset = false;
> + enum drm_cache_flush flush_chipset = DRM_CACHE_FLUSH_NONE;
> int ret;
>
> list_for_each_entry(vma, vmas, exec_list) {
> @@ -836,8 +836,10 @@ i915_gem_execbuffer_move_to_gpu(struct intel_engine_cs *ring,
> if (ret)
> return ret;
>
> - if (obj->base.write_domain & I915_GEM_DOMAIN_CPU)
> - flush_chipset |= i915_gem_clflush_object(obj, false);
> + if (obj->base.write_domain & I915_GEM_DOMAIN_CPU &&
> + flush_chipset != DRM_CACHE_FLUSH_WBINVD) {
> + flush_chipset = i915_gem_clflush_object(obj, false);
> + }
>
> flush_domains |= obj->base.write_domain;
> }
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 89b5577..a6c6ebd 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -611,7 +611,7 @@ static int execlists_move_to_gpu(struct intel_ringbuffer *ringbuf,
> struct intel_engine_cs *ring = ringbuf->ring;
> struct i915_vma *vma;
> uint32_t flush_domains = 0;
> - bool flush_chipset = false;
> + enum drm_cache_flush flush_chipset = DRM_CACHE_FLUSH_NONE;
> int ret;
>
> list_for_each_entry(vma, vmas, exec_list) {
> @@ -621,8 +621,10 @@ static int execlists_move_to_gpu(struct intel_ringbuffer *ringbuf,
> if (ret)
> return ret;
>
> - if (obj->base.write_domain & I915_GEM_DOMAIN_CPU)
> - flush_chipset |= i915_gem_clflush_object(obj, false);
> + if (obj->base.write_domain & I915_GEM_DOMAIN_CPU &&
> + flush_chipset != DRM_CACHE_FLUSH_WBINVD) {
> + flush_chipset = i915_gem_clflush_object(obj, false);
> + }
>
> flush_domains |= obj->base.write_domain;
> }
> --
> 2.1.3
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
--
Ville Syrjälä
Intel OTC
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf
2014-12-14 13:12 ` [Intel-gfx] " Ville Syrjälä
@ 2014-12-14 23:37 ` Ben Widawsky
2014-12-15 7:55 ` [Intel-gfx] " Daniel Vetter
0 siblings, 1 reply; 24+ messages in thread
From: Ben Widawsky @ 2014-12-14 23:37 UTC (permalink / raw)
To: Ville Syrjälä; +Cc: Intel GFX, DRI Development, Ben Widawsky
On Sun, Dec 14, 2014 at 03:12:21PM +0200, Ville Syrjälä wrote:
> On Sat, Dec 13, 2014 at 07:08:24PM -0800, Ben Widawsky wrote:
> > If we're moving a bunch of buffers from the CPU domain to the GPU domain, and
> > we've already blown out the entire cache via a wbinvd, there is nothing more to
> > do.
> >
> > With this and the previous patches, I am seeing a 3x FPS increase on a certain
> > benchmark which uses a giant 2d array texture. Unless I missed something in the
> > code, it should only affect non-LLC i915 platforms.
> >
> > I haven't yet run any numbers for other benchmarks, nor have I attempted to
> > check if various conformance tests still pass.
> >
> > NOTE: As mentioned in the previous patch, if one can easily obtain the largest
> > buffer and attempt to flush it first, the results would be even more desirable.
>
> So even with that optimization if you only have tons of small buffers
> that need to be flushed you'd still take the clflush path for every
> single one.
>
> How difficult would it be to calculate the total size to be flushed first,
> and then make the clflush vs. wbinvd decision based on that?
>
I'll write the patch and send it to Eero for test.
It's not hard, and I think that's a good idea as well. One reason I didn't put
such code in this series is that it moves away from a global DRM solution (and like
I said in the cover-letter, I am fine with that). Implementing this, I think in
the i915 code we'd just iterate through the BOs until we got to a certain
threshold, then just call wbinvd() from i915 and not even bother with drm_cache.
You could also maybe try to shortcut if there are more than X buffers.
However, for what you describe, I think it might make more sense to let
userspace specify an execbuf flag to do the wbinvd(). Userspace can trivially
determine such info, and it avoids having to iterate through the buffers an extra
time in the kernel.
I wonder if clflushing many small objects is showing up in profiles? So far,
this specific microbenchmark was the only profile I'd seen where the clflushes
show up.
Thanks.
[snip]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Intel-gfx] [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86
2014-12-14 12:59 ` [Intel-gfx] " Chris Wilson
@ 2014-12-15 4:06 ` Jesse Barnes
2014-12-15 19:54 ` Ben Widawsky
0 siblings, 1 reply; 24+ messages in thread
From: Jesse Barnes @ 2014-12-15 4:06 UTC (permalink / raw)
To: Chris Wilson, Ben Widawsky, DRI Development, Intel GFX,
Ben Widawsky
On 12/14/2014 4:59 AM, Chris Wilson wrote:
> One of the things wbinvd is considered evil for is that it blocks the
> CPU for an indeterminate amount of time - upsetting latency critical
> aspects of the OS. For example, the x86/mm has similar code to use
> wbinvd for large clflushes that caused a bit of controversy with RT:
>
> http://linux-kernel.2935.n7.nabble.com/PATCH-x86-Use-clflush-instead-of-wbinvd-whenever-possible-when-changing-mapping-td493751.html
>
> and also the use of wbinvd in the nvidia driver has also been noted as
> evil by RT folks.
>
> However as the wbinvd still exists, it can't be all that bad...
Yeah there are definitely tradeoffs here. In this particular case,
we're trying to flush out a ~140M object on every frame, which just
seems silly.
There's definitely room for optimization in Mesa too; avoiding a mapping
that marks a large bo as dirty would be good, but if we improve the
kernel in this area, even sloppy apps and existing binaries will speed up.
Maybe we could apply this only on !llc systems or something? I wonder
how much wbinvd performance varies across microarchitectures; maybe
Thomas's issue isn't really relevant anymore (at least one can hope).
When digging into this, I found that an optimization to remove the IPI
for wbinvd was clobbered during a merge; maybe that should be
resurrected too. Surely a single, global wbinvd is sufficient; we don't
need to do n_cpus^2 wbinvd + the associated invalidation bus signals here...
Alternately, we could insert some delays into this path just to make it
extra clear to userspace that they really shouldn't be hitting this in
the common case (and provide some additional interfaces to let them
avoid it by allowing flushing and dirty management in userspace).
Jesse
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/4] drm/cache: Use wbinvd helpers
2014-12-14 12:43 ` Chris Wilson
@ 2014-12-15 7:49 ` Daniel Vetter
0 siblings, 0 replies; 24+ messages in thread
From: Daniel Vetter @ 2014-12-15 7:49 UTC (permalink / raw)
To: Chris Wilson, Ben Widawsky, DRI Development, Intel GFX,
Ben Widawsky
On Sun, Dec 14, 2014 at 12:43:10PM +0000, Chris Wilson wrote:
> On Sat, Dec 13, 2014 at 07:08:21PM -0800, Ben Widawsky wrote:
> > When the original drm code was written there were no centralized functions for
> > doing a coordinated wbinvd across all CPUs. Now (since 2010) there are, so use
> > them instead of rolling a new one.
> >
> > Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
> > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Merged to drm-misc, thanks.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Intel-gfx] [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf
2014-12-14 23:37 ` Ben Widawsky
@ 2014-12-15 7:55 ` Daniel Vetter
2014-12-15 8:20 ` Chris Wilson
0 siblings, 1 reply; 24+ messages in thread
From: Daniel Vetter @ 2014-12-15 7:55 UTC (permalink / raw)
To: Ben Widawsky; +Cc: Intel GFX, DRI Development, Ben Widawsky
On Sun, Dec 14, 2014 at 03:37:36PM -0800, Ben Widawsky wrote:
> On Sun, Dec 14, 2014 at 03:12:21PM +0200, Ville Syrjälä wrote:
> > On Sat, Dec 13, 2014 at 07:08:24PM -0800, Ben Widawsky wrote:
> > > If we're moving a bunch of buffers from the CPU domain to the GPU domain, and
> > > we've already blown out the entire cache via a wbinvd, there is nothing more to
> > > do.
> > >
> > > With this and the previous patches, I am seeing a 3x FPS increase on a certain
> > > benchmark which uses a giant 2d array texture. Unless I missed something in the
> > > code, it should only affect non-LLC i915 platforms.
> > >
> > > I haven't yet run any numbers for other benchmarks, nor have I attempted to
> > > check if various conformance tests still pass.
> > >
> > > NOTE: As mentioned in the previous patch, if one can easily obtain the largest
> > > buffer and attempt to flush it first, the results would be even more desirable.
> >
> > So even with that optimization if you only have tons of small buffers
> > that need to be flushed you'd still take the clflush path for every
> > single one.
> >
> > How difficult would it be to calculate the total size to be flushed first,
> > and then make the clflush vs. wbinvd decision based on that?
> >
>
> I'll write the patch and send it to Eero for test.
>
> It's not hard, and I think that's a good idea as well. One reason I didn't put
> such code in this series is that it moves away from a global DRM solution (and like
> I said in the cover-letter, I am fine with that). Implementing this, I think in
> the i915 code we'd just iterate through the BOs until we got to a certain
> threshold, then just call wbinvd() from i915 and not even bother with drm_cache.
> You could also maybe try to shortcut if there are more than X buffers.
I don't mind an i915 specific solution (we have them already in many
places). So will wait for the results of this experiments before merging
more patches.
> However, for what you describe, I think it might make more sense to let
> userspace specify an execbuf flag to do the wbinvd(). Userspace can trivially
> determine such info, it prevents having to iterate through the buffers an extra
> time in the kernel.
If we keep the running tally of clflushing in the reservation step and
then use that in move_to_gpu we shouldn't need an additional loop. And
even if the estimate from the first loop is off a bit (e.g. due to
eviction to make space for new buffers) it won't matter - it's just an
optimization. Imo asking userspace to do this for the kernel doesn't make
much sense since the kernel already keeps track of domains.
Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Intel-gfx] [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf
2014-12-15 7:55 ` [Intel-gfx] " Daniel Vetter
@ 2014-12-15 8:20 ` Chris Wilson
2014-12-15 19:56 ` Ben Widawsky
0 siblings, 1 reply; 24+ messages in thread
From: Chris Wilson @ 2014-12-15 8:20 UTC (permalink / raw)
To: Daniel Vetter; +Cc: Ben Widawsky, Intel GFX, DRI Development, Ben Widawsky
On Mon, Dec 15, 2014 at 08:55:32AM +0100, Daniel Vetter wrote:
> On Sun, Dec 14, 2014 at 03:37:36PM -0800, Ben Widawsky wrote:
> > On Sun, Dec 14, 2014 at 03:12:21PM +0200, Ville Syrjälä wrote:
> > > On Sat, Dec 13, 2014 at 07:08:24PM -0800, Ben Widawsky wrote:
> > > > If we're moving a bunch of buffers from the CPU domain to the GPU domain, and
> > > > we've already blown out the entire cache via a wbinvd, there is nothing more to
> > > > do.
> > > >
> > > > With this and the previous patches, I am seeing a 3x FPS increase on a certain
> > > > benchmark which uses a giant 2d array texture. Unless I missed something in the
> > > > code, it should only affect non-LLC i915 platforms.
> > > >
> > > > I haven't yet run any numbers for other benchmarks, nor have I attempted to
> > > > check if various conformance tests still pass.
> > > >
> > > > NOTE: As mentioned in the previous patch, if one can easily obtain the largest
> > > > buffer and attempt to flush it first, the results would be even more desirable.
> > >
> > > So even with that optimization if you only have tons of small buffers
> > > that need to be flushed you'd still take the clflush path for every
> > > single one.
> > >
> > > How difficult would it be to calculate the total size to be flushed first,
> > > and then make the clflush vs. wbinvd decision based on that?
> > >
> >
> > I'll write the patch and send it to Eero for test.
> >
> > It's not hard, and I think that's a good idea as well. One reason I didn't put
> > such code in this series is that it moves away from a global DRM solution (and like
> > I said in the cover-letter, I am fine with that). Implementing this, I think in
> > the i915 code we'd just iterate through the BOs until we got to a certain
> > threshold, then just call wbinvd() from i915 and not even bother with drm_cache.
> > You could also maybe try to shortcut if there are more than X buffers.
>
> I don't mind an i915 specific solution (we have them already in many
> places). So will wait for the results of this experiments before merging
> more patches.
I actually think an i915 specific solution is required, as making
drm_clflush_pages autoselect is likely to cause regressions on unaware
drivers (e.g. anything that has a reservation loop in execbuf like i915).
I do think that execbuf regularly hitting clflush is something to be
concerned about in userspace (even a wbinvd every execbuf is not going
to be nice, even an mfence per execbuf shows up on profiles!), but we do
have to occasionally flush large objects like framebuffers which
probably would be faster using wbinvd.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Intel-gfx] [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86
2014-12-15 4:06 ` Jesse Barnes
@ 2014-12-15 19:54 ` Ben Widawsky
0 siblings, 0 replies; 24+ messages in thread
From: Ben Widawsky @ 2014-12-15 19:54 UTC (permalink / raw)
To: Jesse Barnes; +Cc: Intel GFX, Ben Widawsky, DRI Development
On Sun, Dec 14, 2014 at 08:06:20PM -0800, Jesse Barnes wrote:
> On 12/14/2014 4:59 AM, Chris Wilson wrote:
> >One of the things wbinvd is considered evil for is that it blocks the
> >CPU for an indeterminate amount of time - upsetting latency critical
> >aspects of the OS. For example, the x86/mm has similar code to use
> >wbinvd for large clflushes that caused a bit of controversy with RT:
> >
> >http://linux-kernel.2935.n7.nabble.com/PATCH-x86-Use-clflush-instead-of-wbinvd-whenever-possible-when-changing-mapping-td493751.html
> >
> >and also the use of wbinvd in the nvidia driver has also been noted as
> >evil by RT folks.
> >
> >However as the wbinvd still exists, it can't be all that bad...
That patch looks eerily similar. I guess I am slightly better in that I take the
cache size into account.
>
> Yeah there are definitely tradeoffs here. In this particular case, we're
> trying to flush out a ~140M object on every frame, which just seems silly.
>
> There's definitely room for optimization in Mesa too; avoiding a mapping
> that marks a large bo as dirty would be good, but if we improve the kernel
> in this area, even sloppy apps and existing binaries will speed up.
>
> Maybe we could apply this only on !llc systems or something? I wonder how
> much wbinvd performance varies across microarchitectures; maybe Thomas's
> issue isn't really relevant anymore (at least one can hope).
I am pretty sure from the mesa perspective we do not hit this path on non-llc
systems because we end up with a linear buffer, and just CPU map the buffer. In
addition to trying to manage the cache dirtiness ourselves, the current code
which determines how it wants to manage subregions within large textures could
probably use some eyes to make sure the order in which we decide the various
fallbacks makes sense for all sizes, and all platforms. I /think/ we can do
better, but when I was writing a patch, the code got messy fast.
>
> When digging into this, I found that an optimization to remove the IPI for
> wbinvd was clobbered during a merge; maybe that should be resurrected too.
> Surely a single, global wbinvd is sufficient; we don't need to do n_cpus^2
> wbinvd + the associated invalidation bus signals here...
If this actually works, then there should be no CPU stall at all.
>
> Alternately, we could insert some delays into this path just to make it
> extra clear to userspace that they really shouldn't be hitting this in the
> common case (and provide some additional interfaces to let them avoid it by
> allowing flushing and dirty management in userspace).
I don't think such an extreme step is needed. If we really don't need the IPI,
then I think this path can only be faster than CLFLUSH.
>
> Jesse
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf
2014-12-15 8:20 ` Chris Wilson
@ 2014-12-15 19:56 ` Ben Widawsky
2014-12-15 20:39 ` Chris Wilson
0 siblings, 1 reply; 24+ messages in thread
From: Ben Widawsky @ 2014-12-15 19:56 UTC (permalink / raw)
To: Chris Wilson, Daniel Vetter, Intel GFX, DRI Development,
Ben Widawsky
On Mon, Dec 15, 2014 at 08:20:50AM +0000, Chris Wilson wrote:
> On Mon, Dec 15, 2014 at 08:55:32AM +0100, Daniel Vetter wrote:
> > On Sun, Dec 14, 2014 at 03:37:36PM -0800, Ben Widawsky wrote:
> > > On Sun, Dec 14, 2014 at 03:12:21PM +0200, Ville Syrjälä wrote:
> > > > On Sat, Dec 13, 2014 at 07:08:24PM -0800, Ben Widawsky wrote:
> > > > > If we're moving a bunch of buffers from the CPU domain to the GPU domain, and
> > > > > we've already blown out the entire cache via a wbinvd, there is nothing more to
> > > > > do.
> > > > >
> > > > > With this and the previous patches, I am seeing a 3x FPS increase on a certain
> > > > > benchmark which uses a giant 2d array texture. Unless I missed something in the
> > > > > code, it should only affect non-LLC i915 platforms.
> > > > >
> > > > > I haven't yet run any numbers for other benchmarks, nor have I attempted to
> > > > > check if various conformance tests still pass.
> > > > >
> > > > > NOTE: As mentioned in the previous patch, if one can easily obtain the largest
> > > > > buffer and attempt to flush it first, the results would be even more desirable.
> > > >
> > > > So even with that optimization if you only have tons of small buffers
> > > > that need to be flushed you'd still take the clflush path for every
> > > > single one.
> > > >
> > > > How difficult would it be to calculate the total size to be flushed first,
> > > > and then make the clflush vs. wbinvd decision based on that?
> > > >
> > >
> > > I'll write the patch and send it to Eero for test.
> > >
> > > It's not hard, and I think that's a good idea as well. One reason I didn't put
> > > such code in this series is that it moves away from a global DRM solution (and like
> > > I said in the cover-letter, I am fine with that). Implementing this, I think in
> > > the i915 code we'd just iterate through the BOs until we got to a certain
> > > threshold, then just call wbinvd() from i915 and not even bother with drm_cache.
> > > You could also maybe try to shortcut if there are more than X buffers.
> >
> > I don't mind an i915 specific solution (we have them already in many
> > places). So will wait for the results of this experiments before merging
> > more patches.
>
> I actually think an i915 specific solution is required, as making
> drm_clflush_pages autoselect is likely to cause regressions on unaware
> drivers (e.g. anything that has a reservation loop in execbuf like i915).
Assuming the stall is gone as Jesse said in the other thread, I can't envision a
scenario where wbinvd would do worse on large objects.
>
> I do think that execbuf regularly hitting clflush is something to be
> concerned about in userspace (even a wbinvd every execbuf is not going
> to be nice, even an mfence per execbuf shows up on profiles!), but we do
> have to occassionally flush large objects like framebuffers which
> probably would be faster using wbinvd.
> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH] [v2] drm/cache: Use wbinvd helpers
2014-12-14 3:08 ` [PATCH 1/4] drm/cache: Use wbinvd helpers Ben Widawsky
2014-12-14 12:43 ` Chris Wilson
@ 2014-12-15 20:26 ` Ben Widawsky
2014-12-16 2:26 ` shuang.he
2014-12-16 7:56 ` Daniel Vetter
1 sibling, 2 replies; 24+ messages in thread
From: Ben Widawsky @ 2014-12-15 20:26 UTC (permalink / raw)
To: DRI Development; +Cc: Ben Widawsky, Intel GFX
From: Ben Widawsky <ben@bwidawsk.net>
When the original drm code was written there were no centralized functions for
doing a coordinated wbinvd across all CPUs. Now (since 2010) there are, so use
them instead of rolling a new one.
v2: On x86 UP systems the wbinvd_on_all_cpus() is defined as a static inline in
smp.h. We must therefore include this file so we don't get compiler errors.
This error was found by the 0-DAY kernel test infrastructure. We only need this for
x86.
Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> (v1)
---
drivers/gpu/drm/drm_cache.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index a6b6906..9a62d7a 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -32,6 +32,7 @@
#include <drm/drmP.h>
#if defined(CONFIG_X86)
+#include <asm/smp.h>
/*
* clflushopt is an unordered instruction which needs fencing with mfence or
@@ -64,12 +65,6 @@ static void drm_cache_flush_clflush(struct page *pages[],
drm_clflush_page(*pages++);
mb();
}
-
-static void
-drm_clflush_ipi_handler(void *null)
-{
- wbinvd();
-}
#endif
void
@@ -82,7 +77,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
return;
}
- if (on_each_cpu(drm_clflush_ipi_handler, NULL, 1) != 0)
+ if (wbinvd_on_all_cpus())
printk(KERN_ERR "Timed out waiting for cache flush.\n");
#elif defined(__powerpc__)
@@ -121,7 +116,7 @@ drm_clflush_sg(struct sg_table *st)
return;
}
- if (on_each_cpu(drm_clflush_ipi_handler, NULL, 1) != 0)
+ if (wbinvd_on_all_cpus())
printk(KERN_ERR "Timed out waiting for cache flush.\n");
#else
printk(KERN_ERR "Architecture has no drm_cache.c support\n");
@@ -144,7 +139,7 @@ drm_clflush_virt_range(void *addr, unsigned long length)
return;
}
- if (on_each_cpu(drm_clflush_ipi_handler, NULL, 1) != 0)
+ if (wbinvd_on_all_cpus())
printk(KERN_ERR "Timed out waiting for cache flush.\n");
#else
printk(KERN_ERR "Architecture has no drm_cache.c support\n");
--
2.1.3
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf
2014-12-15 19:56 ` Ben Widawsky
@ 2014-12-15 20:39 ` Chris Wilson
2014-12-15 21:06 ` [Intel-gfx] " Ben Widawsky
0 siblings, 1 reply; 24+ messages in thread
From: Chris Wilson @ 2014-12-15 20:39 UTC (permalink / raw)
To: Ben Widawsky; +Cc: Intel GFX, DRI Development, Ben Widawsky
On Mon, Dec 15, 2014 at 11:56:05AM -0800, Ben Widawsky wrote:
> On Mon, Dec 15, 2014 at 08:20:50AM +0000, Chris Wilson wrote:
> > On Mon, Dec 15, 2014 at 08:55:32AM +0100, Daniel Vetter wrote:
> > > On Sun, Dec 14, 2014 at 03:37:36PM -0800, Ben Widawsky wrote:
> > > > On Sun, Dec 14, 2014 at 03:12:21PM +0200, Ville Syrjälä wrote:
> > > > > On Sat, Dec 13, 2014 at 07:08:24PM -0800, Ben Widawsky wrote:
> > > > > > If we're moving a bunch of buffers from the CPU domain to the GPU domain, and
> > > > > > we've already blown out the entire cache via a wbinvd, there is nothing more to
> > > > > > do.
> > > > > >
> > > > > > With this and the previous patches, I am seeing a 3x FPS increase on a certain
> > > > > > benchmark which uses a giant 2d array texture. Unless I missed something in the
> > > > > > code, it should only affect non-LLC i915 platforms.
> > > > > >
> > > > > > I haven't yet run any numbers for other benchmarks, nor have I attempted to
> > > > > > check if various conformance tests still pass.
> > > > > >
> > > > > > NOTE: As mentioned in the previous patch, if one can easily obtain the largest
> > > > > > buffer and attempt to flush it first, the results would be even more desirable.
> > > > >
> > > > > So even with that optimization if you only have tons of small buffers
> > > > > that need to be flushed you'd still take the clflush path for every
> > > > > single one.
> > > > >
> > > > > How difficult would it be to calculate the total size to be flushed first,
> > > > > and then make the clflush vs. wbinvd decision based on that?
> > > > >
> > > >
> > > > I'll write the patch and send it to Eero for test.
> > > >
> > > > It's not hard, and I think that's a good idea as well. One reason I didn't put
> > > > such code in this series is that it moves away from a global DRM solution (and like
> > > > I said in the cover-letter, I am fine with that). Implementing this, I think in
> > > > the i915 code we'd just iterate through the BOs until we got to a certain
> > > > threshold, then just call wbinvd() from i915 and not even bother with drm_cache.
> > > > You could also maybe try to shortcut if there are more than X buffers.
> > >
> > > I don't mind an i915 specific solution (we have them already in many
> > > places). So will wait for the results of this experiments before merging
> > > more patches.
> >
> > I actually think an i915 specific solution is required, as making
> > drm_clflush_pages autoselect is likely to cause regressions on unaware
> > drivers (e.g. anything that has a reservation loop in execbuf like i915).
>
> Assuming the stall is gone as Jesse said in the other thread, I can't envision a
> scenario where wbinvd would do worse on large objects.
It is the multiple wbinvd performed at each execbuffer that is worrisome.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Intel-gfx] [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf
2014-12-15 20:39 ` Chris Wilson
@ 2014-12-15 21:06 ` Ben Widawsky
2014-12-16 7:57 ` Chris Wilson
0 siblings, 1 reply; 24+ messages in thread
From: Ben Widawsky @ 2014-12-15 21:06 UTC (permalink / raw)
To: Chris Wilson, Daniel Vetter, Intel GFX, DRI Development,
Ben Widawsky
On Mon, Dec 15, 2014 at 08:39:35PM +0000, Chris Wilson wrote:
> On Mon, Dec 15, 2014 at 11:56:05AM -0800, Ben Widawsky wrote:
> > On Mon, Dec 15, 2014 at 08:20:50AM +0000, Chris Wilson wrote:
> > > On Mon, Dec 15, 2014 at 08:55:32AM +0100, Daniel Vetter wrote:
> > > > On Sun, Dec 14, 2014 at 03:37:36PM -0800, Ben Widawsky wrote:
> > > > > On Sun, Dec 14, 2014 at 03:12:21PM +0200, Ville Syrjälä wrote:
> > > > > > On Sat, Dec 13, 2014 at 07:08:24PM -0800, Ben Widawsky wrote:
> > > > > > > If we're moving a bunch of buffers from the CPU domain to the GPU domain, and
> > > > > > > we've already blown out the entire cache via a wbinvd, there is nothing more to
> > > > > > > do.
> > > > > > >
> > > > > > > With this and the previous patches, I am seeing a 3x FPS increase on a certain
> > > > > > > benchmark which uses a giant 2d array texture. Unless I missed something in the
> > > > > > > code, it should only affect non-LLC i915 platforms.
> > > > > > >
> > > > > > > I haven't yet run any numbers for other benchmarks, nor have I attempted to
> > > > > > > check if various conformance tests still pass.
> > > > > > >
> > > > > > > NOTE: As mentioned in the previous patch, if one can easily obtain the largest
> > > > > > > buffer and attempt to flush it first, the results would be even more desirable.
> > > > > >
> > > > > > So even with that optimization if you only have tons of small buffers
> > > > > > that need to be flushed you'd still take the clflush path for every
> > > > > > single one.
> > > > > >
> > > > > > How difficult would it be to calculate the total size to be flushed first,
> > > > > > and then make the clflush vs. wbinvd decision based on that?
> > > > > >
> > > > >
> > > > > I'll write the patch and send it to Eero for test.
> > > > >
> > > > > It's not hard, and I think that's a good idea as well. One reason I didn't put
> > > > > such code in this series is that it moves away from a global DRM solution (and like
> > > > > I said in the cover-letter, I am fine with that). Implementing this, I think in
> > > > > the i915 code we'd just iterate through the BOs until we got to a certain
> > > > > threshold, then just call wbinvd() from i915 and not even bother with drm_cache.
> > > > > You could also maybe try to shortcut if there are more than X buffers.
> > > >
> > > > I don't mind an i915 specific solution (we have them already in many
> > > > places). So will wait for the results of this experiments before merging
> > > > more patches.
> > >
> > > I actually think an i915 specific solution is required, as making
> > > drm_clflush_pages autoselect is likely to cause regressions on unaware
> > > drivers (e.g. anything that has a reservation loop in execbuf like i915).
> >
> > Assuming the stall is gone as Jesse said in the other thread, I can't envision a
> > scenario where wbinvd would do worse on large objects.
>
> It is the multiple wbinvd performed at each execbuffer that is worrisome.
> -Chris
>
This patch attempts to avoid that by dropping all flushing after the first
WBINVD.
--
Ben Widawsky, Intel Open Source Technology Center
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86
2014-12-14 4:15 ` Matt Turner
@ 2014-12-15 22:07 ` Ben Widawsky
0 siblings, 0 replies; 24+ messages in thread
From: Ben Widawsky @ 2014-12-15 22:07 UTC (permalink / raw)
To: Matt Turner; +Cc: Intel GFX, Ben Widawsky, DRI Development
On Sat, Dec 13, 2014 at 08:15:22PM -0800, Matt Turner wrote:
> On Sat, Dec 13, 2014 at 7:08 PM, Ben Widawsky
> <benjamin.widawsky@intel.com> wrote:
> > Any GEM driver which has very large objects and a slow CPU is subject to very
> > long waits simply for clflushing incoherent objects. Generally, each individual
> > object is not a problem, but if you have very large objects, or very many
> > objects, the flushing begins to show up in profiles. Because on x86 we know the
> > cache size, we can easily determine when an object will use all the cache, and
> > forego iterating over each cacheline.
> >
> > We need to be careful when using wbinvd. wbinvd() is itself potentially slow
> > because it requires synchronizing the flush across all CPUs so they have a
> > coherent view of memory. This can result in either stalling work being done on
> > other CPUs, or this call itself stalling while waiting for a CPU to accept the
> > interrupt. wbinvd() also has the downside of invalidating all cachelines,
> > so we don't want to use it unless we're sure we already own most of the
> > cachelines.
> >
> > The current algorithm is very naive. I think it can be tweaked more, and it
> > would be good if someone else gave it some thought. I am pretty confident that
> > in i915 we can even skip the IPI in the execbuf path with minimal code change
> > (or perhaps just some verifying of the existing code). It would be nice to hear
> > what other developers who depend on this code think.
> >
> > Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
> > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > ---
> > drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
> > 1 file changed, 17 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
> > index d7797e8..6009c2d 100644
> > --- a/drivers/gpu/drm/drm_cache.c
> > +++ b/drivers/gpu/drm/drm_cache.c
> > @@ -64,6 +64,20 @@ static void drm_cache_flush_clflush(struct page *pages[],
> > drm_clflush_page(*pages++);
> > mb();
> > }
> > +
> > +static bool
> > +drm_cache_should_clflush(unsigned long num_pages)
> > +{
> > + const int cache_size = boot_cpu_data.x86_cache_size;
> > +
> > + /* For now the algorithm simply checks if the number of pages to be
> > + * flushed is greater than the entire system cache. One could make the
> > + * function more aware of the actual system (i.e. if SMP, how large
> > + * the cache is, CPU frequency, etc.); all of those help to determine
> > + * when to wbinvd(). */
> > + WARN_ON_ONCE(!cache_size);
> > + return !cache_size || num_pages < (cache_size >> 2);
> > +}
> > #endif
> >
> > void
> > @@ -71,7 +85,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
> > {
> >
> > #if defined(CONFIG_X86)
> > - if (cpu_has_clflush) {
> > + if (cpu_has_clflush && drm_cache_should_clflush(num_pages)) {
> > drm_cache_flush_clflush(pages, num_pages);
> > return;
> > }
> > @@ -104,7 +118,7 @@ void
> > drm_clflush_sg(struct sg_table *st)
> > {
> > #if defined(CONFIG_X86)
> > - if (cpu_has_clflush) {
> > + if (cpu_has_clflush && drm_cache_should_clflush(st->nents)) {
> > struct sg_page_iter sg_iter;
> >
> > mb();
> > @@ -128,7 +142,7 @@ void
> > drm_clflush_virt_range(void *addr, unsigned long length)
> > {
> > #if defined(CONFIG_X86)
> > - if (cpu_has_clflush) {
> > + if (cpu_has_clflush && drm_cache_should_clflush(length / PAGE_SIZE)) {
>
> If length isn't a multiple of page size, isn't this ignoring the
> remainder? Should it be rounding length up to the next multiple of
> PAGE_SIZE, like ROUND_UP_TO?
Yeah, we could round_up(). In practice it probably won't matter. I actually think
it would be better to pass a size to drm_cache_should_clflush() and let that
round it up. It sounds like people don't want this patch anyway, so I'll make
the equivalent change in the i915-only patch.
--
Ben Widawsky, Intel Open Source Technology Center
* Re: [PATCH] [v2] drm/cache: Use wbinvd helpers
2014-12-15 20:26 ` [PATCH] [v2] " Ben Widawsky
@ 2014-12-16 2:26 ` shuang.he
2014-12-16 7:56 ` Daniel Vetter
1 sibling, 0 replies; 24+ messages in thread
From: shuang.he @ 2014-12-16 2:26 UTC (permalink / raw)
To: shuang.he, intel-gfx, benjamin.widawsky
Tested-By: PRC QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
-------------------------------------Summary-------------------------------------
Platform Delta drm-intel-nightly Series Applied
PNV 364/364 364/364
ILK +4-4 360/366 360/366
SNB 448/450 448/450
IVB 497/498 497/498
BYT 289/289 289/289
HSW 563/564 563/564
BDW 417/417 417/417
-------------------------------------Detailed-------------------------------------
Platform Test drm-intel-nightly Series Applied
*ILK igt_kms_pipe_crc_basic_bad-source PASS(2, M26) DMESG_WARN(1, M26)
*ILK igt_kms_setmode_invalid-clone-exclusive-crtc PASS(2, M26) DMESG_WARN(1, M26)
*ILK igt_kms_flip_rcs-wf_vblank-vs-dpms-interruptible PASS(2, M26) DMESG_WARN(1, M26)
ILK igt_drv_suspend_fence-restore-untiled DMESG_WARN(1, M26) PASS(5, M37M26) PASS(1, M26)
ILK igt_kms_flip_bcs-flip-vs-modeset-interruptible DMESG_WARN(1, M26) PASS(5, M37M26) PASS(1, M26)
ILK igt_kms_flip_busy-flip-interruptible DMESG_WARN(1, M26) PASS(5, M37M26) PASS(1, M26)
ILK igt_kms_flip_flip-vs-rmfb-interruptible DMESG_WARN(1, M26) PASS(5, M37M26) PASS(1, M26)
*ILK igt_kms_flip_wf_vblank-vs-modeset-interruptible PASS(2, M26) DMESG_WARN(1, M26)
Note: You need to pay more attention to line start with '*'
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
* Re: [PATCH] [v2] drm/cache: Use wbinvd helpers
2014-12-15 20:26 ` [PATCH] [v2] " Ben Widawsky
2014-12-16 2:26 ` shuang.he
@ 2014-12-16 7:56 ` Daniel Vetter
1 sibling, 0 replies; 24+ messages in thread
From: Daniel Vetter @ 2014-12-16 7:56 UTC (permalink / raw)
To: Ben Widawsky; +Cc: Ben Widawsky, Intel GFX, DRI Development
On Mon, Dec 15, 2014 at 12:26:46PM -0800, Ben Widawsky wrote:
> From: Ben Widawsky <ben@bwidawsk.net>
>
> When the original drm code was written there were no centralized functions for
> doing a coordinated wbinvd across all CPUs. Now (since 2010) there are, so use
> them instead of rolling a new one.
>
> v2: On x86 UP systems, wbinvd_on_all_cpus() is defined as a static inline in
> smp.h, so we must include that file to avoid compiler errors. This error was
> found by the 0-DAY kernel test infrastructure. We only need this for x86.
Oh dear UP ;-) Thanks for the quick update; new patch merged to dinq.
-Daniel
>
> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> (v1)
> ---
> drivers/gpu/drm/drm_cache.c | 13 ++++---------
> 1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
> index a6b6906..9a62d7a 100644
> --- a/drivers/gpu/drm/drm_cache.c
> +++ b/drivers/gpu/drm/drm_cache.c
> @@ -32,6 +32,7 @@
> #include <drm/drmP.h>
>
> #if defined(CONFIG_X86)
> +#include <asm/smp.h>
>
> /*
> * clflushopt is an unordered instruction which needs fencing with mfence or
> @@ -64,12 +65,6 @@ static void drm_cache_flush_clflush(struct page *pages[],
> drm_clflush_page(*pages++);
> mb();
> }
> -
> -static void
> -drm_clflush_ipi_handler(void *null)
> -{
> - wbinvd();
> -}
> #endif
>
> void
> @@ -82,7 +77,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
> return;
> }
>
> - if (on_each_cpu(drm_clflush_ipi_handler, NULL, 1) != 0)
> + if (wbinvd_on_all_cpus())
> printk(KERN_ERR "Timed out waiting for cache flush.\n");
>
> #elif defined(__powerpc__)
> @@ -121,7 +116,7 @@ drm_clflush_sg(struct sg_table *st)
> return;
> }
>
> - if (on_each_cpu(drm_clflush_ipi_handler, NULL, 1) != 0)
> + if (wbinvd_on_all_cpus())
> printk(KERN_ERR "Timed out waiting for cache flush.\n");
> #else
> printk(KERN_ERR "Architecture has no drm_cache.c support\n");
> @@ -144,7 +139,7 @@ drm_clflush_virt_range(void *addr, unsigned long length)
> return;
> }
>
> - if (on_each_cpu(drm_clflush_ipi_handler, NULL, 1) != 0)
> + if (wbinvd_on_all_cpus())
> printk(KERN_ERR "Timed out waiting for cache flush.\n");
> #else
> printk(KERN_ERR "Architecture has no drm_cache.c support\n");
> --
> 2.1.3
>
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
* Re: [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf
2014-12-15 21:06 ` [Intel-gfx] " Ben Widawsky
@ 2014-12-16 7:57 ` Chris Wilson
2014-12-16 19:38 ` Ben Widawsky
0 siblings, 1 reply; 24+ messages in thread
From: Chris Wilson @ 2014-12-16 7:57 UTC (permalink / raw)
To: Ben Widawsky; +Cc: Intel GFX, DRI Development, Ben Widawsky
On Mon, Dec 15, 2014 at 01:06:40PM -0800, Ben Widawsky wrote:
> On Mon, Dec 15, 2014 at 08:39:35PM +0000, Chris Wilson wrote:
> > On Mon, Dec 15, 2014 at 11:56:05AM -0800, Ben Widawsky wrote:
> > > On Mon, Dec 15, 2014 at 08:20:50AM +0000, Chris Wilson wrote:
> > > > On Mon, Dec 15, 2014 at 08:55:32AM +0100, Daniel Vetter wrote:
> > > > > On Sun, Dec 14, 2014 at 03:37:36PM -0800, Ben Widawsky wrote:
> > > > > > It's not hard, and I think that's a good idea as well. One reason I didn't put
> > > > > > such code in this series is that moves away from a global DRM solution (and like
> > > > > > I said in the cover-letter, I am fine with that). Implementing this, I think in
> > > > > > the i915 code we'd just iterate through the BOs until we got to a certain
> > > > > > threshold, then just call wbinvd() from i915 and not even bother with drm_cache.
> > > > > > You could also maybe try to shortcut if there are more than X buffers.
> > > > >
> > > > > I don't mind an i915-specific solution (we have them already in many
> > > > > places). So I will wait for the results of these experiments before
> > > > > merging more patches.
> > > >
> > > > I actually think an i915-specific solution is required, as making
> > > > drm_clflush_pages autoselect is likely to cause regressions on unaware
> > > > drivers (e.g. anything that has a reservation loop in execbuf like i915).
> > >
> > > Assuming the stall is gone as Jesse said in the other thread, I can't envision a
> > > scenario where wbinvd would do worse on large objects.
> >
> > It is the multiple wbinvd performed at each execbuffer that is worrisome.
>
> This patch attempts to avoid that by dropping all flushing after the first
> WBINVD.
But you can't make the central change to drm_clflush_* without driver
specific workarounds like this patch...
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf
2014-12-16 7:57 ` Chris Wilson
@ 2014-12-16 19:38 ` Ben Widawsky
0 siblings, 0 replies; 24+ messages in thread
From: Ben Widawsky @ 2014-12-16 19:38 UTC (permalink / raw)
To: Chris Wilson, Daniel Vetter, Intel GFX, DRI Development,
Ben Widawsky
On Tue, Dec 16, 2014 at 07:57:39AM +0000, Chris Wilson wrote:
> On Mon, Dec 15, 2014 at 01:06:40PM -0800, Ben Widawsky wrote:
> > On Mon, Dec 15, 2014 at 08:39:35PM +0000, Chris Wilson wrote:
> > > On Mon, Dec 15, 2014 at 11:56:05AM -0800, Ben Widawsky wrote:
> > > > On Mon, Dec 15, 2014 at 08:20:50AM +0000, Chris Wilson wrote:
> > > > > On Mon, Dec 15, 2014 at 08:55:32AM +0100, Daniel Vetter wrote:
> > > > > > On Sun, Dec 14, 2014 at 03:37:36PM -0800, Ben Widawsky wrote:
> > > > > > > It's not hard, and I think that's a good idea as well. One reason I didn't put
> > > > > > > such code in this series is that moves away from a global DRM solution (and like
> > > > > > > I said in the cover-letter, I am fine with that). Implementing this, I think in
> > > > > > > the i915 code we'd just iterate through the BOs until we got to a certain
> > > > > > > threshold, then just call wbinvd() from i915 and not even bother with drm_cache.
> > > > > > > You could also maybe try to shortcut if there are more than X buffers.
> > > > > >
> > > > > > I don't mind an i915-specific solution (we have them already in many
> > > > > > places). So I will wait for the results of these experiments before
> > > > > > merging more patches.
> > > > >
> > > > > I actually think an i915-specific solution is required, as making
> > > > > drm_clflush_pages autoselect is likely to cause regressions on unaware
> > > > > drivers (e.g. anything that has a reservation loop in execbuf like i915).
> > > >
> > > > Assuming the stall is gone as Jesse said in the other thread, I can't envision a
> > > > scenario where wbinvd would do worse on large objects.
> > >
> > > It is the multiple wbinvd performed at each execbuffer that is worrisome.
> >
> > This patch attempts to avoid that by dropping all flushing after the first
> > WBINVD.
>
> But you can't make the central change to drm_clflush_* without driver
> specific workarounds like this patch...
Good point. I'll do an i915 one when I find the time. I was still hoping to get
some numbers from Eero on this series.
end of thread, other threads:[~2014-12-16 19:38 UTC | newest]
Thread overview: 24+ messages
-- links below jump to the message on this page --
[not found] <1418526504-26316-1-git-send-email-benjamin.widawsky@intel.com>
2014-12-14 3:08 ` [PATCH 1/4] drm/cache: Use wbinvd helpers Ben Widawsky
2014-12-14 12:43 ` Chris Wilson
2014-12-15 7:49 ` Daniel Vetter
2014-12-15 20:26 ` [PATCH] [v2] " Ben Widawsky
2014-12-16 2:26 ` shuang.he
2014-12-16 7:56 ` Daniel Vetter
2014-12-14 3:08 ` [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86 Ben Widawsky
2014-12-14 4:15 ` Matt Turner
2014-12-15 22:07 ` Ben Widawsky
2014-12-14 12:59 ` [Intel-gfx] " Chris Wilson
2014-12-15 4:06 ` Jesse Barnes
2014-12-15 19:54 ` Ben Widawsky
2014-12-14 3:08 ` [PATCH 3/4] drm/cache: Return what type of cache flush occurred Ben Widawsky
2014-12-14 3:08 ` [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf Ben Widawsky
2014-12-14 9:35 ` shuang.he
2014-12-14 13:12 ` [Intel-gfx] " Ville Syrjälä
2014-12-14 23:37 ` Ben Widawsky
2014-12-15 7:55 ` [Intel-gfx] " Daniel Vetter
2014-12-15 8:20 ` Chris Wilson
2014-12-15 19:56 ` Ben Widawsky
2014-12-15 20:39 ` Chris Wilson
2014-12-15 21:06 ` [Intel-gfx] " Ben Widawsky
2014-12-16 7:57 ` Chris Wilson
2014-12-16 19:38 ` Ben Widawsky