[RFC] drm/ttm: add minimum residency constraint for bo eviction

* [RFC] drm/ttm: add minimum residency constraint for bo eviction
@ 2012-11-28 15:58 j.glisse
  2012-11-28 15:58 ` [PATCH] " j.glisse
  2012-11-28 21:51 ` [RFC] " Marek Olšák
  0 siblings, 2 replies; 34+ messages in thread
From: j.glisse @ 2012-11-28 15:58 UTC (permalink / raw)
  To: dri-devel

So I spent the day looking at ttm and eviction. The first patch I sent
earlier is, I believe, something that should be merged. This patch, however,
is more about finding out whether other people are interested in a similar
mechanism shared among drivers through ttm. Otherwise I could just move its
logic into the radeon driver.

So the idea of this patch is that we don't want to constantly move objects
in and out of certain memory pools, mostly VRAM. So it adds a minimum
residency time, and no object that has been in a given pool for less
than this residency time can be moved out. It comes close to solving the
regression we are having with radeon since the gallium driver change, and
probably improves some other workloads.

Statistics I gathered on xonotic/realquake showed that we can have as much
as 1GB moving in each direction (VRAM to system and system to VRAM) within
a second. So we are obviously not saturating the PCIE bandwidth. Profiling
shows that 80-90% of the cost of this eviction is in memory
allocation/deallocation for the system memory (a lot of irq locking, and
mostly the kernel spending time allocating pages; think 256 000 or more
pages per second to allocate/deallocate).
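As a rough sanity check on the page-count estimate above, here is a minimal
sketch. It assumes 4 KiB pages and reads "1GB" as 1 GiB; neither is stated
in the message, and the function name is made up for illustration.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages (the usual x86 page size); not stated in the thread. */
#define ASSUMED_PAGE_SIZE 4096ULL

/*
 * Number of system pages that must be allocated (or freed) per second to
 * back a given eviction rate in bytes per second, rounding up partial pages.
 */
uint64_t pages_per_second(uint64_t bytes_per_second)
{
	return (bytes_per_second + ASSUMED_PAGE_SIZE - 1) / ASSUMED_PAGE_SIZE;
}
```

Under those assumptions, 1 GiB per second in one direction works out to
262144 pages per second, which is in line with the "256 000 or more" figure.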

I used this WIP patch to gather statistics and play with various combinations:
http://people.freedesktop.org/~glisse/0001-TTM-EVICT-WIP.patch

Some numbers with xonotic:
17.369fps stock 3.7 kernel
27.883fps 3.7 kernel + do not preserve caching patch ~ +60%
49.292fps 3.7 kernel + WIP with 500ms residency for all pools and no bo wait
          for eviction
49.258fps 3.7 kernel + WIP with 500ms residency for all pools and bo wait
48.213fps 3.7 kernel always allowing GTT placement (basically reverts the
          gallium patch effect)

Another design I am thinking of is changing the way radeon handles its
memory: stop trying to revalidate objects into a different memory pool at
each cs, and instead keep a VRAM LRU list, probably per process, and move
bos out of VRAM according to this LRU and some heuristic. Radeon would then
only move a bo into VRAM when there is room.

Another improvement I am thinking of is to reuse the GTT memory of objects
that are moved in for objects that are evicted, as the statistics I gathered
showed that the amounts moving in and out are often close. But this would
require true DMA, as it would mean scheduling in/out moves at page
granularity or in groups of pages (write 4 pages from VRAM to 4 scratch
pages in system memory, write 4 pages of the system memory bo to those
4 VRAM pages, write the next 4 pages of VRAM to the 4 system memory pages
that were just moved, ...).
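The chunked in/out exchange described above could be sketched roughly as
follows. This is a user-space toy, not driver code: `memcpy` stands in for
the DMA engine, the 4 KiB page size is an assumption, and `swap_in_chunks`
and `out_chunk` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES  4096u                     /* assumed page size */
#define CHUNK_PAGES 4u                        /* group size used in the example above */
#define CHUNK_BYTES (PAGE_BYTES * CHUNK_PAGES)

/*
 * Exchange an outgoing VRAM buffer with an incoming system-memory buffer of
 * the same size, one chunk at a time. Only one scratch chunk of extra system
 * memory is needed: after the first step, each outgoing chunk lands in the
 * system chunk that the previous step just emptied.
 *
 * out_chunk[i] receives the address where outgoing chunk i ended up.
 */
void swap_in_chunks(unsigned char *vram, unsigned char *sys,
		    unsigned char *scratch, size_t nchunks,
		    unsigned char **out_chunk)
{
	unsigned char *free_slot = scratch;   /* currently empty system chunk */
	size_t i;

	assert(vram && sys && scratch && out_chunk);

	for (i = 0; i < nchunks; i++) {
		unsigned char *v = vram + i * CHUNK_BYTES;
		unsigned char *s = sys + i * CHUNK_BYTES;

		memcpy(free_slot, v, CHUNK_BYTES);  /* VRAM chunk -> free system chunk */
		out_chunk[i] = free_slot;
		memcpy(v, s, CHUNK_BYTES);          /* incoming chunk -> VRAM */
		free_slot = s;                      /* its old home is now free */
	}
}
```

The evicted buffer ends up scattered over the scratch chunk and the incoming
buffer's old pages, which is fine in principle for memory that is mapped page
by page; the last system chunk of the incoming buffer is left free.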

Cheers,
Jerome


* [PATCH] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-28 15:58 [RFC] drm/ttm: add minimum residency constraint for bo eviction j.glisse
@ 2012-11-28 15:58 ` j.glisse
  2012-11-28 23:18   ` Thomas Hellstrom
  2012-11-28 21:51 ` [RFC] " Marek Olšák
  1 sibling, 1 reply; 34+ messages in thread
From: j.glisse @ 2012-11-28 15:58 UTC (permalink / raw)
  To: dri-devel; +Cc: Jerome Glisse

From: Jerome Glisse <jglisse@redhat.com>

This patch adds a configurable minimum residency time for each memory
pool (VRAM, GTT, ...). The intention is to avoid so much memory eviction
from VRAM that the GPU pretty much spends all its time moving things in
and out.

Signed-off-by: Jerome Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/radeon/radeon_ttm.c | 3 +++
 drivers/gpu/drm/ttm/ttm_bo.c        | 7 +++++++
 include/drm/ttm/ttm_bo_api.h        | 1 +
 include/drm/ttm/ttm_bo_driver.h     | 1 +
 4 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 5ebe1b3..88722c4 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -129,11 +129,13 @@ static int radeon_init_mem_type(struct ttm_bo_device *bdev, uint32_t type,
 	switch (type) {
 	case TTM_PL_SYSTEM:
 		/* System memory */
+		man->minimum_residency_time_ms = 0;
 		man->flags = TTM_MEMTYPE_FLAG_MAPPABLE;
 		man->available_caching = TTM_PL_MASK_CACHING;
 		man->default_caching = TTM_PL_FLAG_CACHED;
 		break;
 	case TTM_PL_TT:
+		man->minimum_residency_time_ms = 0;
 		man->func = &ttm_bo_manager_func;
 		man->gpu_offset = rdev->mc.gtt_start;
 		man->available_caching = TTM_PL_MASK_CACHING;
@@ -156,6 +158,7 @@ static int radeon_init_mem_type(struct ttm_bo_device *bdev, uint32_t type,
 		break;
 	case TTM_PL_VRAM:
 		/* "On-card" video ram */
+		man->minimum_residency_time_ms = 500;
 		man->func = &ttm_bo_manager_func;
 		man->gpu_offset = rdev->mc.vram_start;
 		man->flags = TTM_MEMTYPE_FLAG_FIXED |
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 39dcc58..40476121 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -452,6 +452,7 @@ moved:
 		bo->cur_placement = bo->mem.placement;
 	} else
 		bo->offset = 0;
+	bo->jiffies = jiffies;
 
 	return 0;
 
@@ -810,6 +811,12 @@ retry:
 	}
 
 	bo = list_first_entry(&man->lru, struct ttm_buffer_object, lru);
+
+	if (time_after(jiffies, bo->jiffies) && jiffies_to_msecs(jiffies - bo->jiffies) >= man->minimum_residency_time_ms) {
+		spin_unlock(&glob->lru_lock);
+		return -EBUSY;
+	}
+
 	kref_get(&bo->list_kref);
 
 	if (!list_empty(&bo->ddestroy)) {
diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
index e8028ad..9e12313 100644
--- a/include/drm/ttm/ttm_bo_api.h
+++ b/include/drm/ttm/ttm_bo_api.h
@@ -275,6 +275,7 @@ struct ttm_buffer_object {
 
 	unsigned long offset;
 	uint32_t cur_placement;
+	unsigned long jiffies;
 
 	struct sg_table *sg;
 };
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index d803b92..7f60a18e6 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -280,6 +280,7 @@ struct ttm_mem_type_manager {
 	struct mutex io_reserve_mutex;
 	bool use_io_reserve_lru;
 	bool io_reserve_fastpath;
+	unsigned long minimum_residency_time_ms;
 
 	/*
 	 * Protected by @io_reserve_mutex:
-- 
1.7.11.7
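
The residency rule from the commit message can be modelled in user space as
below. This is only a sketch: the struct and function names are hypothetical
stand-ins for the fields the patch adds, and plain millisecond counters
replace jiffies. Note that the ttm_bo.c hunk as posted returns -EBUSY when
the elapsed time is at or above the minimum, which reads as the opposite of
the commit message; the sketch follows the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the timestamp the patch stores in the bo. */
struct bo_model {
	uint64_t placed_at_ms;              /* recorded on each move */
};

/* Hypothetical stand-in for the per-pool setting the patch adds. */
struct pool_model {
	uint64_t minimum_residency_time_ms; /* 0 disables the constraint */
};

/*
 * True when the buffer has been in the pool long enough to be evicted.
 * A timestamp in the future is treated as "not resident long enough".
 */
bool residency_allows_eviction(const struct pool_model *pool,
			       const struct bo_model *bo, uint64_t now_ms)
{
	assert(pool && bo);

	if (pool->minimum_residency_time_ms == 0)
		return true;
	if (now_ms < bo->placed_at_ms)
		return false;
	return now_ms - bo->placed_at_ms >= pool->minimum_residency_time_ms;
}
```

With a 500 ms minimum, a buffer placed at t=1000 ms may not be evicted at
t=1200 ms but may be at t=1500 ms; with a minimum of 0 the check always
passes, matching the "current behavior" case.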


* Re: [RFC] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-28 15:58 [RFC] drm/ttm: add minimum residency constraint for bo eviction j.glisse
  2012-11-28 15:58 ` [PATCH] " j.glisse
@ 2012-11-28 21:51 ` Marek Olšák
  2012-11-28 23:18   ` Jerome Glisse
  2012-11-29  9:18   ` Thomas Hellstrom
  1 sibling, 2 replies; 34+ messages in thread
From: Marek Olšák @ 2012-11-28 21:51 UTC (permalink / raw)
  To: j.glisse; +Cc: dri-devel

I think the problem with Radeon/TTM is much deeper. Let me demonstrate
it on the following example.

Unigine Heaven needs about 385MB of space for static resources, that's
only 75% of my 512MB card. Yet, TTM is not capable of getting all of
that into VRAM. If I allow GTT placements, I get 20 fps, which is the
old Mesa behavior. If I force VRAM placements, I get 3 fps, because we
validate buffers 10 times per frame and there's probably a lot of
buffer evictions during each validation.

In theory, we should get the best performance if Radeon/TTM managed to
get everything into VRAM. That's what fglrx probably does. 75% of VRAM
doesn't look like too much. And that's the problem. Even if we
seemingly have enough memory, the current stack is not capable of
using it efficiently.

Marek


* Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-28 15:58 ` [PATCH] " j.glisse
@ 2012-11-28 23:18   ` Thomas Hellstrom
  2012-11-28 23:24     ` Jerome Glisse
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-28 23:18 UTC (permalink / raw)
  To: j.glisse; +Cc: Jerome Glisse, dri-devel

On 11/28/2012 04:58 PM, j.glisse@gmail.com wrote:
> From: Jerome Glisse <jglisse@redhat.com>
>
> This patch add a minimum residency time configurable for each memory
> pool (VRAM, GTT, ...). Intention is to avoid having a lot of memory
> eviction from VRAM up to a point where the GPU pretty much spend all
> it's time moving things in and out.

This patch seems odd to me.

It seems the net effect is to refuse evictions from VRAM and make buffers go
somewhere else, and that makes things faster?

Why don't they go there in the first place, instead of trying to force them
into VRAM when VRAM is full?

/Thomas


* Re: [RFC] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-28 21:51 ` [RFC] " Marek Olšák
@ 2012-11-28 23:18   ` Jerome Glisse
  2012-11-29  9:18   ` Thomas Hellstrom
  1 sibling, 0 replies; 34+ messages in thread
From: Jerome Glisse @ 2012-11-28 23:18 UTC (permalink / raw)
  To: Marek Olšák; +Cc: dri-devel

On Wed, Nov 28, 2012 at 4:51 PM, Marek Olšák <maraeo@gmail.com> wrote:
> I think the problem with Radeon/TTM is much deeper. Let me demonstrate
> it on the following example.
>
> Unigine Heaven needs about 385MB of space for static resources, that's
> only 75% of my 512MB card. Yet, TTM is not capable of getting all of
> that into VRAM. If I allow GTT placements, I get 20 fps, which is the
> old Mesa behavior. If I force VRAM placements, I get 3 fps, because we
> validate buffers 10 times per frame and there's probably a lot of
> buffer evictions during each validation.
>
> In theory, we should get the best performance if Radeon/TTM managed to
> get everything into VRAM. That's what fglrx probably does. 75% of VRAM
> doesn't look like too much. And that's the problem. Even if we
> seemingly have enough memory, the current stack is not capable of
> using it efficiently.
>
> Marek
>

If you read my second-to-last paragraph, this is what I explain. Right
now it is inefficient because each cs tries to revalidate things and thus
triggers bo moves, which increases memory fragmentation at each cs. As
I explained, a much better solution is to have a true heuristic for bo
placement and to not revalidate things into different locations at each
cs. I was working on something like that, but the minimum residency
time fixes most of the regression and is a much simpler and smaller
patch, so I consider it a temporary fix. I also believe that it makes
sense in itself, by putting some bound on buffer move frequency.

Cheers,
Jerome

* Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-28 23:18   ` Thomas Hellstrom
@ 2012-11-28 23:24     ` Jerome Glisse
  2012-11-28 23:44       ` Alan Swanson
  2012-11-29  8:41       ` [PATCH] drm/ttm: add minimum residency constraint for bo eviction Thomas Hellstrom
  0 siblings, 2 replies; 34+ messages in thread
From: Jerome Glisse @ 2012-11-28 23:24 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Jerome Glisse, dri-devel

On Wed, Nov 28, 2012 at 6:18 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> On 11/28/2012 04:58 PM, j.glisse@gmail.com wrote:
>>
>> From: Jerome Glisse <jglisse@redhat.com>
>>
>> This patch adds a configurable minimum residency time for each memory
>> pool (VRAM, GTT, ...). The intention is to avoid so much memory eviction
>> from VRAM that the GPU pretty much spends all its time moving things in
>> and out.
>
>
> This patch seems odd to me.
>
> It seems the net effect is to refuse evictions from VRAM and make buffers go
> somewhere else, and that makes things faster?
>
> Why don't they go there in the first place instead of trying to force them
> into VRAM,
> when VRAM is full?
>
> /Thomas

It's mostly a side effect of cs and of validating with each cs. If boA is
in cs1 but not in cs2, and boB is in cs2 but not in cs1, then boA could
be evicted by cs2 and boB moved in. If the next cs, i.e. cs3, is like cs1,
then boA moves back in again and boB is evicted. Then you get cs4, which
references boB but not boA, so boA gets evicted and boB moves in ... So ttm
just spends its time doing evictions, but it does so because the driver
asks it to. Note that what is costly there is not the bo move itself but
the page allocation.
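
The ping-pong between boA and boB can be illustrated with a toy model. It
assumes a VRAM that holds exactly one buffer and command streams that
strictly alternate between the two buffers; `count_vram_moves` is a
hypothetical name, and nothing here is taken from the radeon or ttm code.

```c
#include <stdint.h>

/*
 * Count buffer moves into a one-slot "VRAM" when command streams alternate
 * between two buffers (0 = boA, 1 = boB), with one cs every cs_interval_ms.
 * A resident buffer younger than min_residency_ms is not evicted; that cs
 * is then assumed to use its buffer from another pool (e.g. GTT) instead.
 */
unsigned count_vram_moves(unsigned ncs, uint64_t cs_interval_ms,
			  uint64_t min_residency_ms)
{
	int resident = -1;            /* -1: VRAM slot is empty */
	uint64_t placed_at_ms = 0;
	unsigned moves = 0;
	unsigned i;

	for (i = 0; i < ncs; i++) {
		uint64_t now_ms = (uint64_t)i * cs_interval_ms;
		int wanted = (int)(i % 2);    /* cs1 -> boA, cs2 -> boB, ... */

		if (resident == wanted)
			continue;             /* already in VRAM */
		if (resident != -1 && now_ms - placed_at_ms < min_residency_ms)
			continue;             /* eviction refused, no move */
		resident = wanted;
		placed_at_ms = now_ms;
		moves++;
	}
	return moves;
}
```

In this model, 100 command streams spaced 10 ms apart cause 100 moves with
no residency constraint and 2 moves with a 500 ms minimum.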

I propose this patch to put a bound on bo eviction frequency; I thought
it might help other drivers. If you set the residency time to 0 you get
the current behavior; if you don't, you enforce a minimum residency time,
which helps drivers like radeon. Of course a proper fix for bo eviction
on radeon has to be in radeon code and is mostly an overhaul of how we
validate bos.

But I still believe that this patch has value in itself by allowing
drivers to put a bound on buffer movement frequency.

Cheers,
Jerome


* Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-28 23:24     ` Jerome Glisse
@ 2012-11-28 23:44       ` Alan Swanson
  2012-11-29  0:01         ` Jerome Glisse
  2012-11-29  2:15         ` Marek Olšák
  2012-11-29  8:41       ` [PATCH] drm/ttm: add minimum residency constraint for bo eviction Thomas Hellstrom
  1 sibling, 2 replies; 34+ messages in thread
From: Alan Swanson @ 2012-11-28 23:44 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Jerome Glisse, dri-devel

So, a variation on John Carmack's recommendation from 2000 to use MRU,
not LRU, to avoid texture thrashing.

  Mar 07, 2000 - Virtualized video card local memory is The Right Thing.
  http://floodyberry.com/carmack/johnc_plan_2000.html

In fact, this was last discussed in 2005 with a patch for a 1 second
stale texture eviction, and I (still) wonder why such a method was never
implemented, since it was a clear problem.

  http://thread.gmane.org/gmane.comp.video.dri.devel/17274/focus=17305
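
To make the MRU-versus-LRU point concrete, here is a small toy model, not
taken from either linked discussion. It assumes every frame touches the same
textures in the same order and that one texture too many exists for the
cache; `count_misses` and `MAX_SLOTS` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SLOTS 16   /* upper bound on cache size for this toy model */

/*
 * Count misses when `nitems` textures are touched in the same order every
 * frame and only `capacity` of them fit. On a miss with a full cache the
 * victim is the least recently used slot (LRU) or the most recently used
 * one (MRU).
 */
unsigned count_misses(unsigned capacity, unsigned nitems, unsigned frames,
		      bool use_mru)
{
	int item[MAX_SLOTS];
	unsigned long last_used[MAX_SLOTS];
	unsigned used = 0, misses = 0;
	unsigned long tick = 0;
	unsigned f, t, s;

	assert(capacity > 0 && capacity <= MAX_SLOTS);

	for (f = 0; f < frames; f++) {
		for (t = 0; t < nitems; t++) {
			unsigned victim = 0;
			bool hit = false;

			tick++;
			for (s = 0; s < used; s++) {
				if (item[s] == (int)t) {
					last_used[s] = tick;
					hit = true;
					break;
				}
			}
			if (hit)
				continue;
			misses++;
			if (used < capacity) {
				victim = used++;      /* still room: take a free slot */
			} else {
				for (s = 1; s < used; s++) {
					bool older = last_used[s] < last_used[victim];

					if (use_mru ? !older : older)
						victim = s;
				}
			}
			item[victim] = (int)t;
			last_used[victim] = tick;
		}
	}
	return misses;
}
```

With 4 textures, room for 3, and 10 frames, LRU misses on every one of the
40 accesses, while MRU misses far less often in this model.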

-- 
Alan.


* Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-28 23:44       ` Alan Swanson
@ 2012-11-29  0:01         ` Jerome Glisse
  2012-11-29  2:15         ` Marek Olšák
  1 sibling, 0 replies; 34+ messages in thread
From: Jerome Glisse @ 2012-11-29  0:01 UTC (permalink / raw)
  To: Alan Swanson; +Cc: Jerome Glisse, dri-devel

Yes, such a heuristic might be a good idea; I am working on a prototype,
which mostly requires a bit of infrastructure.

Cheers,
Jerome


* Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-28 23:44       ` Alan Swanson
  2012-11-29  0:01         ` Jerome Glisse
@ 2012-11-29  2:15         ` Marek Olšák
  2012-11-29  8:04           ` Thomas Hellstrom
  1 sibling, 1 reply; 34+ messages in thread
From: Marek Olšák @ 2012-11-29  2:15 UTC (permalink / raw)
  To: Alan Swanson; +Cc: Jerome Glisse, dri-devel

On Thu, Nov 29, 2012 at 12:44 AM, Alan Swanson <swanson@ukfsn.org> wrote:
> So, a variation on John Carmack's recommendation from 2000 to use MRU,
> not LRU, to avoid texture thrashing.
>
>   Mar 07, 2000 - Virtualized video card local memory is The Right Thing.
>   http://floodyberry.com/carmack/johnc_plan_2000.html
>
> In fact, this was last discussed in 2005 with a patch for a 1 second
> stale texture eviction, and I (still) wonder why such a method was never
> implemented, since it was a clear problem.

BTW we can send end-of-frame markers to the kernel, which could be
used to implement Carmack's algorithm.

Marek


* Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-29  2:15         ` Marek Olšák
@ 2012-11-29  8:04           ` Thomas Hellstrom
  2012-11-29 12:52             ` Marek Olšák
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-29  8:04 UTC (permalink / raw)
  To: Marek Olšák; +Cc: Jerome Glisse, dri-devel

On 11/29/2012 03:15 AM, Marek Olšák wrote:
> On Thu, Nov 29, 2012 at 12:44 AM, Alan Swanson <swanson@ukfsn.org> wrote:
>> So, a variation on John Carmack's recommendation from 2000 to use MRU,
>> not LRU, to avoid texture thrashing.
>>
>>    Mar 07, 2000 - Virtualized video card local memory is The Right Thing.
>>    http://floodyberry.com/carmack/johnc_plan_2000.html
>>
>> In fact, this was last discussed in 2005 with a patch for a 1 second
>> stale texture eviction, and I (still) wonder why such a method was never
>> implemented, since it was a clear problem.
> BTW we can send end-of-frame markers to the kernel, which could be
> used to implement Carmack's algorithm.
>
> Marek

It seems to me like Carmack's algorithm is quite specific to the case
where only a single GL client is running?

It also seems like it's designed around the fact that when eviction takes
place, all buffer objects will be idle. With a reasonably filled graphics
fifo / ring, blindly using MRU will cause the GPU to run synchronized.


/Thomas

* Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-28 23:24     ` Jerome Glisse
  2012-11-28 23:44       ` Alan Swanson
@ 2012-11-29  8:41       ` Thomas Hellstrom
  2012-11-29 15:50         ` Jerome Glisse
  1 sibling, 1 reply; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-29  8:41 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Jerome Glisse, dri-devel

On 11/29/2012 12:24 AM, Jerome Glisse wrote:
> On Wed, Nov 28, 2012 at 6:18 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
>> On 11/28/2012 04:58 PM, j.glisse@gmail.com wrote:
>>> From: Jerome Glisse <jglisse@redhat.com>
>>>
>>> This patch add a minimum residency time configurable for each memory
>>> pool (VRAM, GTT, ...). Intention is to avoid having a lot of memory
>>> eviction from VRAM up to a point where the GPU pretty much spend all
>>> it's time moving things in and out.
>>
>> This patch seems odd to me.
>>
>> It seems the net effect is to refuse evictions from VRAM and make buffers go
>> somewhere else, and that makes things faster?
>>
>> Why don't they go there in the first place instead of trying to force them
>> into VRAM,
>> when VRAM is full?
>>
>> /Thomas
> It's mostly a side effect of cs and validating with each cs, if boA is
> in cs1 and not in cs2 and boB is in cs1 but not in cs2 than boA could
> be evicted by cs2 and boB moved in, if next cs ie cs3 is like cs1 then
> boA move back again and boB is evicted, then you get cs4 which
> reference boB but not boA, boA get evicted and boB move in ... So ttm
> just spend its time doing eviction but he doing so because it's ask by
> the driver to do so. Note that what is costly there is not the bo move
> in itself but the page allocation.

Yes, this is the cause of the thrashing, but that was not what I asked.

What your patch is doing is looking at the least recently used bo, to
check if it has been resident for at least 500ms. Otherwise it refuses
eviction for *all* buffers of that memory type.

This means new buffers can't fit in VRAM, they need to go somewhere 
else. Perhaps TT?

So my question was: if VRAM is full, instead of starting to evict, why
not put new buffers in TT, so that

placement(GEM_DOMAIN_VRAM)      = VRAM | TT  // Prefer VRAM, but allow TT before starting to evict.
busy_placement(GEM_DOMAIN_VRAM) = TT | VRAM  // *If* we need to evict, prefer evicting TT, then evict VRAM.

This will more or less mimic Carmack's algorithm by using TT as his "MRU
scratch space".
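
The two-list idea (a preferred order that is tried without eviction, and a
busy order that is used only once eviction is unavoidable) can be sketched
as a toy model. This is not TTM code: Pool and place() are hypothetical
names, sizes are abstract units, and evicted buffers simply disappear
instead of being migrated.

```python
from collections import OrderedDict


class Pool:
    """A memory pool that keeps its buffers in LRU order (oldest first)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.buffers = OrderedDict()  # buffer name -> size

    def free(self):
        return self.capacity - sum(self.buffers.values())


def place(name, size, pools, placement, busy_placement):
    """Place a new buffer; return (pool name, list of evicted buffer names)."""
    # First pass: take the first preferred pool that has room, no eviction.
    for pool_name in placement:
        pool = pools[pool_name]
        if pool.free() >= size:
            pool.buffers[name] = size
            return pool_name, []
    # Second pass: all preferred pools are full, so evict in busy order.
    for pool_name in busy_placement:
        pool = pools[pool_name]
        if pool.capacity < size:
            continue  # could never fit here, even when empty
        evicted = []
        while pool.free() < size:
            victim, _ = pool.buffers.popitem(last=False)  # oldest entry
            evicted.append(victim)
        pool.buffers[name] = size
        return pool_name, evicted
    raise MemoryError("buffer does not fit in any allowed pool")


pools = {"VRAM": Pool("VRAM", 4), "TT": Pool("TT", 2)}
prefer, busy = ["VRAM", "TT"], ["TT", "VRAM"]
first = place("a", 4, pools, prefer, busy)   # fills VRAM
second = place("b", 2, pools, prefer, busy)  # VRAM is full, spills to TT
third = place("c", 2, pools, prefer, busy)   # both full, TT is evicted first
```

In this model the large VRAM resident "a" is never touched; only the TT
buffer "b" is displaced when "c" arrives.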

And as a side note, your patch breaks
ttm_bo_force_list_clean()
which should be used at GPU memory exhaustion to avoid OOM due to 
fragmentation and for those drivers that
implement VRAM cleanup on VT switch and / or suspend / hibernation.

/Thomas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-28 21:51 ` [RFC] " Marek Olšák
  2012-11-28 23:18   ` Jerome Glisse
@ 2012-11-29  9:18   ` Thomas Hellstrom
  2012-11-29  9:28     ` Michel Dänzer
  2012-11-29 19:20     ` Marek Olšák
  1 sibling, 2 replies; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-29  9:18 UTC (permalink / raw)
  To: Marek Olšák; +Cc: dri-devel

On 11/28/2012 10:51 PM, Marek Olšák wrote:
> I think the problem with Radeon/TTM is much deeper. Let me demonstrate
> it on the following example.
>
> Unigine Heaven needs about 385MB of space for static resources, that's
> only 75% of my 512MB card. Yet, TTM is not capable of getting all of
> that into VRAM. If I allow GTT placements, I get 20 fps, which is the
> old Mesa behavior. If I force VRAM placements, I get 3 fps, because we
> validate buffers 10 times per frame and there's probably a lot of
> buffer evictions during each validation.
>

Marek,
Did you look at the total amount of referenced buffers in the ring 
including vertex buffers?

Depending on how hard you throttle, I guess vertex / index buffer data 
referenced by the
ring commands may well exceed the VRAM limitation.

/Thomas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-29  9:18   ` Thomas Hellstrom
@ 2012-11-29  9:28     ` Michel Dänzer
  2012-11-29  9:48       ` Thomas Hellstrom
  2012-11-29 19:20     ` Marek Olšák
  1 sibling, 1 reply; 34+ messages in thread
From: Michel Dänzer @ 2012-11-29  9:28 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

On Thu, 2012-11-29 at 10:18 +0100, Thomas Hellstrom wrote:
> On 11/28/2012 10:51 PM, Marek Olšák wrote:
> > I think the problem with Radeon/TTM is much deeper. Let me demonstrate
> > it on the following example.
> >
> > Unigine Heaven needs about 385MB of space for static resources, that's
> > only 75% of my 512MB card. Yet, TTM is not capable of getting all of
> > that into VRAM. If I allow GTT placements, I get 20 fps, which is the
> > old Mesa behavior. If I force VRAM placements, I get 3 fps, because we
> > validate buffers 10 times per frame and there's probably a lot of
> > buffer evictions during each validation.
> >
> 
> Marek,
> Did you look at the total amount of referenced buffers in the ring 
> including vertex buffers?
> 
> Depending on how hard you throttle, I guess vertex / index buffer data 
> referenced by the
> ring commands may well exceed the VRAM limitation.

I think another reason 100% is not possible is fragmentation. Has anyone
ever thought about defragmentation?


-- 
Earthling Michel Dänzer           |                   http://www.amd.com
Libre software enthusiast         |          Debian, X and DRI developer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-29  9:28     ` Michel Dänzer
@ 2012-11-29  9:48       ` Thomas Hellstrom
  0 siblings, 0 replies; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-29  9:48 UTC (permalink / raw)
  To: Michel Dänzer; +Cc: dri-devel

On 11/29/2012 10:28 AM, Michel Dänzer wrote:
> On Thu, 2012-11-29 at 10:18 +0100, Thomas Hellstrom wrote:
>> On 11/28/2012 10:51 PM, Marek Olšák wrote:
>>> I think the problem with Radeon/TTM is much deeper. Let me demonstrate
>>> it on the following example.
>>>
>>> Unigine Heaven needs about 385MB of space for static resources, that's
>>> only 75% of my 512MB card. Yet, TTM is not capable of getting all of
>>> that into VRAM. If I allow GTT placements, I get 20 fps, which is the
>>> old Mesa behavior. If I force VRAM placements, I get 3 fps, because we
>>> validate buffers 10 times per frame and there's probably a lot of
>>> buffer evictions during each validation.
>>>
>> Marek,
>> Did you look at the total amount of referenced buffers in the ring
>> including vertex buffers?
>>
>> Depending on how hard you throttle, I guess vertex / index buffer data
>> referenced by the
>> ring commands may well exceed the VRAM limitation.
> I think another reason 100% is not possible is fragmentation. Has anyone
> ever thought about defragmentation?
>
>
TTM doesn't support efficient defragmentation (yet :)) The only 
reasonable situation to defragment is when we've hit an OOM during 
buffer validation.

The execbuf code could then back off completely, shut other concurrent 
buffer validators out and call ttm_bo_evict_mm() to evict all buffers in the
failing memory type(s), and then retry validation.

This is of course very costly, so I guess it should only be used to
avoid OOMs.
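
The back-off-and-retry idea can be illustrated with a toy first-fit
allocator. This is only a sketch: Range and validate_with_fallback() are
hypothetical names, evict_all() stands in for what ttm_bo_evict_mm() would
do, and the locking needed to shut out concurrent validators is not modeled.

```python
class Range:
    """First-fit allocator over a linear address space of `size` units."""

    def __init__(self, size):
        self.size = size
        self.allocs = {}  # name -> (start, length)

    def _find_hole(self, length):
        cursor = 0
        for start, used in sorted(self.allocs.values()):
            if start - cursor >= length:
                return cursor
            cursor = start + used
        return cursor if self.size - cursor >= length else None

    def alloc(self, name, length):
        start = self._find_hole(length)
        if start is None:
            return False
        self.allocs[name] = (start, length)
        return True

    def free(self, name):
        del self.allocs[name]

    def evict_all(self):
        evicted = list(self.allocs)
        self.allocs.clear()
        return evicted


def validate_with_fallback(vram, name, length):
    """Try a normal allocation; on failure evict everything and retry once."""
    if vram.alloc(name, length):
        return "fast path"
    vram.evict_all()
    if vram.alloc(name, length):
        return "after evict-all"
    return "out of memory"


vram = Range(8)
for bo in ("a", "b", "c", "d"):
    vram.alloc(bo, 2)
vram.free("a")
vram.free("c")  # 4 units are free, but only as two separate 2-unit holes
outcome = validate_with_fallback(vram, "e", 4)
```

Here the 4-unit request fails on the fast path even though 4 units are
free in total, and succeeds only after everything has been evicted. The
evicted buffers would have to be validated again later, which is where the
cost comes from.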

/Thomas




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-29  8:04           ` Thomas Hellstrom
@ 2012-11-29 12:52             ` Marek Olšák
  2012-11-29 20:33               ` Thomas Hellstrom
  0 siblings, 1 reply; 34+ messages in thread
From: Marek Olšák @ 2012-11-29 12:52 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Jerome Glisse, dri-devel

On Thu, Nov 29, 2012 at 9:04 AM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> On 11/29/2012 03:15 AM, Marek Olšák wrote:
>>
>> On Thu, Nov 29, 2012 at 12:44 AM, Alan Swanson <swanson@ukfsn.org> wrote:
>>>
>>> On Wed, 2012-11-28 at 18:24 -0500, Jerome Glisse wrote:
>>>>
>>>> On Wed, Nov 28, 2012 at 6:18 PM, Thomas Hellstrom <thomas@shipmail.org>
>>>> wrote:
>>>>>
>>>>> On 11/28/2012 04:58 PM, j.glisse@gmail.com wrote:
>>>>>>
>>>>>> From: Jerome Glisse <jglisse@redhat.com>
>>>>>>
>>>>>> This patch add a minimum residency time configurable for each memory
>>>>>> pool (VRAM, GTT, ...). Intention is to avoid having a lot of memory
>>>>>> eviction from VRAM up to a point where the GPU pretty much spend all
>>>>>> it's time moving things in and out.
>>>>>
>>>>>
>>>>> This patch seems odd to me.
>>>>>
>>>>> It seems the net effect is to refuse evictions from VRAM and make
>>>>> buffers go
>>>>> somewhere else, and that makes things faster?
>>>>>
>>>>> Why don't they go there in the first place instead of trying to force
>>>>> them
>>>>> into VRAM,
>>>>> when VRAM is full?
>>>>>
>>>>> /Thomas
>>>>
>>>> It's mostly a side effect of cs and validating with each cs, if boA is
>>>> in cs1 and not in cs2 and boB is in cs1 but not in cs2 than boA could
>>>> be evicted by cs2 and boB moved in, if next cs ie cs3 is like cs1 then
>>>> boA move back again and boB is evicted, then you get cs4 which
>>>> reference boB but not boA, boA get evicted and boB move in ... So ttm
>>>> just spend its time doing eviction but he doing so because it's ask by
>>>> the driver to do so. Note that what is costly there is not the bo move
>>>> in itself but the page allocation.
>>>>
>>>> I propose this patch to put a boundary on bo eviction frequency, i
>>>> thought it might help other driver, if you set the residency time to 0
>>>> you get the current behavior, if you don't you enforce a minimum
>>>> residency time which helps driver like radeon. Of course a proper fix
>>>> to the bo eviction for radeon has to be in radeon code and is mostly
>>>> an overhaul of how we validate bo.
>>>>
>>>> But i still believe that this patch has value in itself by allowing
>>>> driver to put a boundary on buffer movement frequency.
>>>>
>>>> Cheers,
>>>> Jerome
>>>
>>> So, a variation on John Carmack's recommendation from 2000 to use MRU,
>>> not LRU, to avoid texture trashing.
>>>
>>>    Mar 07, 2000 - Virtualized video card local memory is The Right Thing.
>>>    http://floodyberry.com/carmack/johnc_plan_2000.html
>>>
>>> In fact, this was last discussed in 2005 with a patch for a 1 second
>>> stale texture eviction and I (still) wondered why a method it was never
>>> implemented since it was an clear problem.
>>
>> BTW we can send end-of-frame markers to the kernel, which could be
>> used to implement Carmack's algorithm.
>>
>> Marek
>
>
> It seems to me like Carmack's algorithm is quite specific to the case where
> only a single GL client is running?

In theory, we could send context IDs to the kernel as well and modify
the conditional to "If the LRU texture was not needed in the previous
frame of any context".


>
> It also seems like it's designed around the fact that when eviction takes
> place, all buffer objects will be idle. With a
> reasonably filled graphics fifo / ring, blindly using MRU will cause the GPU
> to run synchronized.

I don't see why you would need to synchronize. If the GPU takes care
of moving buffers in and out of VRAM and there's only one ring buffer
==> no synchronization is required.

Marek

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-29  8:41       ` [PATCH] drm/ttm: add minimum residency constraint for bo eviction Thomas Hellstrom
@ 2012-11-29 15:50         ` Jerome Glisse
  0 siblings, 0 replies; 34+ messages in thread
From: Jerome Glisse @ 2012-11-29 15:50 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Jerome Glisse, dri-devel

On Thu, Nov 29, 2012 at 09:41:34AM +0100, Thomas Hellstrom wrote:
> On 11/29/2012 12:24 AM, Jerome Glisse wrote:
> >On Wed, Nov 28, 2012 at 6:18 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> >>On 11/28/2012 04:58 PM, j.glisse@gmail.com wrote:
> >>>From: Jerome Glisse <jglisse@redhat.com>
> >>>
> >>>This patch add a minimum residency time configurable for each memory
> >>>pool (VRAM, GTT, ...). Intention is to avoid having a lot of memory
> >>>eviction from VRAM up to a point where the GPU pretty much spend all
> >>>it's time moving things in and out.
> >>
> >>This patch seems odd to me.
> >>
> >>It seems the net effect is to refuse evictions from VRAM and make buffers go
> >>somewhere else, and that makes things faster?
> >>
> >>Why don't they go there in the first place instead of trying to force them
> >>into VRAM,
> >>when VRAM is full?
> >>
> >>/Thomas
> >It's mostly a side effect of cs and validating with each cs, if boA is
> >in cs1 and not in cs2 and boB is in cs1 but not in cs2 than boA could
> >be evicted by cs2 and boB moved in, if next cs ie cs3 is like cs1 then
> >boA move back again and boB is evicted, then you get cs4 which
> >reference boB but not boA, boA get evicted and boB move in ... So ttm
> >just spend its time doing eviction but he doing so because it's ask by
> >the driver to do so. Note that what is costly there is not the bo move
> >in itself but the page allocation.
> 
> Yes, this is the cause of the trashing, but that was not what I asked.
> 
> What your patch is doing is looking at the last recently used bo, to
> check if it has been
> resident for at least 500ms. Otherwise it refuses eviction for *all*
> buffers of that memory type.
> 
> This means new buffers can't fit in VRAM, they need to go somewhere
> else. Perhaps TT?
> 
> So my question was. If VRAM is full, instead of starting to evict,
> why not put new buffers in TT, so that
> 
> placement(GEM_DOMAIN_VRAM) = VRAM | TT  // Prefer VRAM but allow TT
> before starting to evict.
> busy_placement(GEM_DOMAIN_VRAM) = TT | VRAM // *If* we need to
> evict, prefer evicting TT, then evict VRAM)
> 
> This will more or less mimic carmack's algorithm by using TT as his
> "MRU scratch space".

Well, not exactly; the ping-pong between VRAM and TT is still very likely
to happen. If we always OR in the GTT placement, then some buffers that
need to be in VRAM never get there, because some older buffer that
hasn't been used in ages is still present in VRAM. The only way to force
those buffers to be evicted is to ask for VRAM only (in the case I tested
I never filled up VRAM + GTT, so there was always room in one of them).

So this way you still allow eviction of old VRAM buffers, and avoid
too much ping-pong with heavily used buffers being moved in and out.
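
The effect of a minimum residency time on the boA/boB ping-pong can be
shown with a small simulation. It is a toy model under loud assumptions:
VRAM fits exactly one buffer, command streams alternate strictly between
boA and boB, and the 10 ms CS interval is invented for illustration. Only
the 500 ms residency figure comes from this thread.

```python
def simulate(min_residency_ms, cs_interval_ms=10, n_cs=100):
    """Count moves into VRAM when CSes alternate between boA and boB."""
    resident = None   # buffer currently in VRAM
    moved_in_at = 0   # time at which that buffer was moved in
    moves = 0
    for i in range(n_cs):
        now = i * cs_interval_ms
        wanted = "boA" if i % 2 == 0 else "boB"
        if resident == wanted:
            continue
        if resident is not None and now - moved_in_at < min_residency_ms:
            continue  # eviction refused; the wanted buffer is used from GTT
        resident, moved_in_at = wanted, now
        moves += 1
    return moves


print(simulate(0), simulate(500))  # prints "100 2"
```

With a residency time of 0 the model reproduces the current behavior, one
move per CS. With 500 ms the resident buffer is swapped only once during
the simulated second.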

Anyway, I did this as a quick hack, thinking it might interest
others. The proper solution, as I said, lies in not validating buffers to
a different placement at each cs, and in using some worker thread to evict
things from VRAM and do things like compaction to minimize
fragmentation.

> And as a side note, your patch breaks
> ttm_bo_force_list_clean()
> which should be used at GPU memory exhaustion to avoid OOM due to
> fragmentation and for those drivers that
> implement VRAM cleanup on VT switch and / or suspend / hibernation.
> 
> /Thomas

Yeah, I did not pay much attention to the whole cleanup phase; I was
only interested in a dirty prototype.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-29  9:18   ` Thomas Hellstrom
  2012-11-29  9:28     ` Michel Dänzer
@ 2012-11-29 19:20     ` Marek Olšák
  2012-11-29 19:36       ` Jerome Glisse
  2012-11-29 20:40       ` Thomas Hellstrom
  1 sibling, 2 replies; 34+ messages in thread
From: Marek Olšák @ 2012-11-29 19:20 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

On Thu, Nov 29, 2012 at 10:18 AM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> On 11/28/2012 10:51 PM, Marek Olšák wrote:
>>
>> I think the problem with Radeon/TTM is much deeper. Let me demonstrate
>> it on the following example.
>>
>> Unigine Heaven needs about 385MB of space for static resources, that's
>> only 75% of my 512MB card. Yet, TTM is not capable of getting all of
>> that into VRAM. If I allow GTT placements, I get 20 fps, which is the
>> old Mesa behavior. If I force VRAM placements, I get 3 fps, because we
>> validate buffers 10 times per frame and there's probably a lot of
>> buffer evictions during each validation.
>>
>
> Marek,
> Did you look at the total amount of referenced buffers in the ring including
> vertex buffers?
>
> Depending on how hard you throttle, I guess vertex / index buffer data
> referenced by the
> ring commands may well exceed the VRAM limitation.

Buffers (not textures) take only 30 MB. These are stats for 1 frame of
Unigine Heaven. Each line is a CS ioctl.

VRAM [used in CS] / [total allocated], GTT [used in CS] / [total allocated]
1. VRAM: 171 / 390 MB, GTT:   1 /   5 MB
2. VRAM: 144 / 390 MB, GTT:   2 /   5 MB
3. VRAM: 184 / 390 MB, GTT:   1 /   5 MB
4. VRAM:  35 / 390 MB, GTT:   2 /   5 MB
5. VRAM: 119 / 390 MB, GTT:   1 /   5 MB
6. VRAM: 207 / 390 MB, GTT:   1 /   5 MB
7. VRAM:  65 / 390 MB, GTT:   2 /   5 MB

If I move all buffers (vertex, index, constant, streamout, queries,
shader code, etc.) to GTT, this is what one frame looks like (not the
same one though, but it's close):

1. VRAM: 144 / 359 MB, GTT:  16 /  35 MB
2. VRAM:  95 / 359 MB, GTT:  12 /  35 MB
3. VRAM: 178 / 359 MB, GTT:  15 /  35 MB
4. VRAM:  55 / 359 MB, GTT:  13 /  35 MB
5. VRAM:  22 / 359 MB, GTT:  16 /  35 MB
6. VRAM: 163 / 359 MB, GTT:  16 /  35 MB
7. VRAM: 133 / 359 MB, GTT:  11 /  35 MB
8. VRAM:  66 / 359 MB, GTT:   4 /  35 MB

The stats are generated in the Mesa driver based on the driver's
expectations of where buffers should be placed.

I can easily see how VRAM is thrashed with the strict LRU approach.
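
A minimal sketch of why strict LRU thrashes when the per-frame working set
is slightly larger than VRAM. It compares strict LRU with a pure MRU
policy, which is a simplification and not necessarily the exact scheme from
Carmack's .plan. count_uploads() is a hypothetical helper and all textures
have unit size.

```python
def count_uploads(policy, n_textures, capacity, n_frames):
    """Count uploads when every frame touches textures 0..n-1 in order."""
    resident = []  # least recently used first, most recently used last
    uploads = 0
    for _ in range(n_frames):
        for tex in range(n_textures):
            if tex in resident:
                resident.remove(tex)
            else:
                uploads += 1
                if len(resident) == capacity:
                    # LRU evicts the oldest entry, MRU the newest one.
                    resident.pop(0 if policy == "lru" else -1)
            resident.append(tex)
    return uploads


lru = count_uploads("lru", n_textures=5, capacity=4, n_frames=10)
mru = count_uploads("mru", n_textures=5, capacity=4, n_frames=10)
print(lru, mru)  # prints "50 16"
```

With five textures and room for four, LRU always evicts exactly the texture
that is needed next, so every access becomes an upload. MRU sacrifices one
slot and keeps most of the rest resident.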

Also, is it possible that one buffer is moved twice for a single CS
ioctl? Imagine there's a buffer at the end of the relocation list,
which is also at the head of the LRU list. Some buffer in the middle
causes eviction of the last buffer. When the last buffer is validated,
it's moved back to VRAM. Can it happen?

Marek

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-29 19:20     ` Marek Olšák
@ 2012-11-29 19:36       ` Jerome Glisse
  2012-11-29 20:40       ` Thomas Hellstrom
  1 sibling, 0 replies; 34+ messages in thread
From: Jerome Glisse @ 2012-11-29 19:36 UTC (permalink / raw)
  To: Marek Olšák; +Cc: dri-devel

On Thu, Nov 29, 2012 at 08:20:17PM +0100, Marek Olšák wrote:
> On Thu, Nov 29, 2012 at 10:18 AM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> > On 11/28/2012 10:51 PM, Marek Olšák wrote:
> >>
> >> I think the problem with Radeon/TTM is much deeper. Let me demonstrate
> >> it on the following example.
> >>
> >> Unigine Heaven needs about 385MB of space for static resources, that's
> >> only 75% of my 512MB card. Yet, TTM is not capable of getting all of
> >> that into VRAM. If I allow GTT placements, I get 20 fps, which is the
> >> old Mesa behavior. If I force VRAM placements, I get 3 fps, because we
> >> validate buffers 10 times per frame and there's probably a lot of
> >> buffer evictions during each validation.
> >>
> >
> > Marek,
> > Did you look at the total amount of referenced buffers in the ring including
> > vertex buffers?
> >
> > Depending on how hard you throttle, I guess vertex / index buffer data
> > referenced by the
> > ring commands may well exceed the VRAM limitation.
> 
> Buffers (not textures) take only 30 MB. These are stats for 1 frame of
> Unigine Heaven. Each line is a CS ioctl.
> 
> VRAM [used in CS] / [total allocated], GTT [used in CS] / [total allocated]
> 1. VRAM: 171 / 390 MB, GTT:   1 /   5 MB
> 2. VRAM: 144 / 390 MB, GTT:   2 /   5 MB
> 3. VRAM: 184 / 390 MB, GTT:   1 /   5 MB
> 4. VRAM:  35 / 390 MB, GTT:   2 /   5 MB
> 5. VRAM: 119 / 390 MB, GTT:   1 /   5 MB
> 6. VRAM: 207 / 390 MB, GTT:   1 /   5 MB
> 7. VRAM:  65 / 390 MB, GTT:   2 /   5 MB
> 
> If I move all buffers (vertex, index, constant, streamout, queries,
> shader code, etc.) to GTT, this is how one frame looks like (not the
> same one though, but it's close):
> 
> 1. VRAM: 144 / 359 MB, GTT:  16 /  35 MB
> 2. VRAM:  95 / 359 MB, GTT:  12 /  35 MB
> 3. VRAM: 178 / 359 MB, GTT:  15 /  35 MB
> 4. VRAM:  55 / 359 MB, GTT:  13 /  35 MB
> 5. VRAM:  22 / 359 MB, GTT:  16 /  35 MB
> 6. VRAM: 163 / 359 MB, GTT:  16 /  35 MB
> 7. VRAM: 133 / 359 MB, GTT:  11 /  35 MB
> 8. VRAM:  66 / 359 MB, GTT:   4 /  35 MB
> 
> The stats are generated in the Mesa driver based on the driver's
> expectations where buffers should be placed.
> 
> I can easily see how VRAM is thrashed with the strict LRU approach.
> 
> Also, is it possible that one buffer is moved twice for a single CS
> ioctl? Imagine there's a buffer at the end of the relocation list,
> which is also at the head of the LRU list. Some buffer in the middle
> causes eviction of the last buffer. When the last buffer is validated,
> it's moved back to VRAM. Can it happen?
> 
> Marek

No, it can't happen, for two reasons. First and foremost, reserving a bo
removes it from the LRU list, and all bos of a cs are reserved in one shot.
Second, the radeon cs ioctl detects duplicate bos and only does the work
once.
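
The two reasons can be sketched with a toy model of one CS. This is not
the radeon or TTM code path: run_cs() is a hypothetical helper, buffers
have unit size, and the LRU list only tracks unreserved buffers that are
resident in VRAM.

```python
def run_cs(relocs, lru, vram, capacity):
    """Validate one CS; return the list of buffers evicted from VRAM."""
    # Drop duplicate relocation entries while keeping first-seen order.
    bos = list(dict.fromkeys(relocs))
    # Reserve everything up front: reserved buffers leave the LRU list.
    for bo in bos:
        if bo in lru:
            lru.remove(bo)
    evicted = []
    for bo in bos:
        if bo in vram:
            continue
        while len(vram) >= capacity:
            if not lru:
                raise MemoryError("CS working set exceeds VRAM")
            victim = lru.pop(0)  # only unreserved buffers are candidates
            vram.discard(victim)
            evicted.append(victim)
        vram.add(bo)
    # Unreserve: the buffers return to the LRU tail as most recently used.
    lru.extend(bos)
    return evicted


# "last" sits at the head of the LRU list and at the end of the relocation
# list, and "mid" needs room in a full VRAM.
vram = {"x", "last"}
lru = ["last", "x"]
evicted = run_cs(["mid", "last", "last"], lru, vram, capacity=2)
```

Because "last" is reserved before any validation starts, the eviction
triggered by "mid" falls on "x" instead, and "last" is never moved.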

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-29 12:52             ` Marek Olšák
@ 2012-11-29 20:33               ` Thomas Hellstrom
  2012-11-29 21:58                 ` Marek Olšák
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-29 20:33 UTC (permalink / raw)
  To: Marek Olšák; +Cc: Jerome Glisse, dri-devel

On 11/29/2012 01:52 PM, Marek Olšák wrote:
> On Thu, Nov 29, 2012 at 9:04 AM, Thomas Hellstrom <thomas@shipmail.org> wrote:
>> On 11/29/2012 03:15 AM, Marek Olšák wrote:
>>> On Thu, Nov 29, 2012 at 12:44 AM, Alan Swanson <swanson@ukfsn.org> wrote:
>>>> On Wed, 2012-11-28 at 18:24 -0500, Jerome Glisse wrote:
>>>>> On Wed, Nov 28, 2012 at 6:18 PM, Thomas Hellstrom <thomas@shipmail.org>
>>>>> wrote:
>>>>>> On 11/28/2012 04:58 PM, j.glisse@gmail.com wrote:
>>>>>>> From: Jerome Glisse <jglisse@redhat.com>
>>>>>>>
>>>>>>> This patch add a minimum residency time configurable for each memory
>>>>>>> pool (VRAM, GTT, ...). Intention is to avoid having a lot of memory
>>>>>>> eviction from VRAM up to a point where the GPU pretty much spend all
>>>>>>> it's time moving things in and out.
>>>>>>
>>>>>> This patch seems odd to me.
>>>>>>
>>>>>> It seems the net effect is to refuse evictions from VRAM and make
>>>>>> buffers go
>>>>>> somewhere else, and that makes things faster?
>>>>>>
>>>>>> Why don't they go there in the first place instead of trying to force
>>>>>> them
>>>>>> into VRAM,
>>>>>> when VRAM is full?
>>>>>>
>>>>>> /Thomas
>>>>> It's mostly a side effect of cs and validating with each cs, if boA is
>>>>> in cs1 and not in cs2 and boB is in cs1 but not in cs2 than boA could
>>>>> be evicted by cs2 and boB moved in, if next cs ie cs3 is like cs1 then
>>>>> boA move back again and boB is evicted, then you get cs4 which
>>>>> reference boB but not boA, boA get evicted and boB move in ... So ttm
>>>>> just spend its time doing eviction but he doing so because it's ask by
>>>>> the driver to do so. Note that what is costly there is not the bo move
>>>>> in itself but the page allocation.
>>>>>
>>>>> I propose this patch to put a boundary on bo eviction frequency, i
>>>>> thought it might help other driver, if you set the residency time to 0
>>>>> you get the current behavior, if you don't you enforce a minimum
>>>>> residency time which helps driver like radeon. Of course a proper fix
>>>>> to the bo eviction for radeon has to be in radeon code and is mostly
>>>>> an overhaul of how we validate bo.
>>>>>
>>>>> But i still believe that this patch has value in itself by allowing
>>>>> driver to put a boundary on buffer movement frequency.
>>>>>
>>>>> Cheers,
>>>>> Jerome
>>>> So, a variation on John Carmack's recommendation from 2000 to use MRU,
>>>> not LRU, to avoid texture trashing.
>>>>
>>>>     Mar 07, 2000 - Virtualized video card local memory is The Right Thing.
>>>>     http://floodyberry.com/carmack/johnc_plan_2000.html
>>>>
>>>> In fact, this was last discussed in 2005 with a patch for a 1 second
>>>> stale texture eviction and I (still) wondered why a method it was never
>>>> implemented since it was an clear problem.
>>> BTW we can send end-of-frame markers to the kernel, which could be
>>> used to implement Carmack's algorithm.
>>>
>>> Marek
>>
>> It seems to me like Carmack's algorithm is quite specific to the case where
>> only a single GL client is running?
> In theory, we could send context IDs to the kernel as well and modify
> the conditional to "If the LRU texture was not needed in the previous
> frame of any context".
>
>
>> It also seems like it's designed around the fact that when eviction takes
>> place, all buffer objects will be idle. With a
>> reasonably filled graphics fifo / ring, blindly using MRU will cause the GPU
>> to run synchronized.
> I don't see why you would need to synchronize. If the GPU takes care
> of moving buffers in and out of VRAM and there's only one ring buffer
> ==> no synchronization is required.
The LRU bo has a much higher probability of being idle than the MRU bo, 
and waiting for it to become idle will in
principle synchronize the GPU and unnecessarily drain the ring.

/Thomas


> Marek




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-29 19:20     ` Marek Olšák
  2012-11-29 19:36       ` Jerome Glisse
@ 2012-11-29 20:40       ` Thomas Hellstrom
  1 sibling, 0 replies; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-29 20:40 UTC (permalink / raw)
  To: Marek Olšák; +Cc: dri-devel

On 11/29/2012 08:20 PM, Marek Olšák wrote:
> On Thu, Nov 29, 2012 at 10:18 AM, Thomas Hellstrom <thomas@shipmail.org> wrote:
>> On 11/28/2012 10:51 PM, Marek Olšák wrote:
>>> I think the problem with Radeon/TTM is much deeper. Let me demonstrate
>>> it on the following example.
>>>
>>> Unigine Heaven needs about 385MB of space for static resources, that's
>>> only 75% of my 512MB card. Yet, TTM is not capable of getting all of
>>> that into VRAM. If I allow GTT placements, I get 20 fps, which is the
>>> old Mesa behavior. If I force VRAM placements, I get 3 fps, because we
>>> validate buffers 10 times per frame and there's probably a lot of
>>> buffer evictions during each validation.
>>>
>> Marek,
>> Did you look at the total amount of referenced buffers in the ring including
>> vertex buffers?
>>
>> Depending on how hard you throttle, I guess vertex / index buffer data
>> referenced by the
>> ring commands may well exceed the VRAM limitation.
> Buffers (not textures) take only 30 MB. These are stats for 1 frame of
> Unigine Heaven. Each line is a CS ioctl.
>
> VRAM [used in CS] / [total allocated], GTT [used in CS] / [total allocated]
> 1. VRAM: 171 / 390 MB, GTT:   1 /   5 MB
> 2. VRAM: 144 / 390 MB, GTT:   2 /   5 MB
> 3. VRAM: 184 / 390 MB, GTT:   1 /   5 MB
> 4. VRAM:  35 / 390 MB, GTT:   2 /   5 MB
> 5. VRAM: 119 / 390 MB, GTT:   1 /   5 MB
> 6. VRAM: 207 / 390 MB, GTT:   1 /   5 MB
> 7. VRAM:  65 / 390 MB, GTT:   2 /   5 MB
>
> If I move all buffers (vertex, index, constant, streamout, queries,
> shader code, etc.) to GTT, this is how one frame looks like (not the
> same one though, but it's close):
>
> 1. VRAM: 144 / 359 MB, GTT:  16 /  35 MB
> 2. VRAM:  95 / 359 MB, GTT:  12 /  35 MB
> 3. VRAM: 178 / 359 MB, GTT:  15 /  35 MB
> 4. VRAM:  55 / 359 MB, GTT:  13 /  35 MB
> 5. VRAM:  22 / 359 MB, GTT:  16 /  35 MB
> 6. VRAM: 163 / 359 MB, GTT:  16 /  35 MB
> 7. VRAM: 133 / 359 MB, GTT:  11 /  35 MB
> 8. VRAM:  66 / 359 MB, GTT:   4 /  35 MB
>
> The stats are generated in the Mesa driver based on the driver's
> expectations where buffers should be placed.
>
> I can easily see how VRAM is thrashed with the strict LRU approach.
>
> Also, is it possible that one buffer is moved twice for a single CS
> ioctl? Imagine there's a buffer at the end of the relocation list,
> which is also at the head of the LRU list. Some buffer in the middle
> causes eviction of the last buffer. When the last buffer is validated,
> it's moved back to VRAM. Can it happen?

No. Typically that shouldn't happen. In a typical CS sequence, first all
buffers are reserved, and then all buffers are validated. Reservation
takes them off the LRU list. I'm not 100% sure Radeon does it this way,
but I think so.

/Thomas




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-29 20:33               ` Thomas Hellstrom
@ 2012-11-29 21:58                 ` Marek Olšák
  2012-11-30  8:38                   ` Thomas Hellstrom
  2012-11-30  9:39                   ` Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction] Thomas Hellstrom
  0 siblings, 2 replies; 34+ messages in thread
From: Marek Olšák @ 2012-11-29 21:58 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Jerome Glisse, dri-devel

On Thu, Nov 29, 2012 at 9:33 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> On 11/29/2012 01:52 PM, Marek Olšák wrote:
>>
>> On Thu, Nov 29, 2012 at 9:04 AM, Thomas Hellstrom <thomas@shipmail.org>
>> wrote:
>>>
>>> On 11/29/2012 03:15 AM, Marek Olšák wrote:
>>>>
>>>> On Thu, Nov 29, 2012 at 12:44 AM, Alan Swanson <swanson@ukfsn.org>
>>>> wrote:
>>>>>
>>>>> On Wed, 2012-11-28 at 18:24 -0500, Jerome Glisse wrote:
>>>>>>
>>>>>> On Wed, Nov 28, 2012 at 6:18 PM, Thomas Hellstrom
>>>>>> <thomas@shipmail.org>
>>>>>> wrote:
>>>>>>>
>>>>>>> On 11/28/2012 04:58 PM, j.glisse@gmail.com wrote:
>>>>>>>>
>>>>>>>> From: Jerome Glisse <jglisse@redhat.com>
>>>>>>>>
>>>>>>>> This patch add a minimum residency time configurable for each memory
>>>>>>>> pool (VRAM, GTT, ...). Intention is to avoid having a lot of memory
>>>>>>>> eviction from VRAM up to a point where the GPU pretty much spend all
>>>>>>>> it's time moving things in and out.
>>>>>>>
>>>>>>>
>>>>>>> This patch seems odd to me.
>>>>>>>
>>>>>>> It seems the net effect is to refuse evictions from VRAM and make
>>>>>>> buffers go
>>>>>>> somewhere else, and that makes things faster?
>>>>>>>
>>>>>>> Why don't they go there in the first place instead of trying to force
>>>>>>> them
>>>>>>> into VRAM,
>>>>>>> when VRAM is full?
>>>>>>>
>>>>>>> /Thomas
>>>>>>
>>>>>> It's mostly a side effect of cs and validating with each cs, if boA is
>>>>>> in cs1 and not in cs2 and boB is in cs1 but not in cs2 than boA could
>>>>>> be evicted by cs2 and boB moved in, if next cs ie cs3 is like cs1 then
>>>>>> boA move back again and boB is evicted, then you get cs4 which
>>>>>> reference boB but not boA, boA get evicted and boB move in ... So ttm
>>>>>> just spend its time doing eviction but he doing so because it's ask by
>>>>>> the driver to do so. Note that what is costly there is not the bo move
>>>>>> in itself but the page allocation.
>>>>>>
>>>>>> I propose this patch to put a boundary on bo eviction frequency, i
>>>>>> thought it might help other driver, if you set the residency time to 0
>>>>>> you get the current behavior, if you don't you enforce a minimum
>>>>>> residency time which helps driver like radeon. Of course a proper fix
>>>>>> to the bo eviction for radeon has to be in radeon code and is mostly
>>>>>> an overhaul of how we validate bo.
>>>>>>
>>>>>> But i still believe that this patch has value in itself by allowing
>>>>>> driver to put a boundary on buffer movement frequency.
>>>>>>
>>>>>> Cheers,
>>>>>> Jerome
>>>>>
>>>>> So, a variation on John Carmack's recommendation from 2000 to use MRU,
>>>>> not LRU, to avoid texture trashing.
>>>>>
>>>>>     Mar 07, 2000 - Virtualized video card local memory is The Right
>>>>> Thing.
>>>>>     http://floodyberry.com/carmack/johnc_plan_2000.html
>>>>>
>>>>> In fact, this was last discussed in 2005 with a patch for a 1 second
>>>>> stale texture eviction and I (still) wondered why a method it was never
>>>>> implemented since it was an clear problem.
>>>>
>>>> BTW we can send end-of-frame markers to the kernel, which could be
>>>> used to implement Carmack's algorithm.
>>>>
>>>> Marek
>>>
>>>
>>> It seems to me like Carmack's algorithm is quite specific to the case
>>> where
>>> only a single GL client is running?
>>
>> In theory, we could send context IDs to the kernel as well and modify
>> the conditional to "If the LRU texture was not needed in the previous
>> frame of any context".
>>
>>
>>> It also seems like it's designed around the fact that when eviction takes
>>> place, all buffer objects will be idle. With a
>>> reasonably filled graphics fifo / ring, blindly using MRU will cause the
>>> GPU
>>> to run synchronized.
>>
>> I don't see why you would need to synchronize. If the GPU takes care
>> of moving buffers in and out of VRAM and there's only one ring buffer
>> ==> no synchronization is required.
>
> The LRU bo has a much higher probability of being idle than the MRU bo, and
> waiting for it to become idle will in
> principle synchronize the GPU and unnecessarily drain the ring.

What I tried to point out was that the synchronization shouldn't be
needed, because the CPU shouldn't do anything with the contents of
evicted buffers. The GPU moves the buffers, not the CPU. What does the
CPU do besides updating some kernel structures?

Also, buffer deletion is something where you don't need to wait for
the buffer to become idle if you know the memory area won't be
mapped by the CPU, ever. The memory can be reclaimed right away. It
would be the GPU to move new data in and once that happens, the old
buffer will be trivially idle, because single-ring GPUs execute
commands in order.

Marek
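
Editorial sketch: the in-order argument above can be illustrated with a toy model of a single ring. All names here (`toy_ring`, `toy_submit`, `toy_retire`, `toy_done`) are hypothetical and made up for illustration; they are not TTM or radeon interfaces, and sequence-number wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a single in-order ring (hypothetical, not a kernel API). */
struct toy_ring {
    uint32_t last_submitted; /* seqno of the newest queued command */
    uint32_t last_completed; /* seqno of the newest finished command */
};

/* Queue a command (eviction copy, move-in copy, ...) and return its seqno. */
static uint32_t toy_submit(struct toy_ring *ring)
{
    return ++ring->last_submitted;
}

/* The GPU retires up to @count commands, strictly in submission order. */
static void toy_retire(struct toy_ring *ring, uint32_t count)
{
    uint32_t pending;

    assert(ring->last_completed <= ring->last_submitted);
    pending = ring->last_submitted - ring->last_completed;
    ring->last_completed += (count < pending) ? count : pending;
}

/* A command is done once the ring has retired its seqno. */
static bool toy_done(const struct toy_ring *ring, uint32_t seqno)
{
    return ring->last_completed >= seqno;
}
```

Because a later seqno can only complete after every earlier one, a move-in queued behind an eviction on the same ring never needs an explicit wait on that eviction.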
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction
  2012-11-29 21:58                 ` Marek Olšák
@ 2012-11-30  8:38                   ` Thomas Hellstrom
  2012-11-30  9:39                   ` Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction] Thomas Hellstrom
  1 sibling, 0 replies; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-30  8:38 UTC (permalink / raw)
  To: Marek Olšák; +Cc: Jerome Glisse, dri-devel

On 11/29/2012 10:58 PM, Marek Olšák wrote:
> On Thu, Nov 29, 2012 at 9:33 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
>> On 11/29/2012 01:52 PM, Marek Olšák wrote:
>>> On Thu, Nov 29, 2012 at 9:04 AM, Thomas Hellstrom <thomas@shipmail.org>
>>> wrote:
>>>> On 11/29/2012 03:15 AM, Marek Olšák wrote:
>>>>> On Thu, Nov 29, 2012 at 12:44 AM, Alan Swanson <swanson@ukfsn.org>
>>>>> wrote:
>>>>>> On Wed, 2012-11-28 at 18:24 -0500, Jerome Glisse wrote:
>>>>>>> On Wed, Nov 28, 2012 at 6:18 PM, Thomas Hellstrom
>>>>>>> <thomas@shipmail.org>
>>>>>>> wrote:
>>>>>>>> On 11/28/2012 04:58 PM, j.glisse@gmail.com wrote:
>>>>>>>>> From: Jerome Glisse <jglisse@redhat.com>
>>>>>>>>>
>>>>>>>>> This patch add a minimum residency time configurable for each memory
>>>>>>>>> pool (VRAM, GTT, ...). Intention is to avoid having a lot of memory
>>>>>>>>> eviction from VRAM up to a point where the GPU pretty much spend all
>>>>>>>>> it's time moving things in and out.
>>>>>>>>
>>>>>>>> This patch seems odd to me.
>>>>>>>>
>>>>>>>> It seems the net effect is to refuse evictions from VRAM and make
>>>>>>>> buffers go
>>>>>>>> somewhere else, and that makes things faster?
>>>>>>>>
>>>>>>>> Why don't they go there in the first place instead of trying to force
>>>>>>>> them
>>>>>>>> into VRAM,
>>>>>>>> when VRAM is full?
>>>>>>>>
>>>>>>>> /Thomas
>>>>>>> It's mostly a side effect of cs and validating with each cs, if boA is
>>>>>>> in cs1 and not in cs2 and boB is in cs1 but not in cs2 than boA could
>>>>>>> be evicted by cs2 and boB moved in, if next cs ie cs3 is like cs1 then
>>>>>>> boA move back again and boB is evicted, then you get cs4 which
>>>>>>> reference boB but not boA, boA get evicted and boB move in ... So ttm
>>>>>>> just spend its time doing eviction but he doing so because it's ask by
>>>>>>> the driver to do so. Note that what is costly there is not the bo move
>>>>>>> in itself but the page allocation.
>>>>>>>
>>>>>>> I propose this patch to put a boundary on bo eviction frequency, i
>>>>>>> thought it might help other driver, if you set the residency time to 0
>>>>>>> you get the current behavior, if you don't you enforce a minimum
>>>>>>> residency time which helps driver like radeon. Of course a proper fix
>>>>>>> to the bo eviction for radeon has to be in radeon code and is mostly
>>>>>>> an overhaul of how we validate bo.
>>>>>>>
>>>>>>> But i still believe that this patch has value in itself by allowing
>>>>>>> driver to put a boundary on buffer movement frequency.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jerome
>>>>>> So, a variation on John Carmack's recommendation from 2000 to use MRU,
>>>>>> not LRU, to avoid texture trashing.
>>>>>>
>>>>>>      Mar 07, 2000 - Virtualized video card local memory is The Right
>>>>>> Thing.
>>>>>>      http://floodyberry.com/carmack/johnc_plan_2000.html
>>>>>>
>>>>>> In fact, this was last discussed in 2005 with a patch for a 1 second
>>>>>> stale texture eviction and I (still) wondered why a method it was never
>>>>>> implemented since it was an clear problem.
>>>>> BTW we can send end-of-frame markers to the kernel, which could be
>>>>> used to implement Carmack's algorithm.
>>>>>
>>>>> Marek
>>>>
>>>> It seems to me like Carmack's algorithm is quite specific to the case
>>>> where
>>>> only a single GL client is running?
>>> In theory, we could send context IDs to the kernel as well and modify
>>> the conditional to "If the LRU texture was not needed in the previous
>>> frame of any context".
>>>
>>>
>>>> It also seems like it's designed around the fact that when eviction takes
>>>> place, all buffer objects will be idle. With a
>>>> reasonably filled graphics fifo / ring, blindly using MRU will cause the
>>>> GPU
>>>> to run synchronized.
>>> I don't see why you would need to synchronize. If the GPU takes care
>>> of moving buffers in and out of VRAM and there's only one ring buffer
>>> ==> no synchronization is required.
>> The LRU bo has a much higher probability of being idle than the MRU bo, and
>> waiting for it to become idle will in
>> principle synchronize the GPU and unnecessarily drain the ring.
> What I tried to point out was that the synchronization shouldn't be
> needed, because the CPU shouldn't do anything with the contents of
> evicted buffers. The GPU moves the buffers, not the CPU. What does the
> CPU do besides updating some kernel structures?
>
> Also, buffer deletion is something where you don't need to wait for
> the buffer to become idle if you know the memory area won't be
> mapped by the CPU, ever. The memory can be reclaimed right away. It
> would be the GPU to move new data in and once that happens, the old
> buffer will be trivially idle, because single-ring GPUs execute
> commands in order.

Yes, you're right. Sorry about that.

/Thomas




* Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]
  2012-11-29 21:58                 ` Marek Olšák
  2012-11-30  8:38                   ` Thomas Hellstrom
@ 2012-11-30  9:39                   ` Thomas Hellstrom
  2012-11-30 16:30                     ` Jerome Glisse
  1 sibling, 1 reply; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-30  9:39 UTC (permalink / raw)
  To: Marek Olšák, Jerome Glisse; +Cc: dri-devel

On 11/29/2012 10:58 PM, Marek Olšák wrote:
>
> What I tried to point out was that the synchronization shouldn't be
> needed, because the CPU shouldn't do anything with the contents of
> evicted buffers. The GPU moves the buffers, not the CPU. What does the
> CPU do besides updating some kernel structures?
>
> Also, buffer deletion is something where you don't need to wait for
> the buffer to become idle if you know the memory area won't be
> mapped by the CPU, ever. The memory can be reclaimed right away. It
> would be the GPU to move new data in and once that happens, the old
> buffer will be trivially idle, because single-ring GPUs execute
> commands in order.
>
> Marek

Actually, asynchronous eviction / deletion is something I have been
prototyping for a while but never got around to implementing in TTM.

There are a few minor caveats:

With buffer deletion, what you say is true for fixed memory, but not for 
TT memory where pages are reclaimed by the system after buffer 
destruction. That means that we don't have to wait for idle to free GPU 
space, but we need to wait before pages are handed back to the system.

Swapout needs to access the contents of evicted buffers, but 
synchronizing doesn't need to happen until just before swapout.

Multi-ring - CPU support: If another ring / engine or the CPU is about 
to move in buffer contents to VRAM or a GPU aperture that was previously 
evicted by another ring, it needs to sync with that eviction, but 
doesn't know what buffer or even which buffers occupied the space 
previously. Trivially one can attach a sync object to the memory type 
manager that represents the last eviction from that memory type, and 
*any* engine (CPU or GPU) that moves buffer contents in needs to order 
that movement with respect to that fence. As you say, with a single ring 
and no CPU fallbacks, that ordering is a no-op, but any common 
(non-driver based) implementation needs to support this.

A single fence attached to the memory type manager is the simplest
solution, but a solution with a fence for each free region in the free
list is also possible. Then TTM needs a driver callback to be able to
order fences w.r.t. each other.

/Thomas
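
Editorial sketch: the "single fence on the memory type manager" idea can be modeled roughly as below. This is only an illustration under stated assumptions: `toy_fence`, `toy_mem_type`, `TOY_CPU` and `toy_move_in_needs_sync` are hypothetical names, not existing TTM structures, and combining evictions from several rings into that one fence (the part that needs the driver callback) is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CPU (-1) /* pseudo ring id for CPU copies (assumption) */

struct toy_fence {
    int ring;       /* ring that emitted the fence */
    unsigned seqno; /* position of the fence on that ring */
};

/* Per memory type state: the fence of the last asynchronous eviction. */
struct toy_mem_type {
    const struct toy_fence *last_evict; /* NULL: nothing evicted yet */
};

/*
 * Does a move-in performed by @mover have to be explicitly ordered
 * after the last eviction from this memory type?
 */
static bool toy_move_in_needs_sync(const struct toy_mem_type *man, int mover)
{
    assert(man != NULL);
    if (man->last_evict == NULL)
        return false; /* no eviction to order against */
    if (mover == TOY_CPU)
        return true;  /* the CPU is never ordered by a GPU ring */
    /* Same ring: in-order execution makes the ordering a no-op. */
    return man->last_evict->ring != mover;
}
```

With a single ring and no CPU fallback the function always returns false, which matches the "no-op" case described above.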


* Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]
  2012-11-30  9:39                   ` Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction] Thomas Hellstrom
@ 2012-11-30 16:30                     ` Jerome Glisse
  2012-11-30 17:08                       ` Thomas Hellstrom
  0 siblings, 1 reply; 34+ messages in thread
From: Jerome Glisse @ 2012-11-30 16:30 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Jerome Glisse, dri-devel

On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>
>>
>> What I tried to point out was that the synchronization shouldn't be
>> needed, because the CPU shouldn't do anything with the contents of
>> evicted buffers. The GPU moves the buffers, not the CPU. What does the
>> CPU do besides updating some kernel structures?
>>
>> Also, buffer deletion is something where you don't need to wait for
>> the buffer to become idle if you know the memory area won't be
>> mapped by the CPU, ever. The memory can be reclaimed right away. It
>> would be the GPU to move new data in and once that happens, the old
>> buffer will be trivially idle, because single-ring GPUs execute
>> commands in order.
>>
>> Marek
>
>
> Actually asynchronous eviction / deletion is something I have been
> prototyping for a while but never gotten around to implement in TTM:
>
> There are a few minor caveats:
>
> With buffer deletion, what you say is true for fixed memory, but not for TT
> memory where pages are reclaimed by the system after buffer destruction.
> That means that we don't have to wait for idle to free GPU space, but we
> need to wait before pages are handed back to the system.
>
> Swapout needs to access the contents of evicted buffers, but synchronizing
> doesn't need to happen until just before swapout.
>
> Multi-ring - CPU support: If another ring / engine or the CPU is about to
> move in buffer contents to VRAM or a GPU aperture that was previously
> evicted by another ring, it needs to sync with that eviction, but doesn't
> know what buffer or even which buffers occupied the space previously.
> Trivially one can attach a sync object to the memory type manager that
> represents the last eviction from that memory type, and *any* engine (CPU or
> GPU) that moves buffer contents in needs to order that movement with respect
> to that fence. As you say, with a single ring and no CPU fallbacks, that
> ordering is a no-op, but any common (non-driver based) implementation needs
> to support this.
>
> A single fence attached to the memory type manager is the simplest solution,
> but a solution with a fence for each free region in the free list is also
> possible. Then TTM needs a driver callback to be able order fences w r t
> echother.
>
> /Thomas
>

Radeon already handles multi-ring and ttm interaction with what we call
semaphores. Semaphores are created to synchronize with fences across
different rings. I think the easiest solution is to just remove the bo
wait in ttm and let the driver handle this.

Cheers,
Jerome

* Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]
  2012-11-30 16:30                     ` Jerome Glisse
@ 2012-11-30 17:08                       ` Thomas Hellstrom
  2012-11-30 17:18                         ` Jerome Glisse
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-30 17:08 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Jerome Glisse, dri-devel

On 11/30/2012 05:30 PM, Jerome Glisse wrote:
> On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas@shipmail.org> wrote:
>> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>>
>>> What I tried to point out was that the synchronization shouldn't be
>>> needed, because the CPU shouldn't do anything with the contents of
>>> evicted buffers. The GPU moves the buffers, not the CPU. What does the
>>> CPU do besides updating some kernel structures?
>>>
>>> Also, buffer deletion is something where you don't need to wait for
>>> the buffer to become idle if you know the memory area won't be
>>> mapped by the CPU, ever. The memory can be reclaimed right away. It
>>> would be the GPU to move new data in and once that happens, the old
>>> buffer will be trivially idle, because single-ring GPUs execute
>>> commands in order.
>>>
>>> Marek
>>
>> Actually asynchronous eviction / deletion is something I have been
>> prototyping for a while but never gotten around to implement in TTM:
>>
>> There are a few minor caveats:
>>
>> With buffer deletion, what you say is true for fixed memory, but not for TT
>> memory where pages are reclaimed by the system after buffer destruction.
>> That means that we don't have to wait for idle to free GPU space, but we
>> need to wait before pages are handed back to the system.
>>
>> Swapout needs to access the contents of evicted buffers, but synchronizing
>> doesn't need to happen until just before swapout.
>>
>> Multi-ring - CPU support: If another ring / engine or the CPU is about to
>> move in buffer contents to VRAM or a GPU aperture that was previously
>> evicted by another ring, it needs to sync with that eviction, but doesn't
>> know what buffer or even which buffers occupied the space previously.
>> Trivially one can attach a sync object to the memory type manager that
>> represents the last eviction from that memory type, and *any* engine (CPU or
>> GPU) that moves buffer contents in needs to order that movement with respect
>> to that fence. As you say, with a single ring and no CPU fallbacks, that
>> ordering is a no-op, but any common (non-driver based) implementation needs
>> to support this.
>>
>> A single fence attached to the memory type manager is the simplest solution,
>> but a solution with a fence for each free region in the free list is also
>> possible. Then TTM needs a driver callback to be able order fences w r t
>> echother.
>>
>> /Thomas
>>
> Radeon already handle multi-ring and ttm interaction with what we call
> semaphore. Semaphore are created to synchronize with fence accross
> different ring. I think the easiest solution is to just remove the bo
> wait in ttm and let driver handle this.

The wait can be removed, but only conditioned on a driver flag that says
it supports unsynchronized buffer moves.

The multi-ring case I'm talking about is:

Ring 1 evicts buffer A, emits fence 0
Ring 2 evicts buffer B, emits fence 1
..Other evictions take place on various rings, perhaps including ring 1
and ring 2.
Ring 3 moves buffer C into the space which happens to be the union of
the space previously occupied by buffer A and buffer B.

The question is: which fence do you want to order this move with?
The answer is whichever of fence 0 and fence 1 signals last.

I think it's a reasonable thing for TTM to keep track of this, but in
order to do so it needs a driver callback that can order two fences, and
can order a job in the current ring w.r.t. a fence. In radeon's case that
driver callback would probably insert a barrier / semaphore. In the case
of simpler hardware it would wait on one of the fences.

/Thomas
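
Editorial sketch: the "order a job in the current ring w.r.t. a fence" half of that callback could look roughly like this on hardware with semaphores. The command types, `toy_ring`, `toy_sync_to_fence` and `toy_move_in` are hypothetical names for illustration only; this is not radeon's actual semaphore code.

```c
#include <assert.h>
#include <stddef.h>

enum toy_cmd_type { TOY_CMD_WAIT, TOY_CMD_COPY };

struct toy_fence {
    int ring;       /* ring that emitted the fence */
    unsigned seqno;
};

struct toy_cmd {
    enum toy_cmd_type type;
    const struct toy_fence *fence; /* only meaningful for TOY_CMD_WAIT */
};

#define TOY_RING_SIZE 16

struct toy_ring {
    int id;
    size_t count;
    struct toy_cmd cmds[TOY_RING_SIZE];
};

/* Append one command to the ring. */
static void toy_emit(struct toy_ring *ring, enum toy_cmd_type type,
                     const struct toy_fence *fence)
{
    assert(ring->count < TOY_RING_SIZE);
    ring->cmds[ring->count].type = type;
    ring->cmds[ring->count].fence = fence;
    ring->count++;
}

/* Order the next job on @ring after @fence. */
static void toy_sync_to_fence(struct toy_ring *ring,
                              const struct toy_fence *fence)
{
    if (fence->ring == ring->id)
        return; /* same ring: in-order execution already guarantees it */
    toy_emit(ring, TOY_CMD_WAIT, fence); /* semaphore-style wait */
}

/* Move a buffer into space freed by the evictions behind @fences. */
static void toy_move_in(struct toy_ring *ring,
                        const struct toy_fence *const *fences, size_t n)
{
    for (size_t i = 0; i < n; i++)
        toy_sync_to_fence(ring, fences[i]);
    toy_emit(ring, TOY_CMD_COPY, NULL);
}
```

In the scenario above, ring 3 would receive fence 0 and fence 1 and emit two waits followed by the copy of buffer C, while ring 1 moving into space guarded only by fence 0 would emit the copy alone.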

>
> Cheers,
> Jerome




* Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]
  2012-11-30 17:08                       ` Thomas Hellstrom
@ 2012-11-30 17:18                         ` Jerome Glisse
  2012-11-30 17:43                           ` Thomas Hellstrom
  0 siblings, 1 reply; 34+ messages in thread
From: Jerome Glisse @ 2012-11-30 17:18 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Jerome Glisse, dri-devel

On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> On 11/30/2012 05:30 PM, Jerome Glisse wrote:
>>
>> On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas@shipmail.org>
>> wrote:
>>>
>>> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>>>
>>>>
>>>> What I tried to point out was that the synchronization shouldn't be
>>>> needed, because the CPU shouldn't do anything with the contents of
>>>> evicted buffers. The GPU moves the buffers, not the CPU. What does the
>>>> CPU do besides updating some kernel structures?
>>>>
>>>> Also, buffer deletion is something where you don't need to wait for
>>>> the buffer to become idle if you know the memory area won't be
>>>> mapped by the CPU, ever. The memory can be reclaimed right away. It
>>>> would be the GPU to move new data in and once that happens, the old
>>>> buffer will be trivially idle, because single-ring GPUs execute
>>>> commands in order.
>>>>
>>>> Marek
>>>
>>>
>>> Actually asynchronous eviction / deletion is something I have been
>>> prototyping for a while but never gotten around to implement in TTM:
>>>
>>> There are a few minor caveats:
>>>
>>> With buffer deletion, what you say is true for fixed memory, but not for
>>> TT
>>> memory where pages are reclaimed by the system after buffer destruction.
>>> That means that we don't have to wait for idle to free GPU space, but we
>>> need to wait before pages are handed back to the system.
>>>
>>> Swapout needs to access the contents of evicted buffers, but
>>> synchronizing
>>> doesn't need to happen until just before swapout.
>>>
>>> Multi-ring - CPU support: If another ring / engine or the CPU is about to
>>> move in buffer contents to VRAM or a GPU aperture that was previously
>>> evicted by another ring, it needs to sync with that eviction, but doesn't
>>> know what buffer or even which buffers occupied the space previously.
>>> Trivially one can attach a sync object to the memory type manager that
>>> represents the last eviction from that memory type, and *any* engine (CPU
>>> or
>>> GPU) that moves buffer contents in needs to order that movement with
>>> respect
>>> to that fence. As you say, with a single ring and no CPU fallbacks, that
>>> ordering is a no-op, but any common (non-driver based) implementation
>>> needs
>>> to support this.
>>>
>>> A single fence attached to the memory type manager is the simplest
>>> solution,
>>> but a solution with a fence for each free region in the free list is also
>>> possible. Then TTM needs a driver callback to be able order fences w r t
>>> echother.
>>>
>>> /Thomas
>>>
>> Radeon already handle multi-ring and ttm interaction with what we call
>> semaphore. Semaphore are created to synchronize with fence accross
>> different ring. I think the easiest solution is to just remove the bo
>> wait in ttm and let driver handle this.
>
>
> The wait can be removed, but only conditioned on a driver flag that says it
> supports unsynchronous buffer moves.
>
> The multi-ring case I'm talking about is:
>
> Ring 1 evicts buffer A, emits fence 0
> Ring 2 evicts buffer B, emits fence 1
> ..Other eviction takes place by various rings, perhaps including ring 1 and
> ring 2.
> Ring 3 moves buffer C into the space which happens bo be the union of the
> space prevously occupied buffer A and buffer B.
>
> Question is: which fence do you want to order this move with?
> The answer is whichever of fence 0 and 1 signals last.
>
> I think it's a reasonable thing for TTM to keep track of this, but in order
> to do so it needs a driver callback that
> can order two fences, and can order a job in the current ring w r t a fence.
> In radeon's case that driver callback
> would probably insert a barrier / semaphore. In the case of simpler hardware
> it would wait on one of the fences.
>
> /Thomas
>

I don't think we can order fences easily with a clean API. I would
rather see ttm provide a list of fences to the driver and tell the
driver that all the fences on this list need to be completed before
moving this object. I think it's as easy as associating fences with
drm_mm (well, nouveau has its own mm stuff), but the idea would basically
be that fences are associated both with the bo and with the mm object, so
you know when a segment of memory is idle/available for use.

Cheers,
Jerome
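
Editorial sketch: associating fences with ranges of the memory manager, as suggested above, might be modeled like this. `toy_free_region` and `toy_collect_fences` are hypothetical names used only for illustration; real drm_mm nodes carry no such fence pointer in this discussion, so treat the layout as an assumption.

```c
#include <assert.h>
#include <stddef.h>

struct toy_fence {
    int ring;
    unsigned seqno;
};

/* A free range [start, end) plus the fence of the eviction that freed it. */
struct toy_free_region {
    unsigned long start;
    unsigned long end;
    const struct toy_fence *fence; /* NULL if the range is already idle */
};

/*
 * Collect the fences that must complete before [start, end) can be reused.
 * Returns the number of fences written to @out; fences beyond @max are
 * silently dropped in this sketch, so size @out generously.
 */
static size_t toy_collect_fences(const struct toy_free_region *regions,
                                 size_t n,
                                 unsigned long start, unsigned long end,
                                 const struct toy_fence **out, size_t max)
{
    size_t count = 0;

    assert(start < end);
    for (size_t i = 0; i < n; i++) {
        const struct toy_free_region *r = &regions[i];

        /* Skip ranges that are idle or do not overlap the request. */
        if (r->fence == NULL || r->end <= start || r->start >= end)
            continue;
        if (count < max)
            out[count++] = r->fence;
    }
    return count;
}
```

The driver would then be handed the collected list and asked to make sure every fence on it has completed before the move into that range executes.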

* Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]
  2012-11-30 17:18                         ` Jerome Glisse
@ 2012-11-30 17:43                           ` Thomas Hellstrom
  2012-11-30 18:07                             ` Jerome Glisse
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-30 17:43 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Jerome Glisse, dri-devel

On 11/30/2012 06:18 PM, Jerome Glisse wrote:
> On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
>> On 11/30/2012 05:30 PM, Jerome Glisse wrote:
>>> On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas@shipmail.org>
>>> wrote:
>>>> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>>>>
>>>>> What I tried to point out was that the synchronization shouldn't be
>>>>> needed, because the CPU shouldn't do anything with the contents of
>>>>> evicted buffers. The GPU moves the buffers, not the CPU. What does the
>>>>> CPU do besides updating some kernel structures?
>>>>>
>>>>> Also, buffer deletion is something where you don't need to wait for
>>>>> the buffer to become idle if you know the memory area won't be
>>>>> mapped by the CPU, ever. The memory can be reclaimed right away. It
>>>>> would be the GPU to move new data in and once that happens, the old
>>>>> buffer will be trivially idle, because single-ring GPUs execute
>>>>> commands in order.
>>>>>
>>>>> Marek
>>>>
>>>> Actually asynchronous eviction / deletion is something I have been
>>>> prototyping for a while but never gotten around to implement in TTM:
>>>>
>>>> There are a few minor caveats:
>>>>
>>>> With buffer deletion, what you say is true for fixed memory, but not for
>>>> TT
>>>> memory where pages are reclaimed by the system after buffer destruction.
>>>> That means that we don't have to wait for idle to free GPU space, but we
>>>> need to wait before pages are handed back to the system.
>>>>
>>>> Swapout needs to access the contents of evicted buffers, but
>>>> synchronizing
>>>> doesn't need to happen until just before swapout.
>>>>
>>>> Multi-ring - CPU support: If another ring / engine or the CPU is about to
>>>> move in buffer contents to VRAM or a GPU aperture that was previously
>>>> evicted by another ring, it needs to sync with that eviction, but doesn't
>>>> know what buffer or even which buffers occupied the space previously.
>>>> Trivially one can attach a sync object to the memory type manager that
>>>> represents the last eviction from that memory type, and *any* engine (CPU
>>>> or
>>>> GPU) that moves buffer contents in needs to order that movement with
>>>> respect
>>>> to that fence. As you say, with a single ring and no CPU fallbacks, that
>>>> ordering is a no-op, but any common (non-driver based) implementation
>>>> needs
>>>> to support this.
>>>>
>>>> A single fence attached to the memory type manager is the simplest
>>>> solution,
>>>> but a solution with a fence for each free region in the free list is also
>>>> possible. Then TTM needs a driver callback to be able order fences w r t
>>>> echother.
>>>>
>>>> /Thomas
>>>>
>>> Radeon already handle multi-ring and ttm interaction with what we call
>>> semaphore. Semaphore are created to synchronize with fence accross
>>> different ring. I think the easiest solution is to just remove the bo
>>> wait in ttm and let driver handle this.
>>
>> The wait can be removed, but only conditioned on a driver flag that says it
>> supports unsynchronous buffer moves.
>>
>> The multi-ring case I'm talking about is:
>>
>> Ring 1 evicts buffer A, emits fence 0
>> Ring 2 evicts buffer B, emits fence 1
>> ..Other eviction takes place by various rings, perhaps including ring 1 and
>> ring 2.
>> Ring 3 moves buffer C into the space which happens bo be the union of the
>> space prevously occupied buffer A and buffer B.
>>
>> Question is: which fence do you want to order this move with?
>> The answer is whichever of fence 0 and 1 signals last.
>>
>> I think it's a reasonable thing for TTM to keep track of this, but in order
>> to do so it needs a driver callback that
>> can order two fences, and can order a job in the current ring w r t a fence.
>> In radeon's case that driver callback
>> would probably insert a barrier / semaphore. In the case of simpler hardware
>> it would wait on one of the fences.
>>
>> /Thomas
>>
> I don't think we can order fence easily with a clean api, i would
> rather see ttm provide a list of fence to driver and tell to the
> driver before moving this object all the fence on this list need to be
> completed. I think it's as easy as associating fence with drm_mm (well
> nouveau as its own mm stuff) but idea would basicly be that fence are
> both associated with bo and with mm object so you know when a segment
> of memory is idle/available for use.
>
> Cheers,
> Jerome


Hmm. Agreed, that would save a lot of barriers.

Whether TTM tracks fences per free mm region or keeps a single fence for
the whole memory type, it's a simple fact that fences from the same ring
are trivially ordered, which means such a list should contain at most as
many fences as there are rings.

So, whatever approach is chosen, TTM needs to be able to determine that
trivial ordering, and I think the upcoming cross-device fencing work will
face the exact same problem.

My proposed ordering API would look something like

    struct fence *order_fences(struct fence *fence_a, struct fence *fence_b,
                               bool trivial_order, bool interruptible,
                               bool no_wait_gpu);

It returns whichever of the fences @fence_a and @fence_b guarantees, once
signaled, that the other fence has also signaled. If @trivial_order is
true and the driver cannot trivially order the fences, it may return
ERR_PTR(-EAGAIN). If @interruptible is true, any wait should be performed
interruptibly, and if @no_wait_gpu is true, the function is not allowed to
wait on the GPU but should return ERR_PTR(-EBUSY) if it would need to do
so to order the fences.

(Hardware without semaphores can't order fences without waiting on them.)

The list approach you suggest would use @trivial_order = true; the single
fence approach would use @trivial_order = false.

And a first simple implementation in TTM would perhaps use your list
approach with a single list for the whole memory type.

/Thomas
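
Editorial sketch: a userspace model of the proposed ordering callback for hardware without semaphores. It departs from the proposal in labeled ways: it returns an int error code and passes the result through an out-parameter instead of using ERR_PTR, it omits @interruptible because the stand-in wait cannot be interrupted, and it assumes that "trivially ordered" means "emitted by the same ring". `toy_fence`, `toy_fence_wait` and `toy_order_fences` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_fence {
    int ring;
    unsigned seqno;
    bool signaled;
};

/* Stand-in for a blocking wait; a real driver would sleep here. */
static void toy_fence_wait(struct toy_fence *fence)
{
    fence->signaled = true;
}

/*
 * On success returns 0 and stores in *last the fence whose signaling
 * implies that both @a and @b have signaled.
 */
static int toy_order_fences(struct toy_fence *a, struct toy_fence *b,
                            bool trivial_order, bool no_wait_gpu,
                            struct toy_fence **last)
{
    assert(last != NULL);
    if (a->ring == b->ring) {
        /* Same ring: the later seqno signals last (wraparound ignored). */
        *last = (a->seqno >= b->seqno) ? a : b;
        return 0;
    }
    if (trivial_order)
        return -EAGAIN; /* caller keeps both fences on its list */
    if (a->signaled) {
        *last = b;
        return 0;
    }
    if (b->signaled) {
        *last = a;
        return 0;
    }
    if (no_wait_gpu)
        return -EBUSY; /* ordering would require waiting on the GPU */
    /* No semaphores: wait for one fence, then the other one is "last". */
    toy_fence_wait(a);
    *last = b;
    return 0;
}
```

A list-based caller would pass trivial_order = true and simply keep both fences on -EAGAIN, which is what bounds the list at one fence per ring.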




* Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]
  2012-11-30 17:43                           ` Thomas Hellstrom
@ 2012-11-30 18:07                             ` Jerome Glisse
  2012-11-30 18:31                               ` Thomas Hellstrom
  0 siblings, 1 reply; 34+ messages in thread
From: Jerome Glisse @ 2012-11-30 18:07 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Jerome Glisse, dri-devel

On Fri, Nov 30, 2012 at 06:43:36PM +0100, Thomas Hellstrom wrote:
> On 11/30/2012 06:18 PM, Jerome Glisse wrote:
> >On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> >>On 11/30/2012 05:30 PM, Jerome Glisse wrote:
> >>>On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas@shipmail.org>
> >>>wrote:
> >>>>On 11/29/2012 10:58 PM, Marek Olšák wrote:
> >>>>>
> >>>>>What I tried to point out was that the synchronization shouldn't be
> >>>>>needed, because the CPU shouldn't do anything with the contents of
> >>>>>evicted buffers. The GPU moves the buffers, not the CPU. What does the
> >>>>>CPU do besides updating some kernel structures?
> >>>>>
> >>>>>Also, buffer deletion is something where you don't need to wait for
> >>>>>the buffer to become idle if you know the memory area won't be
> >>>>>mapped by the CPU, ever. The memory can be reclaimed right away. It
> >>>>>would be the GPU to move new data in and once that happens, the old
> >>>>>buffer will be trivially idle, because single-ring GPUs execute
> >>>>>commands in order.
> >>>>>
> >>>>>Marek
> >>>>
> >>>>Actually asynchronous eviction / deletion is something I have been
> >>>>prototyping for a while but never gotten around to implement in TTM:
> >>>>
> >>>>There are a few minor caveats:
> >>>>
> >>>>With buffer deletion, what you say is true for fixed memory, but not for
> >>>>TT
> >>>>memory where pages are reclaimed by the system after buffer destruction.
> >>>>That means that we don't have to wait for idle to free GPU space, but we
> >>>>need to wait before pages are handed back to the system.
> >>>>
> >>>>Swapout needs to access the contents of evicted buffers, but
> >>>>synchronizing
> >>>>doesn't need to happen until just before swapout.
> >>>>
> >>>>Multi-ring - CPU support: If another ring / engine or the CPU is about to
> >>>>move in buffer contents to VRAM or a GPU aperture that was previously
> >>>>evicted by another ring, it needs to sync with that eviction, but doesn't
> >>>>know what buffer or even which buffers occupied the space previously.
> >>>>Trivially one can attach a sync object to the memory type manager that
> >>>>represents the last eviction from that memory type, and *any* engine (CPU
> >>>>or
> >>>>GPU) that moves buffer contents in needs to order that movement with
> >>>>respect
> >>>>to that fence. As you say, with a single ring and no CPU fallbacks, that
> >>>>ordering is a no-op, but any common (non-driver based) implementation
> >>>>needs
> >>>>to support this.
> >>>>
> >>>>A single fence attached to the memory type manager is the simplest
> >>>>solution,
> >>>>but a solution with a fence for each free region in the free list is also
> >>>>possible. Then TTM needs a driver callback to be able order fences w r t
> >>>>echother.
> >>>>
> >>>>/Thomas
> >>>>
> >>>Radeon already handle multi-ring and ttm interaction with what we call
> >>>semaphore. Semaphore are created to synchronize with fence accross
> >>>different ring. I think the easiest solution is to just remove the bo
> >>>wait in ttm and let driver handle this.
> >>
> >>The wait can be removed, but only conditioned on a driver flag that says it
> >>supports unsynchronous buffer moves.
> >>
> >>The multi-ring case I'm talking about is:
> >>
> >>Ring 1 evicts buffer A, emits fence 0
> >>Ring 2 evicts buffer B, emits fence 1
> >>..Other eviction takes place by various rings, perhaps including ring 1 and
> >>ring 2.
> >>Ring 3 moves buffer C into the space which happens bo be the union of the
> >>space prevously occupied buffer A and buffer B.
> >>
> >>Question is: which fence do you want to order this move with?
> >>The answer is whichever of fence 0 and 1 signals last.
> >>
> >>I think it's a reasonable thing for TTM to keep track of this, but in order
> >>to do so it needs a driver callback that
> >>can order two fences, and can order a job in the current ring w r t a fence.
> >>In radeon's case that driver callback
> >>would probably insert a barrier / semaphore. In the case of simpler hardware
> >>it would wait on one of the fences.
> >>
> >>/Thomas
> >>
> >I don't think we can order fence easily with a clean api, i would
> >rather see ttm provide a list of fence to driver and tell to the
> >driver before moving this object all the fence on this list need to be
> >completed. I think it's as easy as associating fence with drm_mm (well
> >nouveau as its own mm stuff) but idea would basicly be that fence are
> >both associated with bo and with mm object so you know when a segment
> >of memory is idle/available for use.
> >
> >Cheers,
> >Jerome
> 
> 
> Hmm. Agreed that would save a lot of barriers.
> 
> Even if TTM tracks fences by free mm regions or a single fence for
> the whole memory type, it's a simple fact that fences from the same
> ring are trivially ordered, which means such a list should contain at
> most as many fences as there are rings.

Yes, one callback is needed to know which fences are necessary. TTM
also needs to know the number of rings (note that I think newer hw
will have something like 1024 rings or even more; even today's hw
might have that many, as I think an nvidia channel is pretty much what
I define to be a ring).

But I think most cases will involve few fences across few rings. For
example, one ring is the dma ring, then you have a ring for one of the
GL contexts that is using the memory and another ring for the new
context that wants to use the memory.

> 
> So, whatever approach is chosen, TTM needs to be able to determine
> that trivial ordering, and I think the upcoming cross-device fencing
> work will face the exact same problem.
> 
> My proposed ordering API would look something like
> 
> struct fence *order_fences(struct fence *fence_a, struct fence
> *fence_b, bool trivial_order, bool interruptible, bool no_wait_gpu)
> 
> Returns which of the fences @fence_a and @fence_b that when
> signaled, guarantees that also the other
> fence has signaled. If @quick_order is true, and the driver cannot
> trivially order the fences, it may return ERR_PTR(-EAGAIN),
> if @interruptible is true, any wait should be performed
> interruptibly and if no_wait_gpu is true, the function is not
> allowed to wait on gpu but should issue ERR_PTR(-EBUSY) if it needs
> to do so to order fences.
> 
> (Hardware without semaphores can't order fences without waiting on them).
> 
> The list approach you suggest would use @trivial_order = true,
> Single fence approach would use @trivial_order = false.
> 
> And a first simple implementation in TTM would perhaps use your list
> approach with a single list for the whole memory type.
> 
> /Thomas

I would rather add a callback like :

ttm_reduce_fences(unsigned *nfences, fence **fencearray)

So each mm block has an array of fences. Each time a new fence is added
to the array, you call back to reduce it: the callback removes every
fence that can be removed from the array. It might even remove all of
them if they are signaled.
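
A minimal userspace sketch of what such a reduce step could do, assuming
that fences from the same ring are ordered by an increasing sequence
number (wraparound is ignored). The struct sketch_fence type and the
sketch_reduce_fences() name are hypothetical, a real driver would use
its own fence type and its own rules for what can be dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fence model, not a kernel structure. */
struct sketch_fence {
	unsigned ring;      /* ring that signals this fence */
	unsigned long seq;  /* per-ring sequence number, increasing */
	bool signaled;
};

/*
 * Compact the array in place: drop signaled fences and keep only the
 * newest unsignaled fence for each ring. *nfences is updated to the
 * number of fences left, which may be zero.
 */
static void sketch_reduce_fences(unsigned *nfences,
				 struct sketch_fence **fencearray)
{
	unsigned out = 0;

	assert(nfences != NULL);
	assert(fencearray != NULL || *nfences == 0);

	for (unsigned i = 0; i < *nfences; i++) {
		struct sketch_fence *f = fencearray[i];
		unsigned j;

		if (f->signaled)
			continue;	/* nothing left to wait for */

		/* Look for a fence from the same ring that we already kept. */
		for (j = 0; j < out; j++) {
			if (fencearray[j]->ring == f->ring) {
				if (f->seq > fencearray[j]->seq)
					fencearray[j] = f;
				break;
			}
		}
		if (j == out)
			fencearray[out++] = f;	/* first fence seen for this ring */
	}
	*nfences = out;
}
```

With this rule the array never holds more than one fence per ring, and
it becomes empty once everything has signaled.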

In the ttm bo move callback you provide the list of mm blocks (each
having its array of fences) and the move callback is responsible for
using whatever mechanism it wants to properly schedule and synchronize
the move.

One thing I am not sure about is whether we should merge free mm blocks
and merge/reduce their fence arrays, or provide a list of mm blocks to
the move callback. I think there is a tradeoff here: you probably want
to merge small mm blocks up to a certain point, but you don't want to
merge so much that any allocation will have to wait on a zillion fences.

In the memcpy fallback you just wait on each fence in the array.

This moves all the synchronization into the driver and I think it's the
easiest way to do things.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]
  2012-11-30 18:07                             ` Jerome Glisse
@ 2012-11-30 18:31                               ` Thomas Hellstrom
  2012-11-30 19:25                                 ` Jerome Glisse
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-30 18:31 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Jerome Glisse, dri-devel

On 11/30/2012 07:07 PM, Jerome Glisse wrote:
> On Fri, Nov 30, 2012 at 06:43:36PM +0100, Thomas Hellstrom wrote:
>> On 11/30/2012 06:18 PM, Jerome Glisse wrote:
>>> On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
>>>> On 11/30/2012 05:30 PM, Jerome Glisse wrote:
>>>>> On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas@shipmail.org>
>>>>> wrote:
>>>>>> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>>>>>> What I tried to point out was that the synchronization shouldn't be
>>>>>>> needed, because the CPU shouldn't do anything with the contents of
>>>>>>> evicted buffers. The GPU moves the buffers, not the CPU. What does the
>>>>>>> CPU do besides updating some kernel structures?
>>>>>>>
>>>>>>> Also, buffer deletion is something where you don't need to wait for
>>>>>>> the buffer to become idle if you know the memory area won't be
>>>>>>> mapped by the CPU, ever. The memory can be reclaimed right away. It
>>>>>>> would be the GPU to move new data in and once that happens, the old
>>>>>>> buffer will be trivially idle, because single-ring GPUs execute
>>>>>>> commands in order.
>>>>>>>
>>>>>>> Marek
>>>>>> Actually asynchronous eviction / deletion is something I have been
>>>>>> prototyping for a while but never gotten around to implement in TTM:
>>>>>>
>>>>>> There are a few minor caveats:
>>>>>>
>>>>>> With buffer deletion, what you say is true for fixed memory, but not for
>>>>>> TT
>>>>>> memory where pages are reclaimed by the system after buffer destruction.
>>>>>> That means that we don't have to wait for idle to free GPU space, but we
>>>>>> need to wait before pages are handed back to the system.
>>>>>>
>>>>>> Swapout needs to access the contents of evicted buffers, but
>>>>>> synchronizing
>>>>>> doesn't need to happen until just before swapout.
>>>>>>
>>>>>> Multi-ring - CPU support: If another ring / engine or the CPU is about to
>>>>>> move in buffer contents to VRAM or a GPU aperture that was previously
>>>>>> evicted by another ring, it needs to sync with that eviction, but doesn't
>>>>>> know what buffer or even which buffers occupied the space previously.
>>>>>> Trivially one can attach a sync object to the memory type manager that
>>>>>> represents the last eviction from that memory type, and *any* engine (CPU
>>>>>> or
>>>>>> GPU) that moves buffer contents in needs to order that movement with
>>>>>> respect
>>>>>> to that fence. As you say, with a single ring and no CPU fallbacks, that
>>>>>> ordering is a no-op, but any common (non-driver based) implementation
>>>>>> needs
>>>>>> to support this.
>>>>>>
>>>>>> A single fence attached to the memory type manager is the simplest
>>>>>> solution,
>>>>>> but a solution with a fence for each free region in the free list is also
>>>>>> possible. Then TTM needs a driver callback to be able order fences w r t
>>>>>> echother.
>>>>>>
>>>>>> /Thomas
>>>>>>
>>>>> Radeon already handle multi-ring and ttm interaction with what we call
>>>>> semaphore. Semaphore are created to synchronize with fence accross
>>>>> different ring. I think the easiest solution is to just remove the bo
>>>>> wait in ttm and let driver handle this.
>>>> The wait can be removed, but only conditioned on a driver flag that says it
>>>> supports unsynchronous buffer moves.
>>>>
>>>> The multi-ring case I'm talking about is:
>>>>
>>>> Ring 1 evicts buffer A, emits fence 0
>>>> Ring 2 evicts buffer B, emits fence 1
>>>> ..Other eviction takes place by various rings, perhaps including ring 1 and
>>>> ring 2.
>>>> Ring 3 moves buffer C into the space which happens bo be the union of the
>>>> space prevously occupied buffer A and buffer B.
>>>>
>>>> Question is: which fence do you want to order this move with?
>>>> The answer is whichever of fence 0 and 1 signals last.
>>>>
>>>> I think it's a reasonable thing for TTM to keep track of this, but in order
>>>> to do so it needs a driver callback that
>>>> can order two fences, and can order a job in the current ring w r t a fence.
>>>> In radeon's case that driver callback
>>>> would probably insert a barrier / semaphore. In the case of simpler hardware
>>>> it would wait on one of the fences.
>>>>
>>>> /Thomas
>>>>
>>> I don't think we can order fence easily with a clean api, i would
>>> rather see ttm provide a list of fence to driver and tell to the
>>> driver before moving this object all the fence on this list need to be
>>> completed. I think it's as easy as associating fence with drm_mm (well
>>> nouveau as its own mm stuff) but idea would basicly be that fence are
>>> both associated with bo and with mm object so you know when a segment
>>> of memory is idle/available for use.
>>>
>>> Cheers,
>>> Jerome
>>
>> Hmm. Agreed that would save a lot of barriers.
>>
>> Even if TTM tracks fences by free mm regions or a single fence for
>> the whole memory type, it's a simple fact that fences from the same
>> ring are trivially ordered, which means such a list should contain at
>> most as many fences as there are rings.
> Yes, one function callback is needed to know which fence is necessary,
> also ttm needs to know the number of rings (note that i think newer
> hw will have somethings like 1024 rings or even more, even today hw
> might have as many as i think nvidia channel is pretty much what i
> define to be a ring).
>
> But i think most case will be few fence accross few rings. Like 1
> ring is the dma ring and then you have a ring for one of the GL
> context that using the memory and another ring for the new context
> that want to use the memory.
>
>> So, whatever approach is chosen, TTM needs to be able to determine
>> that trivial ordering, and I think the upcoming cross-device fencing
>> work will face the exact same problem.
>>
>> My proposed ordering API would look something like
>>
>> struct fence *order_fences(struct fence *fence_a, struct fence
>> *fence_b, bool trivial_order, bool interruptible, bool no_wait_gpu)
>>
>> Returns which of the fences @fence_a and @fence_b that when
>> signaled, guarantees that also the other
>> fence has signaled. If @quick_order is true, and the driver cannot
>> trivially order the fences, it may return ERR_PTR(-EAGAIN),
>> if @interruptible is true, any wait should be performed
>> interruptibly and if no_wait_gpu is true, the function is not
>> allowed to wait on gpu but should issue ERR_PTR(-EBUSY) if it needs
>> to do so to order fences.
>>
>> (Hardware without semaphores can't order fences without waiting on them).
>>
>> The list approach you suggest would use @trivial_order = true,
>> Single fence approach would use @trivial_order = false.
>>
>> And a first simple implementation in TTM would perhaps use your list
>> approach with a single list for the whole memory type.
>>
>> /Thomas
> I would rather add a callback like :
>
> ttm_reduce_fences(unsigned *nfences, fence **fencearray)

I don't agree here. I think the fence order function is more versatile
and a good abstraction that can be applied in a number of cases to this
problem. Anyway, we should sync this with Maarten and his fence work.
The same problem applies to attaching shared fences to a bo.

>
> In the ttm bo move callback you provide the list list of mm block (each
> having its array of fence) and the move callback is responsible to
> use what ever mecanism it wants to properly schedule and synchronize the
> move.

Agreed.

>
> One thing i am not sure is should we merge free mm block and merge/reduce
> their fence array or should we provide a list of mm block to the move
> callback. I think here there is tradeoff you probably want to merge small
> mm block up to a certain point but you don't want to merge so much that
> any allocation will have to wait on a zillions fences.

As mentioned previously we can choose the complexity here, but the simplest
approach would be to have a single list for the whole manager.

I think if we don't merge mm blocks immediately when freed, we're going
to use up a lot of resources.


/Thomas




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]
  2012-11-30 18:31                               ` Thomas Hellstrom
@ 2012-11-30 19:25                                 ` Jerome Glisse
  2012-11-30 20:35                                   ` Thomas Hellstrom
  0 siblings, 1 reply; 34+ messages in thread
From: Jerome Glisse @ 2012-11-30 19:25 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Jerome Glisse, dri-devel

On Fri, Nov 30, 2012 at 07:31:04PM +0100, Thomas Hellstrom wrote:
> On 11/30/2012 07:07 PM, Jerome Glisse wrote:
> >On Fri, Nov 30, 2012 at 06:43:36PM +0100, Thomas Hellstrom wrote:
> >>On 11/30/2012 06:18 PM, Jerome Glisse wrote:
> >>>On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> >>>>On 11/30/2012 05:30 PM, Jerome Glisse wrote:
> >>>>>On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas@shipmail.org>
> >>>>>wrote:
> >>>>>>On 11/29/2012 10:58 PM, Marek Olšák wrote:
> >>>>>>>What I tried to point out was that the synchronization shouldn't be
> >>>>>>>needed, because the CPU shouldn't do anything with the contents of
> >>>>>>>evicted buffers. The GPU moves the buffers, not the CPU. What does the
> >>>>>>>CPU do besides updating some kernel structures?
> >>>>>>>
> >>>>>>>Also, buffer deletion is something where you don't need to wait for
> >>>>>>>the buffer to become idle if you know the memory area won't be
> >>>>>>>mapped by the CPU, ever. The memory can be reclaimed right away. It
> >>>>>>>would be the GPU to move new data in and once that happens, the old
> >>>>>>>buffer will be trivially idle, because single-ring GPUs execute
> >>>>>>>commands in order.
> >>>>>>>
> >>>>>>>Marek
> >>>>>>Actually asynchronous eviction / deletion is something I have been
> >>>>>>prototyping for a while but never gotten around to implement in TTM:
> >>>>>>
> >>>>>>There are a few minor caveats:
> >>>>>>
> >>>>>>With buffer deletion, what you say is true for fixed memory, but not for
> >>>>>>TT
> >>>>>>memory where pages are reclaimed by the system after buffer destruction.
> >>>>>>That means that we don't have to wait for idle to free GPU space, but we
> >>>>>>need to wait before pages are handed back to the system.
> >>>>>>
> >>>>>>Swapout needs to access the contents of evicted buffers, but
> >>>>>>synchronizing
> >>>>>>doesn't need to happen until just before swapout.
> >>>>>>
> >>>>>>Multi-ring - CPU support: If another ring / engine or the CPU is about to
> >>>>>>move in buffer contents to VRAM or a GPU aperture that was previously
> >>>>>>evicted by another ring, it needs to sync with that eviction, but doesn't
> >>>>>>know what buffer or even which buffers occupied the space previously.
> >>>>>>Trivially one can attach a sync object to the memory type manager that
> >>>>>>represents the last eviction from that memory type, and *any* engine (CPU
> >>>>>>or
> >>>>>>GPU) that moves buffer contents in needs to order that movement with
> >>>>>>respect
> >>>>>>to that fence. As you say, with a single ring and no CPU fallbacks, that
> >>>>>>ordering is a no-op, but any common (non-driver based) implementation
> >>>>>>needs
> >>>>>>to support this.
> >>>>>>
> >>>>>>A single fence attached to the memory type manager is the simplest
> >>>>>>solution,
> >>>>>>but a solution with a fence for each free region in the free list is also
> >>>>>>possible. Then TTM needs a driver callback to be able order fences w r t
> >>>>>>echother.
> >>>>>>
> >>>>>>/Thomas
> >>>>>>
> >>>>>Radeon already handle multi-ring and ttm interaction with what we call
> >>>>>semaphore. Semaphore are created to synchronize with fence accross
> >>>>>different ring. I think the easiest solution is to just remove the bo
> >>>>>wait in ttm and let driver handle this.
> >>>>The wait can be removed, but only conditioned on a driver flag that says it
> >>>>supports unsynchronous buffer moves.
> >>>>
> >>>>The multi-ring case I'm talking about is:
> >>>>
> >>>>Ring 1 evicts buffer A, emits fence 0
> >>>>Ring 2 evicts buffer B, emits fence 1
> >>>>..Other eviction takes place by various rings, perhaps including ring 1 and
> >>>>ring 2.
> >>>>Ring 3 moves buffer C into the space which happens bo be the union of the
> >>>>space prevously occupied buffer A and buffer B.
> >>>>
> >>>>Question is: which fence do you want to order this move with?
> >>>>The answer is whichever of fence 0 and 1 signals last.
> >>>>
> >>>>I think it's a reasonable thing for TTM to keep track of this, but in order
> >>>>to do so it needs a driver callback that
> >>>>can order two fences, and can order a job in the current ring w r t a fence.
> >>>>In radeon's case that driver callback
> >>>>would probably insert a barrier / semaphore. In the case of simpler hardware
> >>>>it would wait on one of the fences.
> >>>>
> >>>>/Thomas
> >>>>
> >>>I don't think we can order fence easily with a clean api, i would
> >>>rather see ttm provide a list of fence to driver and tell to the
> >>>driver before moving this object all the fence on this list need to be
> >>>completed. I think it's as easy as associating fence with drm_mm (well
> >>>nouveau as its own mm stuff) but idea would basicly be that fence are
> >>>both associated with bo and with mm object so you know when a segment
> >>>of memory is idle/available for use.
> >>>
> >>>Cheers,
> >>>Jerome
> >>
> >>Hmm. Agreed that would save a lot of barriers.
> >>
> >>Even if TTM tracks fences by free mm regions or a single fence for
> >>the whole memory type, it's a simple fact that fences from the same
> >>ring are trivially ordered, which means such a list should contain at
> >>most as many fences as there are rings.
> >Yes, one function callback is needed to know which fence is necessary,
> >also ttm needs to know the number of rings (note that i think newer
> >hw will have somethings like 1024 rings or even more, even today hw
> >might have as many as i think nvidia channel is pretty much what i
> >define to be a ring).
> >
> >But i think most case will be few fence accross few rings. Like 1
> >ring is the dma ring and then you have a ring for one of the GL
> >context that using the memory and another ring for the new context
> >that want to use the memory.
> >
> >>So, whatever approach is chosen, TTM needs to be able to determine
> >>that trivial ordering, and I think the upcoming cross-device fencing
> >>work will face the exact same problem.
> >>
> >>My proposed ordering API would look something like
> >>
> >>struct fence *order_fences(struct fence *fence_a, struct fence
> >>*fence_b, bool trivial_order, bool interruptible, bool no_wait_gpu)
> >>
> >>Returns which of the fences @fence_a and @fence_b that when
> >>signaled, guarantees that also the other
> >>fence has signaled. If @quick_order is true, and the driver cannot
> >>trivially order the fences, it may return ERR_PTR(-EAGAIN),
> >>if @interruptible is true, any wait should be performed
> >>interruptibly and if no_wait_gpu is true, the function is not
> >>allowed to wait on gpu but should issue ERR_PTR(-EBUSY) if it needs
> >>to do so to order fences.
> >>
> >>(Hardware without semaphores can't order fences without waiting on them).
> >>
> >>The list approach you suggest would use @trivial_order = true,
> >>Single fence approach would use @trivial_order = false.
> >>
> >>And a first simple implementation in TTM would perhaps use your list
> >>approach with a single list for the whole memory type.
> >>
> >>/Thomas
> >I would rather add a callback like :
> >
> >ttm_reduce_fences(unsigned *nfences, fence **fencearray)
> 
> I don't agree here. I think the fence order function is more
> versatile and a good abstraction that can be
> applied in a number of cases to this problem. Anyway we should sync
> this with Maarten and his
> fence work. The same problem applies to attaching shared fences to a bo.

Radeon already handles the bo case on multi-ring and I think it should
be left to the driver to do what is necessary there.

I don't think more versatility is bad here; in fact I am pretty sure
that no hw can give fence ordering, and I also think that if drivers
have to track multi-ring fence ordering it will just waste resources
for no good reason. To track that in radeon I would need to keep a list
of semaphores and know which semaphores ensure synchronization between
which rings and, for each, which fences are concerned. Just thinking
about it, it would be a messy graph with tons of nodes.

Of course, here I am thinking in terms of newer GPUs with tons of rings.
GPUs with one ring, which are a vanishing category as newer OpenCL
requires more rings, are easy to handle, but I really don't think we
should design for those.

I think we will have to agree to disagree here, but I see a fence array
as a perfect solution that is flexible enough and uses the least
resources, both in terms of cpu usage and memory (assuming a proper
implementation).

> 
> >
> >In the ttm bo move callback you provide the list list of mm block (each
> >having its array of fence) and the move callback is responsible to
> >use what ever mecanism it wants to properly schedule and synchronize the
> >move.
> 
> Agreed.
> 
> >
> >One thing i am not sure is should we merge free mm block and merge/reduce
> >their fence array or should we provide a list of mm block to the move
> >callback. I think here there is tradeoff you probably want to merge small
> >mm block up to a certain point but you don't want to merge so much that
> >any allocation will have to wait on a zillions fences.
> 
> As mentioned previously we can choose the complexity here, but the simplest
> approach would be to have a single list for the whole manager.
> 
> I think if we don't merge mm blocks immediately when freed, we're
> going to use
> up a lot of resources.
> 

I think there should be a granularity here, something like: don't merge
blocks over 16M, or don't merge blocks if they would have more than N
fences once merged together. Of course it means a special allocator, but
as my work on radeon memory placement shows, I think that the whole vram
business (which is really where the multi-fence will be used the most)
needs a worker thread that schedules work every now and then. This worker
thread could also merge mm blocks when fences complete ...
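
A small sketch of that merge rule, just to illustrate the idea. The
struct sketch_mm_block type, the sketch_can_merge() helper and the
value picked for N are hypothetical, and the 16M limit is read here as
a cap on the size of the merged block, which is one possible
interpretation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_MERGED_SIZE   (16UL << 20)	/* the 16M figure above */
#define SKETCH_MAX_MERGED_FENCES 4		/* "N", arbitrary value */

/* Hypothetical free block: a range of vram plus its pending fences. */
struct sketch_mm_block {
	unsigned long start;	/* offset of the block */
	unsigned long size;	/* size in bytes */
	unsigned nfences;	/* unsignaled fences still guarding it */
};

/*
 * Decide whether free block b, located right after free block a, may
 * be merged into a. The fence count is an upper bound, since reducing
 * the merged fence array could drop some entries.
 */
static bool sketch_can_merge(const struct sketch_mm_block *a,
			     const struct sketch_mm_block *b)
{
	assert(a != NULL && b != NULL);

	if (a->start + a->size != b->start)
		return false;	/* not adjacent */
	if (a->size + b->size > SKETCH_MAX_MERGED_SIZE)
		return false;	/* merged block would be too big */
	if (a->nfences + b->nfences > SKETCH_MAX_MERGED_FENCES)
		return false;	/* too many fences to wait on */
	return true;
}
```

The worker thread mentioned above could rerun such a check whenever a
fence completes, since a lower fence count may allow a merge that was
refused earlier.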

But I really don't think there should be a list for the whole manager.
Think about the case of a very long-running cl job that is still running
but already has its eviction scheduled and its mm vram node marked as
free with a fence. If we have a list per manager, then vram becomes
unavailable for as long as this cl job runs. I am thinking here of newer
gpus that can partition their compute units (the CL 1.2 spec mandates
that iirc), for instance between cl and gl, so the gpu can still run
other jobs while this cl thing is going on.

Whereas if we only merge blocks up to a certain point, we can have other
work done on the gpu using this memory manager.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]
  2012-11-30 19:25                                 ` Jerome Glisse
@ 2012-11-30 20:35                                   ` Thomas Hellstrom
  2012-11-30 21:07                                     ` Jerome Glisse
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-30 20:35 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Jerome Glisse, dri-devel

On 11/30/2012 08:25 PM, Jerome Glisse wrote:
> On Fri, Nov 30, 2012 at 07:31:04PM +0100, Thomas Hellstrom wrote:
>> On 11/30/2012 07:07 PM, Jerome Glisse wrote:
>>> On Fri, Nov 30, 2012 at 06:43:36PM +0100, Thomas Hellstrom wrote:
>>>> On 11/30/2012 06:18 PM, Jerome Glisse wrote:
>>>>> On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
>>>>>> On 11/30/2012 05:30 PM, Jerome Glisse wrote:
>>>>>>> On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas@shipmail.org>
>>>>>>> wrote:
>>>>>>>> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>>>>>>>> What I tried to point out was that the synchronization shouldn't be
>>>>>>>>> needed, because the CPU shouldn't do anything with the contents of
>>>>>>>>> evicted buffers. The GPU moves the buffers, not the CPU. What does the
>>>>>>>>> CPU do besides updating some kernel structures?
>>>>>>>>>
>>>>>>>>> Also, buffer deletion is something where you don't need to wait for
>>>>>>>>> the buffer to become idle if you know the memory area won't be
>>>>>>>>> mapped by the CPU, ever. The memory can be reclaimed right away. It
>>>>>>>>> would be the GPU to move new data in and once that happens, the old
>>>>>>>>> buffer will be trivially idle, because single-ring GPUs execute
>>>>>>>>> commands in order.
>>>>>>>>>
>>>>>>>>> Marek
>>>>>>>> Actually asynchronous eviction / deletion is something I have been
>>>>>>>> prototyping for a while but never gotten around to implement in TTM:
>>>>>>>>
>>>>>>>> There are a few minor caveats:
>>>>>>>>
>>>>>>>> With buffer deletion, what you say is true for fixed memory, but not for
>>>>>>>> TT
>>>>>>>> memory where pages are reclaimed by the system after buffer destruction.
>>>>>>>> That means that we don't have to wait for idle to free GPU space, but we
>>>>>>>> need to wait before pages are handed back to the system.
>>>>>>>>
>>>>>>>> Swapout needs to access the contents of evicted buffers, but
>>>>>>>> synchronizing
>>>>>>>> doesn't need to happen until just before swapout.
>>>>>>>>
>>>>>>>> Multi-ring - CPU support: If another ring / engine or the CPU is about to
>>>>>>>> move in buffer contents to VRAM or a GPU aperture that was previously
>>>>>>>> evicted by another ring, it needs to sync with that eviction, but doesn't
>>>>>>>> know what buffer or even which buffers occupied the space previously.
>>>>>>>> Trivially one can attach a sync object to the memory type manager that
>>>>>>>> represents the last eviction from that memory type, and *any* engine (CPU
>>>>>>>> or
>>>>>>>> GPU) that moves buffer contents in needs to order that movement with
>>>>>>>> respect
>>>>>>>> to that fence. As you say, with a single ring and no CPU fallbacks, that
>>>>>>>> ordering is a no-op, but any common (non-driver based) implementation
>>>>>>>> needs
>>>>>>>> to support this.
>>>>>>>>
>>>>>>>> A single fence attached to the memory type manager is the simplest
>>>>>>>> solution,
>>>>>>>> but a solution with a fence for each free region in the free list is also
>>>>>>>> possible. Then TTM needs a driver callback to be able order fences w r t
>>>>>>>> echother.
>>>>>>>>
>>>>>>>> /Thomas
>>>>>>>>
>>>>>>> Radeon already handle multi-ring and ttm interaction with what we call
>>>>>>> semaphore. Semaphore are created to synchronize with fence accross
>>>>>>> different ring. I think the easiest solution is to just remove the bo
>>>>>>> wait in ttm and let driver handle this.
>>>>>> The wait can be removed, but only conditioned on a driver flag that says it
>>>>>> supports unsynchronous buffer moves.
>>>>>>
>>>>>> The multi-ring case I'm talking about is:
>>>>>>
>>>>>> Ring 1 evicts buffer A, emits fence 0
>>>>>> Ring 2 evicts buffer B, emits fence 1
>>>>>> ..Other eviction takes place by various rings, perhaps including ring 1 and
>>>>>> ring 2.
>>>>>> Ring 3 moves buffer C into the space which happens bo be the union of the
>>>>>> space prevously occupied buffer A and buffer B.
>>>>>>
>>>>>> Question is: which fence do you want to order this move with?
>>>>>> The answer is whichever of fence 0 and 1 signals last.
>>>>>>
>>>>>> I think it's a reasonable thing for TTM to keep track of this, but in order
>>>>>> to do so it needs a driver callback that
>>>>>> can order two fences, and can order a job in the current ring w r t a fence.
>>>>>> In radeon's case that driver callback
>>>>>> would probably insert a barrier / semaphore. In the case of simpler hardware
>>>>>> it would wait on one of the fences.
>>>>>>
>>>>>> /Thomas
>>>>>>
>>>>> I don't think we can order fences easily with a clean api; i would
>>>>> rather see ttm provide a list of fences to the driver and tell the
>>>>> driver that before moving this object all the fences on this list need to be
>>>>> completed. I think it's as easy as associating fences with drm_mm (well,
>>>>> nouveau has its own mm stuff) but the idea would basically be that fences are
>>>>> associated both with the bo and with the mm object, so you know when a segment
>>>>> of memory is idle/available for use.
>>>>>
>>>>> Cheers,
>>>>> Jerome
>>>> Hmm. Agreed that would save a lot of barriers.
>>>>
>>>> Even if TTM tracks fences by free mm regions or a single fence for
>>>> the whole memory type, it's a simple fact that fences from the same
>>>> ring are trivially ordered, which means such a list should contain at
>>>> most as many fences as there are rings.
>>> Yes, one function callback is needed to know which fence is necessary;
>>> also ttm needs to know the number of rings (note that i think newer
>>> hw will have something like 1024 rings or even more; even today hw
>>> might have as many, as i think an nvidia channel is pretty much what i
>>> define to be a ring).
>>>
>>> But i think most cases will be a few fences across a few rings. Like 1
>>> ring is the dma ring and then you have a ring for one of the GL
>>> contexts that is using the memory and another ring for the new context
>>> that wants to use the memory.
>>>
>>>> So, whatever approach is chosen, TTM needs to be able to determine
>>>> that trivial ordering, and I think the upcoming cross-device fencing
>>>> work will face the exact same problem.
>>>>
>>>> My proposed ordering API would look something like
>>>>
>>>> struct fence *order_fences(struct fence *fence_a, struct fence
>>>> *fence_b, bool trivial_order, bool interruptible, bool no_wait_gpu)
>>>>
>>>> Returns whichever of the fences @fence_a and @fence_b guarantees,
>>>> once signaled, that the other fence has also signaled.
>>>> If @trivial_order is true, and the driver cannot
>>>> trivially order the fences, it may return ERR_PTR(-EAGAIN).
>>>> If @interruptible is true, any wait should be performed
>>>> interruptibly, and if @no_wait_gpu is true, the function is not
>>>> allowed to wait on the gpu but should return ERR_PTR(-EBUSY) if it needs
>>>> to do so to order the fences.
>>>>
>>>> (Hardware without semaphores can't order fences without waiting on them).
>>>>
>>>> The list approach you suggest would use @trivial_order = true,
>>>> Single fence approach would use @trivial_order = false.
>>>>
>>>> And a first simple implementation in TTM would perhaps use your list
>>>> approach with a single list for the whole memory type.
>>>>
>>>> /Thomas
>>> I would rather add a callback like :
>>>
>>> ttm_reduce_fences(unsigned *nfences, fence **fencearray)
>> I don't agree here. I think the fence order function is more
>> versatile and a good abstraction that can be
>> applied in a number of cases to this problem. Anyway we should sync
>> this with Maarten and his
>> fence work. The same problem applies to attaching shared fences to a bo.
> Radeon already handles the bo case on multi-ring and i think it should
> be left to the driver to do what is necessary there.
>
> I don't think more versatility is bad here; in fact i am pretty sure
> that no hw can give fence ordering, and i also think that if the driver has
> to track multi-ring fence ordering it will just waste resources for no
> good reason. To track that in radeon i would need to keep a list
> of semaphores and know which semaphores ensure synchronization between
> which rings and, for each, which fences are concerned. Just thinking about it,
> it would be a messy graph with tons of nodes.
>
> Of course here i am thinking in terms of newer GPUs with tons of rings.
> GPUs with one ring, which are a vanishing category as newer opencl
> requires more rings, are easy to handle but i really don't think we
> should design for those.
The biggest problem I see with ttm_reduce_fences() is that it seems to
have high complexity, since it doesn't know which fence is the new one.
And if it did, it would only be a multi-fence version of
order_fences(trivial=true).

> I think here we will have to agree to disagree, but i see a fence array
> as a perfect solution that is flexible enough and uses the least resources
> both in terms of cpu usage and memory (assuming a proper implementation).
I thought it would always require nrings*8 bytes on each memory block?

>
> I think there should be a granularity here, something like don't merge blocks
> over 16M or don't merge blocks if they would have more than N fences once merged
> together. Of course it means a special allocator, but as my work on radeon
> memory placement shows, i think that the whole vram business (which is really
> where the multi-fence will be used the most) needs a worker thread that
> schedules work every now and then; this worker thread could also merge mm
> when fences complete ...
>
> But i really don't think there should be a list for the whole manager. Think
> about the case of a very long running cl job that is still running but already
> has its eviction scheduled and its mm vram node marked as free with a fence.
> If we have a list per manager then vram becomes unavailable for as long as
> this cl job runs. I am thinking here of newer cpu that can partition their
> compute units (the CL 1.2 spec mandates that iirc), for instance between cl and gl, so the
> gpu can still run other jobs while this cl thing is going on.

Yeah, I'm not saying this is an efficient solution. I'm saying it's the
simplest solution. We can make a data structure arbitrarily big out of
this, or perhaps provide a selection of algorithms to choose from.

/Thomas



_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]
  2012-11-30 20:35                                   ` Thomas Hellstrom
@ 2012-11-30 21:07                                     ` Jerome Glisse
  2012-11-30 21:36                                       ` Thomas Hellstrom
  0 siblings, 1 reply; 34+ messages in thread
From: Jerome Glisse @ 2012-11-30 21:07 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Jerome Glisse, dri-devel

On Fri, Nov 30, 2012 at 09:35:49PM +0100, Thomas Hellstrom wrote:
> On 11/30/2012 08:25 PM, Jerome Glisse wrote:
> The biggest problem I see with ttm_reduce_fences() is that it seems
> to have high complexity, since it
> doesn't know which fence is the new one. And if it did, it would
> only be a multi-fence version of
> order_fences(trivial=true).

I am sure all drivers with multi-ring support will store the ring id in
their fence structure; from my pov that's a requirement. So when you get
a list of fences you first go over the fences on the same ring, and so far
all fence implementations use an increasing sequence number for fences. So
reducing becomes as easy as keeping only the most recent fence for each
ring with an active fence. At the same time it could check (assuming
it's a quick operation) whether a fence is already signaled.

For radeon this function would be very small, a simple nested loop
with a couple of tests in the loop.
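
A minimal userspace sketch of the reduction described above, assuming each
fence carries a ring id and a per-ring increasing sequence number. The
names sketch_fence and sketch_reduce_fences are hypothetical and are not
the radeon or TTM structures; sequence number wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fence, for illustration only. */
struct sketch_fence {
	unsigned ring;   /* id of the ring that emitted the fence */
	uint64_t seq;    /* per-ring, increasing sequence number */
	bool signaled;   /* true once the GPU has passed this fence */
};

/*
 * Compact @fences in place so that it holds at most one unsignaled
 * fence per ring: the one with the highest sequence number.
 * On return *@nfences is the new length.
 */
void sketch_reduce_fences(unsigned *nfences, struct sketch_fence **fences)
{
	unsigned kept = 0;

	assert(nfences && fences);

	for (unsigned i = 0; i < *nfences; i++) {
		struct sketch_fence *f = fences[i];
		unsigned j;

		if (f->signaled)
			continue;	/* nothing left to wait for */

		/* Look for a fence already kept for the same ring. */
		for (j = 0; j < kept; j++) {
			if (fences[j]->ring != f->ring)
				continue;
			if (f->seq > fences[j]->seq)
				fences[j] = f;	/* later fence covers the earlier one */
			break;
		}
		if (j == kept)
			fences[kept++] = f;	/* first fence seen for this ring */
	}
	*nfences = kept;
}
```

The inner loop only walks the fences already kept, so its length is bounded
by the number of rings with an active fence rather than by the input size.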
 
> >I think here we will have to agree to disagree but i see a fence array
> >as a perfect solution that is flexible enough and use the less resource
> >both in term of cpu usage and memory (assuming proper implementation).
> I thought it would always require nrings*8 bytes on each memory block?

As i said, i think there won't be much more than 3 or 4 rings on average
for each block of memory. So the idea is probably to allocate for 8 rings
and resize in the rare case where we need more.
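
A rough sketch of such a per-block fence array, with room for 8 fences
inline and a heap fallback for the rare larger case. All names here are
hypothetical, and malloc/free stand in for whatever allocator a kernel
implementation would use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_INLINE_FENCES 8

struct sketch_fence { unsigned ring; };	/* placeholder fence type */

struct sketch_fence_array {
	unsigned count;
	unsigned capacity;
	struct sketch_fence **slots;	/* points at inline_slots until grown */
	struct sketch_fence *inline_slots[SKETCH_INLINE_FENCES];
};

void sketch_fence_array_init(struct sketch_fence_array *fa)
{
	assert(fa);
	fa->count = 0;
	fa->capacity = SKETCH_INLINE_FENCES;
	fa->slots = fa->inline_slots;
}

/* Append @f, doubling the storage when full. Returns 0, or -1 on OOM. */
int sketch_fence_array_add(struct sketch_fence_array *fa,
			   struct sketch_fence *f)
{
	if (fa->count == fa->capacity) {
		unsigned new_cap = fa->capacity * 2;
		struct sketch_fence **grown = malloc(new_cap * sizeof(*grown));

		if (!grown)
			return -1;
		memcpy(grown, fa->slots, fa->count * sizeof(*grown));
		if (fa->slots != fa->inline_slots)
			free(fa->slots);
		fa->slots = grown;
		fa->capacity = new_cap;
	}
	fa->slots[fa->count++] = f;
	return 0;
}

/* Release any heap storage and reset to the empty inline state. */
void sketch_fence_array_fini(struct sketch_fence_array *fa)
{
	if (fa->slots != fa->inline_slots)
		free(fa->slots);
	sketch_fence_array_init(fa);
}
```

Because slots may point into the structure itself, an array in the inline
state must not be copied by plain struct assignment.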

It sounds very unlikely to have more than 3 or 4 rings with an active
fence at any point in time. You would have a dma ring, a 3d ring for one
client, maybe a 3d ring for another client, and a video decoder ring. The
driver can also do optimizations like virtualizing fences for each
partition of the gpu; this would drastically reduce the number of fences,
e.g. 1 for dma, 1 for each partition of the gpu, 1 for the video decoder,
1 for display. And there can't be that many partitions (i think on radeon
you can only partition per CU, which is something like 4 or 8).

In other words, when you have n rings that do work on m partitions of
the gpu you only need m fence categories, as you know work on each
partition will complete in order.

> >
> >I think here there should be a granularity something like don't merge block
> >over 16M or don't merge block if they have more than N fences if merged
> >together. Of course it means a special allocator but as my work on radeon
> >memory placement shows i think that the whole vram business (which is really
> >where the multi-fence will be used the most) needs a worker thread that
> >schedule works every now and then, this worker thread could also merge mm
> >when fence complete ...
> >
> >But i really don't think there should be a list for the whole manager, think
> >about the case of a very long cl running job that is still running but already
> >have it's eviction scheduled and it mm vram node marked as free with a fence.
> >If we have a list per manager than vram becomes unavailable for as long as
> >this cl job runs. I am thinking here to newer cpu that can partition there
> >compute unit (CL 1.2 spec mandate that iirc) for instance btw cl and gl so the
> >gpu still can run other job while this cl things is going on.
> 
> Yeah, I'm not saying this is an efficient solution. I'm saying it's
> the simplest solution. We can make
> a data structure arbitrarily big out of this, or perhaps provide a
> selection of algorithms to
> choose from.
> 
> /Thomas

Customizing drm_mm to handle allocation from several adjacent blocks
doesn't sound too complex; we could introduce a virtual mm block that
groups all adjacent blocks and keeps an array of fences for each of the
subblocks with an active fence. But yes, this could be
done as a second step.
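
As an illustration only, here is one way the virtual block idea could be
sketched: a run of adjacent free subblocks, each with an optional eviction
fence, and a helper that gathers the fences an allocation spanning several
subblocks would have to wait for. The types and the function are
hypothetical and do not correspond to the drm_mm API; one fence per
subblock is assumed for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_fence { unsigned ring; };	/* placeholder fence type */

/* One free subblock inside a hypothetical virtual mm block. */
struct sketch_subblock {
	uint64_t start;
	uint64_t size;
	const struct sketch_fence *fence;	/* eviction fence, NULL if idle */
};

/*
 * Gather the fences guarding [start, start + size) from the @nsub
 * subblocks in @sub. At most @max fences are written to @out; the
 * return value is the number of fences found, so a result larger
 * than @max tells the caller that @out was too small.
 */
unsigned sketch_collect_fences(const struct sketch_subblock *sub,
			       unsigned nsub, uint64_t start, uint64_t size,
			       const struct sketch_fence **out, unsigned max)
{
	unsigned n = 0;

	assert(sub && out);

	for (unsigned i = 0; i < nsub; i++) {
		uint64_t sub_end = sub[i].start + sub[i].size;

		/* Skip subblocks that do not overlap the requested range. */
		if (sub_end <= start || sub[i].start >= start + size)
			continue;
		if (!sub[i].fence)
			continue;	/* this part of the range is already idle */
		if (n < max)
			out[n] = sub[i].fence;
		n++;
	}
	return n;
}
```

The collected array could then be handed to a per-ring reduction before
the move is scheduled.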

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]
  2012-11-30 21:07                                     ` Jerome Glisse
@ 2012-11-30 21:36                                       ` Thomas Hellstrom
  2012-11-30 22:02                                         ` Jerome Glisse
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Hellstrom @ 2012-11-30 21:36 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Jerome Glisse, dri-devel

On 11/30/2012 10:07 PM, Jerome Glisse wrote:
> On Fri, Nov 30, 2012 at 09:35:49PM +0100, Thomas Hellstrom wrote:
>> On 11/30/2012 08:25 PM, Jerome Glisse wrote:
>> The biggest problem I see with ttm_reduce_fences() is that it seems
>> to have high complexity, since it
>> doesn't know which fence is the new one. And if it did, it would
>> only be a multi-fence version of
>> order_fences(trivial=true).
> I am sure all driver with multi-ring will store ring id in there fence
> structure, that's from my pov a requirement. So when you get a list of
> fence you first go over fence on the same ring and so far all fence
> implementation use increasing sequence number for fence. So reducing
> becomes as easy as only leaving the most recent fence for each of the
> ring with an active fence. At the same time it could check (assuming
> it's a quick operation) if fence is already signaled or not.
>
> For radeon this function would be very small, a simple imbricated loop
> with couple test in the loop.

What you describe sounds like an O(n²) complexity algorithm, whereas
an algorithm based on order_fences is O(n) complexity.
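
One possible reading of the O(n) figure, sketched under assumptions: if the
list is kept reduced at all times, adding one new fence needs a single pass
in which the new fence is compared with each entry. Here
sketch_order_trivial is a hypothetical stand-in for
order_fences(..., trivial_order = true, ...) that returns NULL where the
proposed API would return ERR_PTR(-EAGAIN); the interruptible and
no_wait_gpu arguments are left out because the trivial path never waits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fence, for illustration only. */
struct sketch_fence {
	unsigned ring;
	uint64_t seq;	/* per-ring, increasing sequence number */
};

/* Return the later of two same-ring fences, or NULL if the rings differ. */
struct sketch_fence *sketch_order_trivial(struct sketch_fence *a,
					  struct sketch_fence *b)
{
	if (a->ring != b->ring)
		return NULL;
	return a->seq >= b->seq ? a : b;
}

/*
 * Add @f to @list, which holds @n mutually unordered fences and has
 * room for one more. Returns the new length. One pass over the list.
 */
unsigned sketch_list_add(struct sketch_fence **list, unsigned n,
			 struct sketch_fence *f)
{
	assert(list && f);

	for (unsigned i = 0; i < n; i++) {
		struct sketch_fence *last = sketch_order_trivial(list[i], f);

		if (!last)
			continue;	/* different rings, cannot order cheaply */
		list[i] = last;		/* keep whichever signals last */
		return n;
	}
	list[n] = f;			/* no entry for this ring yet */
	return n + 1;
}
```

Since the list never holds two fences from the same ring, at most one entry
can be ordered against the new fence, which is why the loop can return early.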

Thanks, Thomas




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]
  2012-11-30 21:36                                       ` Thomas Hellstrom
@ 2012-11-30 22:02                                         ` Jerome Glisse
  0 siblings, 0 replies; 34+ messages in thread
From: Jerome Glisse @ 2012-11-30 22:02 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Jerome Glisse, dri-devel

On Fri, Nov 30, 2012 at 10:36:01PM +0100, Thomas Hellstrom wrote:
> On 11/30/2012 10:07 PM, Jerome Glisse wrote:
> >On Fri, Nov 30, 2012 at 09:35:49PM +0100, Thomas Hellstrom wrote:
> >>On 11/30/2012 08:25 PM, Jerome Glisse wrote:
> >>>On Fri, Nov 30, 2012 at 07:31:04PM +0100, Thomas Hellstrom wrote:
> >>>>On 11/30/2012 07:07 PM, Jerome Glisse wrote:
> >>>>>On Fri, Nov 30, 2012 at 06:43:36PM +0100, Thomas Hellstrom wrote:
> >>>>>>On 11/30/2012 06:18 PM, Jerome Glisse wrote:
> >>>>>>>On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> >>>>>>>>On 11/30/2012 05:30 PM, Jerome Glisse wrote:
> >>>>>>>>>On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas@shipmail.org>
> >>>>>>>>>wrote:
> >>>>>>>>>>On 11/29/2012 10:58 PM, Marek Olšák wrote:
> >>>>>>>>>>>What I tried to point out was that the synchronization shouldn't be
> >>>>>>>>>>>needed, because the CPU shouldn't do anything with the contents of
> >>>>>>>>>>>evicted buffers. The GPU moves the buffers, not the CPU. What does the
> >>>>>>>>>>>CPU do besides updating some kernel structures?
> >>>>>>>>>>>
> >>>>>>>>>>>Also, buffer deletion is something where you don't need to wait for
> >>>>>>>>>>>the buffer to become idle if you know the memory area won't be
> >>>>>>>>>>>mapped by the CPU, ever. The memory can be reclaimed right away. It
> >>>>>>>>>>>would be the GPU to move new data in and once that happens, the old
> >>>>>>>>>>>buffer will be trivially idle, because single-ring GPUs execute
> >>>>>>>>>>>commands in order.
> >>>>>>>>>>>
> >>>>>>>>>>>Marek
> >>>>>>>>>>Actually asynchronous eviction / deletion is something I have been
> >>>>>>>>>>prototyping for a while but never gotten around to implement in TTM:
> >>>>>>>>>>
> >>>>>>>>>>There are a few minor caveats:
> >>>>>>>>>>
> >>>>>>>>>>With buffer deletion, what you say is true for fixed memory, but not
> >>>>>>>>>>for TT memory, where pages are reclaimed by the system after buffer
> >>>>>>>>>>destruction. That means that we don't have to wait for idle to free
> >>>>>>>>>>GPU space, but we need to wait before pages are handed back to the
> >>>>>>>>>>system.
> >>>>>>>>>>
> >>>>>>>>>>Swapout needs to access the contents of evicted buffers, but
> >>>>>>>>>>synchronizing doesn't need to happen until just before swapout.
> >>>>>>>>>>
> >>>>>>>>>>Multi-ring - CPU support: If another ring / engine or the CPU is about
> >>>>>>>>>>to move buffer contents into VRAM or a GPU aperture that was
> >>>>>>>>>>previously evicted by another ring, it needs to sync with that
> >>>>>>>>>>eviction, but doesn't know what buffer or even which buffers occupied
> >>>>>>>>>>the space previously. Trivially one can attach a sync object to the
> >>>>>>>>>>memory type manager that represents the last eviction from that memory
> >>>>>>>>>>type, and *any* engine (CPU or GPU) that moves buffer contents in
> >>>>>>>>>>needs to order that movement with respect to that fence. As you say,
> >>>>>>>>>>with a single ring and no CPU fallbacks, that ordering is a no-op, but
> >>>>>>>>>>any common (non-driver based) implementation needs to support this.
> >>>>>>>>>>
> >>>>>>>>>>A single fence attached to the memory type manager is the simplest
> >>>>>>>>>>solution, but a solution with a fence for each free region in the free
> >>>>>>>>>>list is also possible. Then TTM needs a driver callback to be able to
> >>>>>>>>>>order fences w.r.t. each other.
> >>>>>>>>>>
> >>>>>>>>>>/Thomas
> >>>>>>>>>>
> >>>>>>>>>Radeon already handles multi-ring and ttm interaction with what we call
> >>>>>>>>>semaphores. Semaphores are created to synchronize with fences across
> >>>>>>>>>different rings. I think the easiest solution is to just remove the bo
> >>>>>>>>>wait in ttm and let the driver handle this.
> >>>>>>>>The wait can be removed, but only conditioned on a driver flag that says it
> >>>>>>>>supports asynchronous buffer moves.
> >>>>>>>>
> >>>>>>>>The multi-ring case I'm talking about is:
> >>>>>>>>
> >>>>>>>>Ring 1 evicts buffer A, emits fence 0
> >>>>>>>>Ring 2 evicts buffer B, emits fence 1
> >>>>>>>>..Other evictions take place on various rings, perhaps including ring 1 and
> >>>>>>>>ring 2.
> >>>>>>>>Ring 3 moves buffer C into the space which happens to be the union of the
> >>>>>>>>space previously occupied by buffer A and buffer B.
> >>>>>>>>
> >>>>>>>>Question is: which fence do you want to order this move with?
> >>>>>>>>The answer is whichever of fence 0 and 1 signals last.
> >>>>>>>>
> >>>>>>>>I think it's a reasonable thing for TTM to keep track of this, but in order
> >>>>>>>>to do so it needs a driver callback that can order two fences, and can
> >>>>>>>>order a job in the current ring w.r.t. a fence. In radeon's case that
> >>>>>>>>driver callback would probably insert a barrier / semaphore. In the case
> >>>>>>>>of simpler hardware it would wait on one of the fences.
> >>>>>>>>
> >>>>>>>>/Thomas
> >>>>>>>>
> >>>>>>>I don't think we can order fences easily with a clean api. i would
> >>>>>>>rather see ttm provide a list of fences to the driver and tell the
> >>>>>>>driver that, before moving this object, all the fences on this list need
> >>>>>>>to be completed. I think it's as easy as associating fences with drm_mm
> >>>>>>>(well, nouveau has its own mm stuff), but the idea would basically be
> >>>>>>>that fences are associated both with the bo and with the mm object, so
> >>>>>>>you know when a segment of memory is idle/available for use.
> >>>>>>>
> >>>>>>>Cheers,
> >>>>>>>Jerome
> >>>>>>Hmm. Agreed that would save a lot of barriers.
> >>>>>>
> >>>>>>Even if TTM tracks fences by free mm regions or a single fence for
> >>>>>>the whole memory type, it's a simple fact that fences from the same
> >>>>>>ring are trivially ordered, which means such a list should contain at
> >>>>>>most as many fences as there are rings.
> >>>>>Yes, one function callback is needed to know which fence is necessary,
> >>>>>also ttm needs to know the number of rings (note that i think newer
> >>>>>hw will have something like 1024 rings or even more; even today's hw
> >>>>>might have as many, as i think an nvidia channel is pretty much what i
> >>>>>define to be a ring).
> >>>>>
> >>>>>But i think most cases will be a few fences across a few rings. Like one
> >>>>>ring is the dma ring, and then you have a ring for one of the GL
> >>>>>contexts that is using the memory and another ring for the new context
> >>>>>that wants to use the memory.
> >>>>>
> >>>>>>So, whatever approach is chosen, TTM needs to be able to determine
> >>>>>>that trivial ordering, and I think the upcoming cross-device fencing
> >>>>>>work will face the exact same problem.
> >>>>>>
> >>>>>>My proposed ordering API would look something like
> >>>>>>
> >>>>>>struct fence *order_fences(struct fence *fence_a, struct fence
> >>>>>>*fence_b, bool trivial_order, bool interruptible, bool no_wait_gpu)
> >>>>>>
> >>>>>>Returns whichever of the fences @fence_a and @fence_b guarantees, when
> >>>>>>signaled, that the other fence has also signaled. If @trivial_order is
> >>>>>>true and the driver cannot trivially order the fences, it may return
> >>>>>>ERR_PTR(-EAGAIN). If @interruptible is true, any wait should be performed
> >>>>>>interruptibly, and if @no_wait_gpu is true, the function is not
> >>>>>>allowed to wait on the gpu but should return ERR_PTR(-EBUSY) if it needs
> >>>>>>to do so to order the fences.
> >>>>>>
> >>>>>>(Hardware without semaphores can't order fences without waiting on them).
> >>>>>>
> >>>>>>The list approach you suggest would use @trivial_order = true,
> >>>>>>Single fence approach would use @trivial_order = false.
> >>>>>>
> >>>>>>And a first simple implementation in TTM would perhaps use your list
> >>>>>>approach with a single list for the whole memory type.
> >>>>>>
> >>>>>>/Thomas
> >>>>>I would rather add a callback like :
> >>>>>
> >>>>>ttm_reduce_fences(unsigned *nfences, fence **fencearray)
> >>>>I don't agree here. I think the fence order function is more
> >>>>versatile and a good abstraction that can be
> >>>>applied in a number of cases to this problem. Anyway we should sync
> >>>>this with Maarten and his
> >>>>fence work. The same problem applies to attaching shared fences to a bo.
> >>>Radeon already handles the bo case on multi-ring and i think it should
> >>>be left to the driver to do what is necessary there.
> >>>
> >>>I don't think here more versatility is bad, in fact i am pretty sure
> >>>that no hw can give fence ordering and i also think that if drivers have
> >>>to track multi-ring fence ordering it will just waste resources for no
> >>>good reason. For tracking that in radeon i would need to keep a list
> >>>of semaphores and know which semaphores ensure synchronization between
> >>>which rings and, for each, which fences are concerned; just thinking
> >>>about it, it would be a messy graph with tons of nodes.
> >>>
> >>>Of course here i am thinking in terms of newer GPUs with tons of rings.
> >>>GPUs with one ring, which are a vanishing category as newer opencl
> >>>requires more rings, are easy to handle but i really don't think we
> >>>should design for those.
> >>The biggest problem I see with ttm_reduce_fences() is that it seems
> >>to have high complexity, since it
> >>doesn't know which fence is the new one. And if it did, it would
> >>only be a multi-fence version of
> >>order_fences(trivial=true).
> >I am sure all drivers with multi-ring will store the ring id in their fence
> >structure; that's from my pov a requirement. So when you get a list of
> >fences you first go over the fences on the same ring, and so far all fence
> >implementations use increasing sequence numbers for fences. So reducing
> >becomes as easy as only leaving the most recent fence for each of the
> >rings with an active fence. At the same time it could check (assuming
> >it's a quick operation) if a fence is already signaled or not.
> >
> >For radeon this function would be very small, a simple nested loop
> >with a couple of tests in the loop.
> 
> What you describe sounds like an O(n²) complexity algorithm, whereas
> an algorithm based on order_fences is O(n) complexity.
> 
> Thanks, Thomas

Not exactly O(n²), but close in the worst case; i include pseudo code
below. But as i said, i expect n < 8. My point was that order_fences would
never return a single fence if the fences were scheduled on different
rings, i.e. it would wait (if gpu wait is allowed), which makes the whole
idea of moving the sync into the driver pointless.
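
To illustrate the point about different rings, here is a minimal sketch
(plain C, outside the kernel) of what an order_fences limited to trivial
ordering could decide. struct fence, its ring/seq fields and the name
order_fences_trivial are hypothetical, made up for this example, and NULL
stands in for the ERR_PTR(-EAGAIN) of the proposal:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fence: the ring it was emitted on plus a per-ring
 * increasing sequence number. Not the radeon or TTM structure. */
struct fence {
    unsigned ring;
    uint32_t seq;
};

/*
 * Trivial ordering only: on the same ring the fence with the higher
 * sequence number signals last, so waiting on it covers the other one.
 * Across rings nothing can be said without a semaphore or a wait, so
 * NULL is returned (the proposal would return ERR_PTR(-EAGAIN) here).
 */
static struct fence *order_fences_trivial(struct fence *a, struct fence *b)
{
    assert(a && b);
    if (a->ring != b->ring)
        return NULL;
    /* Wrap-safe comparison of sequence numbers. */
    return (int32_t)(a->seq - b->seq) >= 0 ? a : b;
}
```

With fences on different rings this always takes the NULL path, so the
caller either keeps both fences or has to block.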

As i said, there is no easy way to query fence interdependency and to
know which one will complete first if the fences are not emitted on the
same ring. Your approach asks the driver very early and may require the
driver to do work (likely blocking work) to return a single fence. What i
am suggesting is to delay the whole ring synchronization to the last
minute, in the bo move callback; at that point i am hoping that in most
cases most fences will have signaled and only a little work will be
necessary.
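
A rough sketch of that late synchronization, assuming a hypothetical fence
with a ring id and sequence number and a toy per-ring "last completed"
counter (none of these names are radeon or TTM API). At move time it only
reports the rings that still need a semaphore or a wait:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_RINGS 4

/* Hypothetical fence, for illustration only. */
struct fence {
    unsigned ring;
    uint32_t seq;
};

/* Toy hardware state: last sequence number completed on each ring. */
static uint32_t ring_done_seq[NUM_RINGS];

static bool fence_signaled(const struct fence *f)
{
    assert(f->ring < NUM_RINGS);
    return (int32_t)(ring_done_seq[f->ring] - f->seq) >= 0;
}

/*
 * What a bo move callback running on move_ring could do with the fence
 * list attached to a memory segment: skip fences that have signaled,
 * skip fences on the same ring (in-order execution already orders
 * them), and return a bitmask of the rings left to synchronize with.
 */
static unsigned move_sync_mask(unsigned move_ring,
                               struct fence *const *fences,
                               unsigned nfences)
{
    unsigned i, mask = 0;

    for (i = 0; i < nfences; i++) {
        const struct fence *f = fences[i];

        if (fence_signaled(f) || f->ring == move_ring)
            continue;
        mask |= 1u << f->ring;
    }
    return mask;
}
```

If most fences have signaled by the time of the move, the mask is mostly
empty and the driver emits few or no semaphores.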

So yes, i store more fences and it requires more memory, but i am pretty
sure, given the arguments i gave, that we will have a small number of
fences.

Pseudo reduce code :

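/* isolder(a, b): true if a was emitted before b on the same ring */
reduce(unsigned *nfences, fence **fences)
{
  unsigned i, j, c = 0;

  for (i = 0; i < *nfences; i++) {
    if (!fences[i])
      continue; /* already merged into an earlier slot */
    /* keep only the most recent fence of this ring in slot i */
    for (j = i + 1; j < *nfences; j++) {
      if (!fences[j] || fences[j]->ringid != fences[i]->ringid)
        continue;
      if (isolder(fences[i], fences[j])) {
        fence_unref(fences[i]);
        fences[i] = fences[j];
      } else {
        fence_unref(fences[j]);
      }
      fences[j] = NULL;
    }
    /* if the most recent fence has signaled, the ring is idle */
    if (issignaled(fences[i])) {
      fence_unref(fences[i]);
      fences[i] = NULL;
      continue;
    }
    /* compact the survivors to the front of the array */
    fences[c] = fences[i];
    if (c != i)
      fences[i] = NULL;
    c++;
  }
  *nfences = c;
}

This is untested, so i may still have gotten the compaction wrong :)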
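
For anyone who wants to try the reduction outside the kernel, here is a
minimal self-contained C sketch of the same idea. struct fence, the toy
refcount, ring_done_seq and the helper names are all made up for
illustration and are not the radeon or TTM structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_RINGS 8

/* Hypothetical stand-in for a driver fence. */
struct fence {
    unsigned ring;   /* ring the fence was emitted on */
    uint32_t seq;    /* per-ring increasing sequence number */
    int refcount;    /* toy reference count */
};

/* Toy hardware state: last sequence number completed on each ring. */
static uint32_t ring_done_seq[NUM_RINGS];

static bool fence_signaled(const struct fence *f)
{
    assert(f->ring < NUM_RINGS);
    return (int32_t)(ring_done_seq[f->ring] - f->seq) >= 0;
}

static void fence_unref(struct fence *f)
{
    f->refcount--;
}

/* True if a was emitted before b; only meaningful on the same ring. */
static bool fence_older(const struct fence *a, const struct fence *b)
{
    return (int32_t)(a->seq - b->seq) < 0;
}

/*
 * Keep at most one unsignaled fence per ring (the most recent one) and
 * compact the survivors to the front of the array.
 */
static void reduce_fences(unsigned *nfences, struct fence **fences)
{
    unsigned i, j, kept = 0;

    for (i = 0; i < *nfences; i++) {
        struct fence *cur = fences[i];

        if (!cur)
            continue;            /* merged into an earlier slot */
        fences[i] = NULL;
        for (j = i + 1; j < *nfences; j++) {
            if (!fences[j] || fences[j]->ring != cur->ring)
                continue;
            if (fence_older(cur, fences[j])) {
                fence_unref(cur);
                cur = fences[j];
            } else {
                fence_unref(fences[j]);
            }
            fences[j] = NULL;
        }
        if (fence_signaled(cur)) {
            fence_unref(cur);    /* newest fence done: ring is idle */
            continue;
        }
        fences[kept++] = cur;
    }
    *nfences = kept;
}
```

The inner loop makes it quadratic in the worst case, but every fence
merged by the inner loop is skipped by the outer one, so with a handful
of rings the cost stays small.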

Cheers,
Jerome

Thread overview: 34+ messages
2012-11-28 15:58 [RFC] drm/ttm: add minimum residency constraint for bo eviction j.glisse
2012-11-28 15:58 ` [PATCH] " j.glisse
2012-11-28 23:18   ` Thomas Hellstrom
2012-11-28 23:24     ` Jerome Glisse
2012-11-28 23:44       ` Alan Swanson
2012-11-29  0:01         ` Jerome Glisse
2012-11-29  2:15         ` Marek Olšák
2012-11-29  8:04           ` Thomas Hellstrom
2012-11-29 12:52             ` Marek Olšák
2012-11-29 20:33               ` Thomas Hellstrom
2012-11-29 21:58                 ` Marek Olšák
2012-11-30  8:38                   ` Thomas Hellstrom
2012-11-30  9:39                   ` Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction] Thomas Hellstrom
2012-11-30 16:30                     ` Jerome Glisse
2012-11-30 17:08                       ` Thomas Hellstrom
2012-11-30 17:18                         ` Jerome Glisse
2012-11-30 17:43                           ` Thomas Hellstrom
2012-11-30 18:07                             ` Jerome Glisse
2012-11-30 18:31                               ` Thomas Hellstrom
2012-11-30 19:25                                 ` Jerome Glisse
2012-11-30 20:35                                   ` Thomas Hellstrom
2012-11-30 21:07                                     ` Jerome Glisse
2012-11-30 21:36                                       ` Thomas Hellstrom
2012-11-30 22:02                                         ` Jerome Glisse
2012-11-29  8:41       ` [PATCH] drm/ttm: add minimum residency constraint for bo eviction Thomas Hellstrom
2012-11-29 15:50         ` Jerome Glisse
2012-11-28 21:51 ` [RFC] " Marek Olšák
2012-11-28 23:18   ` Jerome Glisse
2012-11-29  9:18   ` Thomas Hellstrom
2012-11-29  9:28     ` Michel Dänzer
2012-11-29  9:48       ` Thomas Hellstrom
2012-11-29 19:20     ` Marek Olšák
2012-11-29 19:36       ` Jerome Glisse
2012-11-29 20:40       ` Thomas Hellstrom
