dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/9] Visible VRAM Management Improvements
@ 2017-06-23 17:39 John Brooks
  2017-06-23 17:39 ` [PATCH 3/9] drm/amdgpu: Don't force BOs into visible VRAM for page faults John Brooks
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: John Brooks @ 2017-06-23 17:39 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Michel Dänzer,
	Marek Olšák
  Cc: David Airlie, Christian König,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

This patch series is intended to improve performance when limited CPU-visible
VRAM is under pressure.

Moving BOs into visible VRAM is essentially a housekeeping task. It's faster to
access them in VRAM than GTT, but it isn't a hard requirement for them to be in
VRAM. As such, it is unnecessary to spend valuable time blocking on this in the
page fault handler or during command submission. Doing so translates directly
into a longer frame time (ergo stalls and stuttering).

The problem worsens when attempting to move BOs into visible VRAM when it is
full. This takes much longer than a simple move because other BOs have to be
evicted, which involves finding and then moving potentially hundreds of other
BOs, which is very time consuming. In the case of limited visible VRAM, it's
important to do this sometime to keep the contents of visible VRAM fresh, but
it does not need to be a blocking operation. If visible VRAM is full, the BO
can be read from GTT in the meantime and the BO can be moved to VRAM later.

Thus, I have made it so that neither the command submission code nor page fault
handler spends time evicting BOs from visible VRAM, and instead this is
deferred to a workqueue function that's queued when CS requests BOs flagged
AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED.

Speaking of CPU_ACCESS_REQUIRED, I've changed the handling of that flag so that
the kernel driver can clear it later even if it was set by userspace. This is
because the userspace graphics library can't know whether the application
really needs it to be CPU_ACCESS_REQUIRED forever. The kernel driver can't know
either, but it does know when page faults occur, and if a BO doesn't appear to
have any page faults when it's moved somewhere inaccessible, the flag can be
removed and it doesn't have to take up space in CPU-visible memory anymore.
This change was based on IRC discussions with Michel.

Patch 7 fixes a problem with BO moverate throttling that causes visible VRAM
moves to not be throttled if total VRAM isn't full enough.

I've also added a vis_vramlimit module parameter for debugging purposes. It's
similar to the vramlimit parameter except it limits only visible VRAM.

I have tested this patch set with the two games I know to be affected by
visible VRAM pressure: DiRT Rally and Dying Light. It practically eliminates
eviction-related stuttering in DiRT Rally as well as very low performance if
visible VRAM is limited to 64MB. It also fixes severely low framerates that
occurred in some areas of Dying Light. All my testing was done with an R9 290
with 4GB of visible VRAM with an Intel i7 4790.

--
John Brooks (Frogging101)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/9] drm/amdgpu: Separate placements and busy placements
       [not found] ` <1498239580-17360-1-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
@ 2017-06-23 17:39   ` John Brooks
  2017-06-23 17:39   ` [PATCH 2/9] drm/amdgpu: Add vis_vramlimit module parameter John Brooks
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: John Brooks @ 2017-06-23 17:39 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Michel Dänzer,
	Marek Olšák
  Cc: David Airlie, Christian König,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

This allows a BO to have busy placements that are not part of its normal
placements.

Users that want the busy placements to be the same can change the
placement.busy_placement pointer and corresponding count to be the same as
the regular placements.

Signed-off-by: John Brooks <john@fastquake.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 50 +++++++++++++++++++++---------
 2 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 12d61ed..c407d2d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -416,6 +416,7 @@ struct amdgpu_bo {
 	u32				prefered_domains;
 	u32				allowed_domains;
 	struct ttm_place		placements[AMDGPU_GEM_DOMAIN_MAX + 1];
+	struct ttm_place		busy_placements[AMDGPU_GEM_DOMAIN_MAX + 1];
 	struct ttm_placement		placement;
 	struct ttm_buffer_object	tbo;
 	struct ttm_bo_kmap_obj		kmap;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 8ee6965..244a7d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -116,9 +116,11 @@ bool amdgpu_ttm_bo_is_amdgpu_bo(struct ttm_buffer_object *bo)
 static void amdgpu_ttm_placement_init(struct amdgpu_device *adev,
 				      struct ttm_placement *placement,
 				      struct ttm_place *places,
+				      struct ttm_place *busy_places,
 				      u32 domain, u64 flags)
 {
 	u32 c = 0;
+	u32 bc = 0;
 
 	if (domain & AMDGPU_GEM_DOMAIN_VRAM) {
 		unsigned visible_pfn = adev->mc.visible_vram_size >> PAGE_SHIFT;
@@ -135,7 +137,8 @@ static void amdgpu_ttm_placement_init(struct amdgpu_device *adev,
 
 		if (flags & AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS)
 			places[c].flags |= TTM_PL_FLAG_CONTIGUOUS;
-		c++;
+
+		busy_places[bc++] = places[c++];
 	}
 
 	if (domain & AMDGPU_GEM_DOMAIN_GTT) {
@@ -147,7 +150,8 @@ static void amdgpu_ttm_placement_init(struct amdgpu_device *adev,
 				TTM_PL_FLAG_UNCACHED;
 		else
 			places[c].flags |= TTM_PL_FLAG_CACHED;
-		c++;
+
+		busy_places[bc++] = places[c++];
 	}
 
 	if (domain & AMDGPU_GEM_DOMAIN_CPU) {
@@ -159,42 +163,47 @@ static void amdgpu_ttm_placement_init(struct amdgpu_device *adev,
 				TTM_PL_FLAG_UNCACHED;
 		else
 			places[c].flags |= TTM_PL_FLAG_CACHED;
-		c++;
+
+		busy_places[bc++] = places[c++];
 	}
 
 	if (domain & AMDGPU_GEM_DOMAIN_GDS) {
 		places[c].fpfn = 0;
 		places[c].lpfn = 0;
 		places[c].flags = TTM_PL_FLAG_UNCACHED | AMDGPU_PL_FLAG_GDS;
-		c++;
+
+		busy_places[bc++] = places[c++];
 	}
 
 	if (domain & AMDGPU_GEM_DOMAIN_GWS) {
 		places[c].fpfn = 0;
 		places[c].lpfn = 0;
 		places[c].flags = TTM_PL_FLAG_UNCACHED | AMDGPU_PL_FLAG_GWS;
-		c++;
+
+		busy_places[bc++] = places[c++];
 	}
 
 	if (domain & AMDGPU_GEM_DOMAIN_OA) {
 		places[c].fpfn = 0;
 		places[c].lpfn = 0;
 		places[c].flags = TTM_PL_FLAG_UNCACHED | AMDGPU_PL_FLAG_OA;
-		c++;
+
+		busy_places[bc++] = places[c++];
 	}
 
 	if (!c) {
 		places[c].fpfn = 0;
 		places[c].lpfn = 0;
 		places[c].flags = TTM_PL_MASK_CACHING | TTM_PL_FLAG_SYSTEM;
-		c++;
+
+		busy_places[bc++] = places[c++];
 	}
 
 	placement->num_placement = c;
 	placement->placement = places;
 
-	placement->num_busy_placement = c;
-	placement->busy_placement = places;
+	placement->num_busy_placement = bc;
+	placement->busy_placement = busy_places;
 }
 
 void amdgpu_ttm_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
@@ -202,7 +211,7 @@ void amdgpu_ttm_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
 	struct amdgpu_device *adev = amdgpu_ttm_adev(abo->tbo.bdev);
 
 	amdgpu_ttm_placement_init(adev, &abo->placement, abo->placements,
-				  domain, abo->flags);
+				  abo->busy_placements, domain, abo->flags);
 }
 
 static void amdgpu_fill_placement_to_bo(struct amdgpu_bo *bo,
@@ -212,10 +221,12 @@ static void amdgpu_fill_placement_to_bo(struct amdgpu_bo *bo,
 
 	memcpy(bo->placements, placement->placement,
 	       placement->num_placement * sizeof(struct ttm_place));
+	memcpy(bo->busy_placements, placement->busy_placement,
+	       placement->num_busy_placement * sizeof(struct ttm_place));
 	bo->placement.num_placement = placement->num_placement;
 	bo->placement.num_busy_placement = placement->num_busy_placement;
 	bo->placement.placement = bo->placements;
-	bo->placement.busy_placement = bo->placements;
+	bo->placement.busy_placement = bo->busy_placements;
 }
 
 /**
@@ -441,6 +452,7 @@ static int amdgpu_bo_create_shadow(struct amdgpu_device *adev,
 {
 	struct ttm_placement placement = {0};
 	struct ttm_place placements[AMDGPU_GEM_DOMAIN_MAX + 1];
+	struct ttm_place busy_placements[AMDGPU_GEM_DOMAIN_MAX + 1];
 	int r;
 
 	if (bo->shadow)
@@ -449,9 +461,12 @@ static int amdgpu_bo_create_shadow(struct amdgpu_device *adev,
 	bo->flags |= AMDGPU_GEM_CREATE_SHADOW;
 	memset(&placements, 0,
 	       (AMDGPU_GEM_DOMAIN_MAX + 1) * sizeof(struct ttm_place));
+	memset(&busy_placements, 0,
+	       (AMDGPU_GEM_DOMAIN_MAX + 1) * sizeof(struct ttm_place));
 
 	amdgpu_ttm_placement_init(adev, &placement,
-				  placements, AMDGPU_GEM_DOMAIN_GTT,
+				  placements, busy_placements,
+				  AMDGPU_GEM_DOMAIN_GTT,
 				  AMDGPU_GEM_CREATE_CPU_GTT_USWC);
 
 	r = amdgpu_bo_create_restricted(adev, size, byte_align, true,
@@ -479,13 +494,16 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
 {
 	struct ttm_placement placement = {0};
 	struct ttm_place placements[AMDGPU_GEM_DOMAIN_MAX + 1];
+	struct ttm_place busy_placements[AMDGPU_GEM_DOMAIN_MAX + 1];
 	int r;
 
 	memset(&placements, 0,
 	       (AMDGPU_GEM_DOMAIN_MAX + 1) * sizeof(struct ttm_place));
+	memset(&busy_placements, 0,
+	       (AMDGPU_GEM_DOMAIN_MAX + 1) * sizeof(struct ttm_place));
 
-	amdgpu_ttm_placement_init(adev, &placement,
-				  placements, domain, flags);
+	amdgpu_ttm_placement_init(adev, &placement, placements,
+				  busy_placements, domain, flags);
 
 	r = amdgpu_bo_create_restricted(adev, size, byte_align, kernel,
 					domain, flags, sg, &placement,
@@ -718,6 +736,8 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain,
 			bo->placements[i].lpfn = lpfn;
 		bo->placements[i].flags |= TTM_PL_FLAG_NO_EVICT;
 	}
+	bo->placement.busy_placement = bo->placement.placement;
+	bo->placement.num_busy_placement = bo->placement.num_placement;
 
 	r = ttm_bo_validate(&bo->tbo, &bo->placement, false, false);
 	if (unlikely(r)) {
@@ -970,6 +990,8 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
 		     abo->placements[i].lpfn > lpfn))
 			abo->placements[i].lpfn = lpfn;
 	}
+	abo->placement.busy_placement = abo->placement.placement;
+	abo->placement.num_busy_placement = abo->placement.num_placement;
 	r = ttm_bo_validate(bo, &abo->placement, false, false);
 	if (unlikely(r == -ENOMEM)) {
 		amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/9] drm/amdgpu: Add vis_vramlimit module parameter
       [not found] ` <1498239580-17360-1-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
  2017-06-23 17:39   ` [PATCH 1/9] drm/amdgpu: Separate placements and busy placements John Brooks
@ 2017-06-23 17:39   ` John Brooks
  2017-06-26  9:48     ` Michel Dänzer
  2017-06-26  9:57     ` Christian König
  2017-06-23 17:39   ` [PATCH 4/9] drm/amdgpu: Don't force BOs into visible VRAM if they can go to GTT instead John Brooks
                     ` (5 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: John Brooks @ 2017-06-23 17:39 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Michel Dänzer,
	Marek Olšák
  Cc: David Airlie, Christian König,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Allow specifying a limit on visible VRAM via a module parameter. This is
helpful for testing performance under visible VRAM pressure.

Signed-off-by: John Brooks <john@fastquake.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 8 ++++++++
 3 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c407d2d..fe73633 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -74,6 +74,7 @@
  */
 extern int amdgpu_modeset;
 extern int amdgpu_vram_limit;
+extern int amdgpu_vis_vram_limit;
 extern int amdgpu_gart_size;
 extern int amdgpu_moverate;
 extern int amdgpu_benchmarking;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 4c7c262..a14f458 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -73,6 +73,7 @@
 #define KMS_DRIVER_PATCHLEVEL	0
 
 int amdgpu_vram_limit = 0;
+int amdgpu_vis_vram_limit = 0;
 int amdgpu_gart_size = -1; /* auto */
 int amdgpu_moverate = -1; /* auto */
 int amdgpu_benchmarking = 0;
@@ -119,6 +120,9 @@ int amdgpu_lbpw = -1;
 MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");
 module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
 
+MODULE_PARM_DESC(vis_vramlimit, "Restrict visible VRAM for testing, in megabytes");
+module_param_named(vis_vramlimit, amdgpu_vis_vram_limit, int, 0444);
+
 MODULE_PARM_DESC(gartsize, "Size of PCIE/IGP gart to setup in megabytes (32, 64, etc., -1 = auto)");
 module_param_named(gartsize, amdgpu_gart_size, int, 0600);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index c9b131b..0676a78 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1095,6 +1095,7 @@ static struct ttm_bo_driver amdgpu_bo_driver = {
 int amdgpu_ttm_init(struct amdgpu_device *adev)
 {
 	int r;
+	u64 vis_vram_limit;
 
 	r = amdgpu_ttm_global_init(adev);
 	if (r) {
@@ -1118,6 +1119,13 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
 		DRM_ERROR("Failed initializing VRAM heap.\n");
 		return r;
 	}
+
+	/* Reduce size of CPU-visible VRAM if requested */
+	vis_vram_limit = amdgpu_vis_vram_limit * 1024 * 1024;
+	if (amdgpu_vis_vram_limit > 0 &&
+	    vis_vram_limit <= adev->mc.visible_vram_size)
+		adev->mc.visible_vram_size = vis_vram_limit;
+
 	/* Change the size here instead of the init above so only lpfn is affected */
 	amdgpu_ttm_set_active_vram_size(adev, adev->mc.visible_vram_size);
 
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/9] drm/amdgpu: Don't force BOs into visible VRAM for page faults
  2017-06-23 17:39 [PATCH 0/9] Visible VRAM Management Improvements John Brooks
@ 2017-06-23 17:39 ` John Brooks
       [not found]   ` <1498239580-17360-4-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
  2017-06-23 17:39 ` [PATCH 5/9] drm/amdgpu: Track time of last page fault and last CS move in struct amdgpu_bo John Brooks
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: John Brooks @ 2017-06-23 17:39 UTC (permalink / raw)
  To: amd-gfx, Michel Dänzer, Marek Olšák; +Cc: dri-devel

There is no need for page faults to force BOs into visible VRAM if it's
full, and the time it takes to do so is great enough to cause noticeable
stuttering. Add GTT as a possible placement so that if visible VRAM is
full, page faults move BOs to GTT instead of evicting other BOs from VRAM.

Signed-off-by: John Brooks <john@fastquake.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 244a7d3..751bc05 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -981,10 +981,11 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
 
 	/* hurrah the memory is not visible ! */
 	atomic64_inc(&adev->num_vram_cpu_page_faults);
-	amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM);
+	amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM |
+					 AMDGPU_GEM_DOMAIN_GTT);
 	lpfn =	adev->mc.visible_vram_size >> PAGE_SHIFT;
 	for (i = 0; i < abo->placement.num_placement; i++) {
-		/* Force into visible VRAM */
+		/* Try to move the BO into visible VRAM */
 		if ((abo->placements[i].flags & TTM_PL_FLAG_VRAM) &&
 		    (!abo->placements[i].lpfn ||
 		     abo->placements[i].lpfn > lpfn))
@@ -993,16 +994,13 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
 	abo->placement.busy_placement = abo->placement.placement;
 	abo->placement.num_busy_placement = abo->placement.num_placement;
 	r = ttm_bo_validate(bo, &abo->placement, false, false);
-	if (unlikely(r == -ENOMEM)) {
-		amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
-		return ttm_bo_validate(bo, &abo->placement, false, false);
-	} else if (unlikely(r != 0)) {
+	if (unlikely(r != 0))
 		return r;
-	}
 
 	offset = bo->mem.start << PAGE_SHIFT;
 	/* this should never happen */
-	if ((offset + size) > adev->mc.visible_vram_size)
+	if (bo->mem.mem_type == TTM_PL_VRAM &&
+	    (offset + size) > adev->mc.visible_vram_size)
 		return -EINVAL;
 
 	return 0;
-- 
2.7.4

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/9] drm/amdgpu: Don't force BOs into visible VRAM if they can go to GTT instead
       [not found] ` <1498239580-17360-1-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
  2017-06-23 17:39   ` [PATCH 1/9] drm/amdgpu: Separate placements and busy placements John Brooks
  2017-06-23 17:39   ` [PATCH 2/9] drm/amdgpu: Add vis_vramlimit module parameter John Brooks
@ 2017-06-23 17:39   ` John Brooks
       [not found]     ` <1498239580-17360-5-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
  2017-06-23 17:39   ` [PATCH 6/9] drm/amdgpu: Set/clear CPU_ACCESS_REQUIRED flag on page fault and CS John Brooks
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: John Brooks @ 2017-06-23 17:39 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Michel Dänzer,
	Marek Olšák
  Cc: David Airlie, Christian König,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

amdgpu_ttm_placement_init() callers that are using both VRAM and GTT as
domains usually don't want visible VRAM as a busy placement.

Signed-off-by: John Brooks <john@fastquake.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 751bc05..0ff555a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -138,7 +138,15 @@ static void amdgpu_ttm_placement_init(struct amdgpu_device *adev,
 		if (flags & AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS)
 			places[c].flags |= TTM_PL_FLAG_CONTIGUOUS;
 
-		busy_places[bc++] = places[c++];
+		/* Don't set limited visible VRAM as a busy placement if we can
+		 * use GTT instead
+		 */
+		if (!((flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) &&
+		      adev->mc.visible_vram_size < adev->mc.real_vram_size &&
+		      (domain & AMDGPU_GEM_DOMAIN_GTT)))
+			busy_places[bc++] = places[c];
+
+		c++;
 	}
 
 	if (domain & AMDGPU_GEM_DOMAIN_GTT) {
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 5/9] drm/amdgpu: Track time of last page fault and last CS move in struct amdgpu_bo
  2017-06-23 17:39 [PATCH 0/9] Visible VRAM Management Improvements John Brooks
  2017-06-23 17:39 ` [PATCH 3/9] drm/amdgpu: Don't force BOs into visible VRAM for page faults John Brooks
@ 2017-06-23 17:39 ` John Brooks
       [not found] ` <1498239580-17360-1-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
  2017-06-23 17:39 ` [PATCH 9/9] drm/amdgpu: Reduce lock contention when evicting from visible VRAM John Brooks
  3 siblings, 0 replies; 31+ messages in thread
From: John Brooks @ 2017-06-23 17:39 UTC (permalink / raw)
  To: amd-gfx, Michel Dänzer, Marek Olšák; +Cc: dri-devel

Signed-off-by: John Brooks <john@fastquake.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 5 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 2 ++
 3 files changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index fe73633..245234e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -429,6 +429,9 @@ struct amdgpu_bo {
 	void				*metadata;
 	u32				metadata_size;
 	unsigned			prime_shared_count;
+	unsigned long			last_page_fault_jiffies;
+	unsigned long			last_cs_move_jiffies;
+
 	/* list of all virtual address to which this bo
 	 * is associated to
 	 */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index aeee684..2fad8bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -306,6 +306,7 @@ static int amdgpu_cs_bo_validate(struct amdgpu_cs_parser *p,
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
 	u64 initial_bytes_moved;
 	uint32_t domain;
+	uint32_t old_mem;
 	int r;
 
 	if (bo->pin_count)
@@ -322,6 +323,7 @@ static int amdgpu_cs_bo_validate(struct amdgpu_cs_parser *p,
 retry:
 	amdgpu_ttm_placement_from_domain(bo, domain);
 	initial_bytes_moved = atomic64_read(&adev->num_bytes_moved);
+	old_mem = bo->tbo.mem.mem_type;
 	r = ttm_bo_validate(&bo->tbo, &bo->placement, true, false);
 	p->bytes_moved += atomic64_read(&adev->num_bytes_moved) -
 		initial_bytes_moved;
@@ -331,6 +333,9 @@ static int amdgpu_cs_bo_validate(struct amdgpu_cs_parser *p,
 		goto retry;
 	}
 
+	if (bo->tbo.mem.mem_type != old_mem)
+		bo->last_cs_move_jiffies = jiffies;
+
 	return r;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 0ff555a..31d1f21 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -977,6 +977,8 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
 	if (bo->mem.mem_type != TTM_PL_VRAM)
 		return 0;
 
+	abo->last_page_fault_jiffies = jiffies;
+
 	size = bo->mem.num_pages << PAGE_SHIFT;
 	offset = bo->mem.start << PAGE_SHIFT;
 	/* TODO: figure out how to map scattered VRAM to the CPU */
-- 
2.7.4

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 6/9] drm/amdgpu: Set/clear CPU_ACCESS_REQUIRED flag on page fault and CS
       [not found] ` <1498239580-17360-1-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
                     ` (2 preceding siblings ...)
  2017-06-23 17:39   ` [PATCH 4/9] drm/amdgpu: Don't force BOs into visible VRAM if they can go to GTT instead John Brooks
@ 2017-06-23 17:39   ` John Brooks
       [not found]     ` <1498239580-17360-7-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
  2017-06-23 17:39   ` [PATCH 7/9] drm/amdgpu: Throttle visible VRAM moves separately John Brooks
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: John Brooks @ 2017-06-23 17:39 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Michel Dänzer,
	Marek Olšák
  Cc: David Airlie, Christian König,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

When the AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED flag is given by userspace,
it should only be treated as a hint to initially place a BO somewhere CPU
accessible, rather than having a permanent effect on BO placement.

Instead of the flag being set in stone at BO creation, set the flag when a
page fault occurs so that it goes somewhere CPU-visible, and clear it when
the BO is requested by the GPU.

However, clearing the CPU_ACCESS_REQUIRED flag may move a BO to invisible
VRAM, which is likely to cause a page fault that moves it right back to
GTT. When this happens too much, it is highly detrimental to performance.
Only clear the flag on CS if:

- The BO wasn't page faulted for a certain amount of time (currently 10
  seconds, measured with jiffies), and
- its last page fault didn't occur too soon (currently 500ms) after its
  last CS request, or vice versa.

Setting the flag in amdgpu_fault_reserve_notify() also means that we can
remove the loop to restrict lpfn to the end of visible VRAM, because
amdgpu_ttm_placement_init() will do it for us.

Signed-off-by: John Brooks <john@fastquake.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 46 ++++++++++++++++++++++--------
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
 3 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 2fad8bd..73d6882 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -320,6 +320,7 @@ static int amdgpu_cs_bo_validate(struct amdgpu_cs_parser *p,
 	else
 		domain = bo->allowed_domains;
 
+	amdgpu_bo_clear_cpu_access_required(bo);
 retry:
 	amdgpu_ttm_placement_from_domain(bo, domain);
 	initial_bytes_moved = atomic64_read(&adev->num_bytes_moved);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 31d1f21..a7d48a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -967,8 +967,8 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
 {
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->bdev);
 	struct amdgpu_bo *abo;
-	unsigned long offset, size, lpfn;
-	int i, r;
+	unsigned long offset, size;
+	int r;
 
 	if (!amdgpu_ttm_bo_is_amdgpu_bo(bo))
 		return 0;
@@ -991,18 +991,9 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
 
 	/* hurrah the memory is not visible ! */
 	atomic64_inc(&adev->num_vram_cpu_page_faults);
+	abo->flags |= AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
 	amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM |
 					 AMDGPU_GEM_DOMAIN_GTT);
-	lpfn =	adev->mc.visible_vram_size >> PAGE_SHIFT;
-	for (i = 0; i < abo->placement.num_placement; i++) {
-		/* Try to move the BO into visible VRAM */
-		if ((abo->placements[i].flags & TTM_PL_FLAG_VRAM) &&
-		    (!abo->placements[i].lpfn ||
-		     abo->placements[i].lpfn > lpfn))
-			abo->placements[i].lpfn = lpfn;
-	}
-	abo->placement.busy_placement = abo->placement.placement;
-	abo->placement.num_busy_placement = abo->placement.num_placement;
 	r = ttm_bo_validate(bo, &abo->placement, false, false);
 	if (unlikely(r != 0))
 		return r;
@@ -1057,3 +1048,34 @@ u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo)
 
 	return bo->tbo.offset;
 }
+
+/**
+ * amdgpu_bo_clear_cpu_access_required
+ * @bo: BO to update
+ *
+ * Clears CPU_ACCESS_REQUIRED flag if the BO hasn't had a page fault in a while
+ * and it didn't have a page fault too soon after the last time it was moved to
+ * VRAM.
+ *
+ * Caller should have bo reserved.
+ *
+ */
+void amdgpu_bo_clear_cpu_access_required(struct amdgpu_bo *bo)
+{
+	const unsigned int page_fault_timeout_ms = 10000;
+	const unsigned int min_period_ms = 500;
+	unsigned int ms_since_pf, period_ms;
+
+	ms_since_pf = jiffies_to_msecs(jiffies - bo->last_page_fault_jiffies);
+	period_ms = jiffies_to_msecs(abs(bo->last_page_fault_jiffies -
+					 bo->last_cs_move_jiffies));
+
+	/*
+	 * Try to avoid a revolving door between GTT and VRAM. Clearing the
+	 * flag may move this BO back to VRAM, so don't clear it if it's likely
+	 * to page fault and go right back to GTT.
+	 */
+	if ((!bo->last_page_fault_jiffies || !bo->last_cs_move_jiffies) ||
+	    (ms_since_pf > page_fault_timeout_ms && period_ms > min_period_ms))
+		bo->flags &= ~AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 3824851..b0cb137 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -182,6 +182,7 @@ int amdgpu_bo_restore_from_shadow(struct amdgpu_device *adev,
 				  struct reservation_object *resv,
 				  struct dma_fence **fence,
 				  bool direct);
+void amdgpu_bo_clear_cpu_access_required(struct amdgpu_bo *bo);
 
 
 /*
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 7/9] drm/amdgpu: Throttle visible VRAM moves separately
       [not found] ` <1498239580-17360-1-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
                     ` (3 preceding siblings ...)
  2017-06-23 17:39   ` [PATCH 6/9] drm/amdgpu: Set/clear CPU_ACCESS_REQUIRED flag on page fault and CS John Brooks
@ 2017-06-23 17:39   ` John Brooks
       [not found]     ` <1498239580-17360-8-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
  2017-06-23 17:39   ` [PATCH 8/9] drm/amdgpu: Asynchronously move BOs to visible VRAM John Brooks
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: John Brooks @ 2017-06-23 17:39 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Michel Dänzer,
	Marek Olšák
  Cc: David Airlie, Christian König,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

The BO move throttling code is designed to allow VRAM to fill quickly if it
is relatively empty. However, this does not take into account situations
where the visible VRAM is smaller than total VRAM, and total VRAM may not
be close to full but the visible VRAM segment is under pressure. In such
situations, visible VRAM would experience unrestricted swapping and
performance would drop.

Add a separate counter specifically for moves involving visible VRAM, and
check it before moving BOs there.

Fixes: 95844d20ae02 (drm/amdgpu: throttle buffer migrations at CS using a fixed MBps limit (v2))
Signed-off-by: John Brooks <john@fastquake.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 76 +++++++++++++++++++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 10 ++--
 3 files changed, 70 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 245234e..81dbb93 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1161,7 +1161,9 @@ struct amdgpu_cs_parser {
 	struct list_head		validated;
 	struct dma_fence		*fence;
 	uint64_t			bytes_moved_threshold;
+	uint64_t			bytes_moved_vis_threshold;
 	uint64_t			bytes_moved;
+	uint64_t			bytes_moved_vis;
 	struct amdgpu_bo_list_entry	*evictable;
 
 	/* user fence */
@@ -1595,6 +1597,7 @@ struct amdgpu_device {
 		spinlock_t		lock;
 		s64			last_update_us;
 		s64			accum_us; /* accumulated microseconds */
+		s64			accum_us_vis; /* for visible VRAM */
 		u32			log2_max_MBps;
 	} mm_stats;
 
@@ -1930,7 +1933,8 @@ bool amdgpu_need_post(struct amdgpu_device *adev);
 void amdgpu_update_display_priority(struct amdgpu_device *adev);
 
 int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data);
-void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes);
+void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes,
+				  u64 num_vis_bytes);
 void amdgpu_ttm_placement_from_domain(struct amdgpu_bo *abo, u32 domain);
 bool amdgpu_ttm_bo_is_amdgpu_bo(struct ttm_buffer_object *bo);
 int amdgpu_ttm_tt_get_user_pages(struct ttm_tt *ttm, struct page **pages);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 73d6882..25d6df6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -223,11 +223,13 @@ static s64 bytes_to_us(struct amdgpu_device *adev, u64 bytes)
  * ticks. The accumulated microseconds (us) are converted to bytes and
  * returned.
  */
-static u64 amdgpu_cs_get_threshold_for_moves(struct amdgpu_device *adev)
+static void amdgpu_cs_get_threshold_for_moves(struct amdgpu_device *adev,
+					      u64 *max_bytes,
+					      u64 *max_vis_bytes)
 {
 	s64 time_us, increment_us;
-	u64 max_bytes;
 	u64 free_vram, total_vram, used_vram;
+	u64 free_vis_vram, total_vis_vram, used_vis_vram;
 
 	/* Allow a maximum of 200 accumulated ms. This is basically per-IB
 	 * throttling.
@@ -238,13 +240,21 @@ static u64 amdgpu_cs_get_threshold_for_moves(struct amdgpu_device *adev)
 	 */
 	const s64 us_upper_bound = 200000;
 
-	if (!adev->mm_stats.log2_max_MBps)
-		return 0;
+	if (!adev->mm_stats.log2_max_MBps) {
+		*max_bytes = 0;
+		*max_vis_bytes = 0;
+		return;
+	}
 
 	total_vram = adev->mc.real_vram_size - adev->vram_pin_size;
 	used_vram = atomic64_read(&adev->vram_usage);
 	free_vram = used_vram >= total_vram ? 0 : total_vram - used_vram;
 
+	total_vis_vram = adev->mc.visible_vram_size;
+	used_vis_vram = atomic64_read(&adev->vram_vis_usage);
+	free_vis_vram = used_vis_vram >= total_vis_vram ? 0 : total_vis_vram -
+							      used_vis_vram;
+
 	spin_lock(&adev->mm_stats.lock);
 
 	/* Increase the amount of accumulated us. */
@@ -252,7 +262,9 @@ static u64 amdgpu_cs_get_threshold_for_moves(struct amdgpu_device *adev)
 	increment_us = time_us - adev->mm_stats.last_update_us;
 	adev->mm_stats.last_update_us = time_us;
 	adev->mm_stats.accum_us = min(adev->mm_stats.accum_us + increment_us,
-                                      us_upper_bound);
+				      us_upper_bound);
+	adev->mm_stats.accum_us_vis = min(adev->mm_stats.accum_us_vis +
+					  increment_us, us_upper_bound);
 
 	/* This prevents the short period of low performance when the VRAM
 	 * usage is low and the driver is in debt or doesn't have enough
@@ -280,23 +292,32 @@ static u64 amdgpu_cs_get_threshold_for_moves(struct amdgpu_device *adev)
 		adev->mm_stats.accum_us = max(min_us, adev->mm_stats.accum_us);
 	}
 
-	/* This returns 0 if the driver is in debt to disallow (optional)
+	/* Do the same for visible VRAM if half of it is free */
+	if (free_vis_vram >= total_vis_vram / 2) {
+		adev->mm_stats.accum_us_vis = max(bytes_to_us(adev,
+						  free_vis_vram / 2),
+						  adev->mm_stats.accum_us_vis);
+	}
+
+	/* This is set to 0 if the driver is in debt to disallow (optional)
 	 * buffer moves.
 	 */
-	max_bytes = us_to_bytes(adev, adev->mm_stats.accum_us);
+	*max_bytes = us_to_bytes(adev, adev->mm_stats.accum_us);
+	*max_vis_bytes = us_to_bytes(adev, adev->mm_stats.accum_us_vis);
 
 	spin_unlock(&adev->mm_stats.lock);
-	return max_bytes;
 }
 
 /* Report how many bytes have really been moved for the last command
  * submission. This can result in a debt that can stop buffer migrations
  * temporarily.
  */
-void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes)
+void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes,
+				  u64 num_vis_bytes)
 {
 	spin_lock(&adev->mm_stats.lock);
 	adev->mm_stats.accum_us -= bytes_to_us(adev, num_bytes);
+	adev->mm_stats.accum_us_vis -= bytes_to_us(adev, num_vis_bytes);
 	spin_unlock(&adev->mm_stats.lock);
 }
 
@@ -304,7 +325,7 @@ static int amdgpu_cs_bo_validate(struct amdgpu_cs_parser *p,
 				 struct amdgpu_bo *bo)
 {
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
-	u64 initial_bytes_moved;
+	u64 initial_bytes_moved, bytes_moved;
 	uint32_t domain;
 	uint32_t old_mem;
 	int r;
@@ -312,22 +333,39 @@ static int amdgpu_cs_bo_validate(struct amdgpu_cs_parser *p,
 	if (bo->pin_count)
 		return 0;
 
+	amdgpu_bo_clear_cpu_access_required(bo);
+
 	/* Don't move this buffer if we have depleted our allowance
 	 * to move it. Don't move anything if the threshold is zero.
 	 */
-	if (p->bytes_moved < p->bytes_moved_threshold)
-		domain = bo->prefered_domains;
-	else
+	if (p->bytes_moved < p->bytes_moved_threshold) {
+		if (adev->mc.visible_vram_size < adev->mc.real_vram_size &&
+		    (bo->flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED)) {
+			/* And don't move a CPU_ACCESS_REQUIRED BO to limited
+			 * visible VRAM if we've depleted our allowance to do
+			 * that.
+			 */
+			if (p->bytes_moved_vis < p->bytes_moved_vis_threshold)
+				domain = bo->prefered_domains;
+			else
+				domain = bo->allowed_domains;
+		} else {
+			domain = bo->prefered_domains;
+		}
+	} else {
 		domain = bo->allowed_domains;
+	}
 
-	amdgpu_bo_clear_cpu_access_required(bo);
 retry:
 	amdgpu_ttm_placement_from_domain(bo, domain);
 	initial_bytes_moved = atomic64_read(&adev->num_bytes_moved);
 	old_mem = bo->tbo.mem.mem_type;
 	r = ttm_bo_validate(&bo->tbo, &bo->placement, true, false);
-	p->bytes_moved += atomic64_read(&adev->num_bytes_moved) -
+	bytes_moved = atomic64_read(&adev->num_bytes_moved) -
 		initial_bytes_moved;
+	p->bytes_moved += bytes_moved;
+	if (bo->flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED)
+		p->bytes_moved_vis += bytes_moved;
 
 	if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) {
 		domain = bo->allowed_domains;
@@ -560,8 +598,10 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 		list_splice(&need_pages, &p->validated);
 	}
 
-	p->bytes_moved_threshold = amdgpu_cs_get_threshold_for_moves(p->adev);
+	amdgpu_cs_get_threshold_for_moves(p->adev, &p->bytes_moved_threshold,
+					  &p->bytes_moved_vis_threshold);
 	p->bytes_moved = 0;
+	p->bytes_moved_vis = 0;
 	p->evictable = list_last_entry(&p->validated,
 				       struct amdgpu_bo_list_entry,
 				       tv.head);
@@ -585,8 +625,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 		goto error_validate;
 	}
 
-	amdgpu_cs_report_moved_bytes(p->adev, p->bytes_moved);
-
+	amdgpu_cs_report_moved_bytes(p->adev, p->bytes_moved,
+				     p->bytes_moved_vis);
 	fpriv->vm.last_eviction_counter =
 		atomic64_read(&p->adev->num_evictions);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index a7d48a7..27d8c77 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -341,7 +341,7 @@ int amdgpu_bo_create_restricted(struct amdgpu_device *adev,
 	struct amdgpu_bo *bo;
 	enum ttm_bo_type type;
 	unsigned long page_align;
-	u64 initial_bytes_moved;
+	u64 initial_bytes_moved, bytes_moved;
 	size_t acc_size;
 	int r;
 
@@ -417,8 +417,12 @@ int amdgpu_bo_create_restricted(struct amdgpu_device *adev,
 	r = ttm_bo_init_reserved(&adev->mman.bdev, &bo->tbo, size, type,
 				 &bo->placement, page_align, !kernel, NULL,
 				 acc_size, sg, resv, &amdgpu_ttm_bo_destroy);
-	amdgpu_cs_report_moved_bytes(adev,
-		atomic64_read(&adev->num_bytes_moved) - initial_bytes_moved);
+	bytes_moved = atomic64_read(&adev->num_bytes_moved) -
+		initial_bytes_moved;
+	if (bo->flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED)
+		amdgpu_cs_report_moved_bytes(adev, bytes_moved, bytes_moved);
+	else
+		amdgpu_cs_report_moved_bytes(adev, bytes_moved, 0);
 
 	if (unlikely(r != 0))
 		return r;
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 8/9] drm/amdgpu: Asynchronously move BOs to visible VRAM
       [not found] ` <1498239580-17360-1-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
                     ` (4 preceding siblings ...)
  2017-06-23 17:39   ` [PATCH 7/9] drm/amdgpu: Throttle visible VRAM moves separately John Brooks
@ 2017-06-23 17:39   ` John Brooks
  2017-06-23 21:02   ` [PATCH 0/9] Visible VRAM Management Improvements Felix Kuehling
  2017-06-24 18:07   ` Christian König
  7 siblings, 0 replies; 31+ messages in thread
From: John Brooks @ 2017-06-23 17:39 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Michel Dänzer,
	Marek Olšák
  Cc: David Airlie, Christian König,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Moving CPU-accessible BOs from GTT to visible VRAM reduces latency on the
GPU and improves average framerate. However, it's an expensive operation.
When visible VRAM is full and evictions are necessary, it can easily take
tens of milliseconds. On the CS path, that directly increases the frame
time and causes noticeable momentary stutters.

Unlike other BO move operations, moving BOs to visible VRAM is a
housekeeping task and does not have to happen immediately. As a compromise
to allow evictions to occur and keep the contents of visible VRAM fresh,
but without stalling the rendering pipeline, we can defer these moves to a
worker thread.

Add a work function that moves a BO into visible VRAM and evicting other
BOs if necessary. And during CS, queue this work function for all requested
CPU_ACCESS_REQUIRED BOs (subject to the usual move throttling).

This decreases the frequency and severity of visible-VRAM-related
stuttering.

Signed-off-by: John Brooks <john@fastquake.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  5 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 14 ++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 45 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  7 +++++
 4 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 81dbb93..a809742 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -436,6 +436,10 @@ struct amdgpu_bo {
 	 * is associated to
 	 */
 	struct list_head		va;
+
+	/* Work item for moving this BO to visible VRAM asynchronously */
+	struct work_struct		move_vis_vram_work;
+
 	/* Constant after initialization */
 	struct drm_gem_object		gem_base;
 	struct amdgpu_bo		*parent;
@@ -1583,6 +1587,7 @@ struct amdgpu_device {
 	struct amdgpu_mman		mman;
 	struct amdgpu_vram_scratch	vram_scratch;
 	struct amdgpu_wb		wb;
+	struct workqueue_struct		*vis_vram_wq;
 	atomic64_t			vram_usage;
 	atomic64_t			vram_vis_usage;
 	atomic64_t			gtt_usage;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 25d6df6..9215611 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -341,14 +341,16 @@ static int amdgpu_cs_bo_validate(struct amdgpu_cs_parser *p,
 	if (p->bytes_moved < p->bytes_moved_threshold) {
 		if (adev->mc.visible_vram_size < adev->mc.real_vram_size &&
 		    (bo->flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED)) {
-			/* And don't move a CPU_ACCESS_REQUIRED BO to limited
-			 * visible VRAM if we've depleted our allowance to do
-			 * that.
+			/* Move CPU_ACCESS_REQUIRED BOs to limited visible VRAM
+			 * asynchronously, if we're allowed.
 			 */
-			if (p->bytes_moved_vis < p->bytes_moved_vis_threshold)
-				domain = bo->prefered_domains;
-			else
+			if (p->bytes_moved_vis < p->bytes_moved_vis_threshold) {
+				queue_work(adev->vis_vram_wq,
+					   &bo->move_vis_vram_work);
+				return 0;
+			} else {
 				domain = bo->allowed_domains;
+			}
 		} else {
 			domain = bo->prefered_domains;
 		}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 27d8c77..a69441d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -93,6 +93,8 @@ static void amdgpu_ttm_bo_destroy(struct ttm_buffer_object *tbo)
 
 	bo = container_of(tbo, struct amdgpu_bo, tbo);
 
+	cancel_work_sync(&bo->move_vis_vram_work);
+
 	amdgpu_update_memory_usage(adev, &bo->tbo.mem, NULL);
 
 	drm_gem_object_release(&bo->gem_base);
@@ -330,6 +332,47 @@ void amdgpu_bo_free_kernel(struct amdgpu_bo **bo, u64 *gpu_addr,
 		*cpu_addr = NULL;
 }
 
+static void amdgpu_bo_move_vis_vram_work_func(struct work_struct *work)
+{
+	struct amdgpu_bo *bo = container_of(work, struct amdgpu_bo,
+					    move_vis_vram_work);
+	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+	u64 initial_bytes_moved, bytes_moved;
+	uint32_t old_mem;
+	int r;
+
+	spin_lock(&adev->mm_stats.lock);
+	if (adev->mm_stats.accum_us_vis <= 0 ||
+	    adev->mm_stats.accum_us <= 0) {
+		spin_unlock(&adev->mm_stats.lock);
+		return;
+	}
+	spin_unlock(&adev->mm_stats.lock);
+
+	r = amdgpu_bo_reserve(bo, true);
+	if (r != 0)
+		return;
+
+	amdgpu_bo_clear_cpu_access_required(bo);
+
+	if (!(bo->flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED))
+		goto out;
+
+	amdgpu_ttm_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_VRAM);
+	old_mem = bo->tbo.mem.mem_type;
+	initial_bytes_moved = atomic64_read(&adev->num_bytes_moved);
+	ttm_bo_validate(&bo->tbo, &bo->placement, false, false);
+	bytes_moved = atomic64_read(&adev->num_bytes_moved) -
+				    initial_bytes_moved;
+	amdgpu_cs_report_moved_bytes(adev, bytes_moved, bytes_moved);
+
+	if (bo->tbo.mem.mem_type != old_mem)
+		bo->last_cs_move_jiffies = jiffies;
+
+out:
+	amdgpu_bo_unreserve(bo);
+}
+
 int amdgpu_bo_create_restricted(struct amdgpu_device *adev,
 				unsigned long size, int byte_align,
 				bool kernel, u32 domain, u64 flags,
@@ -382,6 +425,8 @@ int amdgpu_bo_create_restricted(struct amdgpu_device *adev,
 
 	bo->flags = flags;
 
+	INIT_WORK(&bo->move_vis_vram_work, amdgpu_bo_move_vis_vram_work_func);
+
 #ifdef CONFIG_X86_32
 	/* XXX: Write-combined CPU mappings of GTT seem broken on 32-bit
 	 * See https://bugs.freedesktop.org/show_bug.cgi?id=84627
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 0676a78..5852ca1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1120,6 +1120,13 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
 		return r;
 	}
 
+	/* Initialize workqueue for asynchronously moving BOs to visible VRAM.
+	 * We want to avoid lock contention (and therefore concurrency), so set
+	 * max_active = 1, and set unbound to disable concurrency management
+	 * (which can interleave sleeping workers).
+	 */
+	adev->vis_vram_wq = alloc_workqueue("amdgpu_vis_vram", WQ_UNBOUND, 1);
+
 	/* Reduce size of CPU-visible VRAM if requested */
 	vis_vram_limit = amdgpu_vis_vram_limit * 1024 * 1024;
 	if (amdgpu_vis_vram_limit > 0 &&
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 9/9] drm/amdgpu: Reduce lock contention when evicting from visible VRAM
  2017-06-23 17:39 [PATCH 0/9] Visible VRAM Management Improvements John Brooks
                   ` (2 preceding siblings ...)
       [not found] ` <1498239580-17360-1-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
@ 2017-06-23 17:39 ` John Brooks
  3 siblings, 0 replies; 31+ messages in thread
From: John Brooks @ 2017-06-23 17:39 UTC (permalink / raw)
  To: amd-gfx, Michel Dänzer, Marek Olšák; +Cc: dri-devel

Performing expensive BO moves asynchronously reduces the direct burden on
the CS path, but it can still indirectly cause occasional stalls because
the worker may reserve the BO for a long time during evictions, and if this
coincides with it being needed by CS, the CS path will have to wait.

Instead of reserving the BO and keeping it reserved while we wait for
ttm_bo_validate() to move it and perform any evictions, we can afford to be
more surgical and re-implement the ttm_bo_validate() path in the worker
function with some changes to make it friendlier to other threads.

Specifically, if evictions are needed when moving a BO to visible VRAM,
unreserve the BO while performing them, so as to not block other tasks for
too long. Also, sleep for an interval between each eviction so that the
worker doesn't hog lru_lock.

Signed-off-by: John Brooks <john@fastquake.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 88 +++++++++++++++++++++++++++++-
 drivers/gpu/drm/ttm/ttm_bo.c               | 34 +++++++-----
 include/drm/ttm/ttm_bo_driver.h            | 13 +++++
 include/drm/ttm/ttm_placement.h            |  6 ++
 4 files changed, 125 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index a69441d..854e037 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -337,10 +337,16 @@ static void amdgpu_bo_move_vis_vram_work_func(struct work_struct *work)
 	struct amdgpu_bo *bo = container_of(work, struct amdgpu_bo,
 					    move_vis_vram_work);
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+	struct ttm_placement placement;
+	struct ttm_mem_reg mem;
+	struct ttm_mem_type_manager *man = &bo->tbo.bdev->man[TTM_PL_VRAM];
 	u64 initial_bytes_moved, bytes_moved;
 	uint32_t old_mem;
+	uint32_t new_flags;
 	int r;
 
+	mem.mm_node = NULL;
+
 	spin_lock(&adev->mm_stats.lock);
 	if (adev->mm_stats.accum_us_vis <= 0 ||
 	    adev->mm_stats.accum_us <= 0) {
@@ -359,17 +365,97 @@ static void amdgpu_bo_move_vis_vram_work_func(struct work_struct *work)
 		goto out;
 
 	amdgpu_ttm_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_VRAM);
+	placement = bo->placement;
+
+	if (ttm_bo_mem_compat(&placement, &bo->tbo.mem, &new_flags))
+		goto out;
+
+	mem.num_pages = bo->tbo.num_pages;
+	mem.size = mem.num_pages << PAGE_SHIFT;
+	mem.page_alignment = bo->tbo.mem.page_alignment;
+	mem.bus.io_reserved_vm = false;
+	mem.bus.io_reserved_count = 0;
+
+	placement.num_busy_placement = 0;
+
 	old_mem = bo->tbo.mem.mem_type;
 	initial_bytes_moved = atomic64_read(&adev->num_bytes_moved);
-	ttm_bo_validate(&bo->tbo, &bo->placement, false, false);
+
+	r = ttm_bo_mem_space(&bo->tbo, &placement, &mem, false, false);
+	if (r == -ENOMEM) {
+		/* Unreserve the BO while we make space for it */
+		struct ttm_bo_device *bdev = bo->tbo.bdev;
+
+		amdgpu_bo_unreserve(bo);
+		do {
+			r = (*man->func->get_node)(man, NULL,
+						   &placement.placement[0],
+						   &mem);
+			if (unlikely(r != 0))
+				return;
+
+			if (mem.mm_node)
+				break;
+
+			r = ttm_mem_evict_first(bdev, TTM_PL_VRAM,
+						&placement.placement[0], false,
+						false);
+			if (unlikely(r != 0))
+				return;
+
+			/* Sleep to give other threads the opportunity to grab
+			 * lru_lock
+			 */
+			msleep(20);
+		} while (1);
+
+		if (!kref_read(&bo->tbo.kref)) {
+			/* The BO was deleted since we last held it. Abort. */
+			if (mem.mm_node)
+				(*man->func->put_node)(man, &mem);
+			return;
+		}
+
+		r = amdgpu_bo_reserve(bo, true);
+		if (r != 0) {
+			if (mem.mm_node)
+				(*man->func->put_node)(man, &mem);
+			return;
+		}
+
+		mem.mem_type = TTM_PL_VRAM;
+
+		r = ttm_bo_add_move_fence(&bo->tbo, man, &mem);
+		if (unlikely(r != 0))
+			goto out;
+
+		mem.placement = TTM_PL_FLAG_VRAM;
+		mem.placement |= (placement.placement[0].flags &
+				  man->available_caching);
+		mem.placement |= ttm_bo_select_caching(man,
+						       bo->tbo.mem.placement,
+						       mem.placement);
+		ttm_flag_masked(&mem.placement, placement.placement[0].flags,
+				~TTM_PL_MASK_MEMTYPE);
+	} else if (unlikely(r != 0)) {
+		goto out;
+	}
+
+	r = ttm_bo_handle_move_mem(&bo->tbo, &mem, false, false, false);
+
 	bytes_moved = atomic64_read(&adev->num_bytes_moved) -
 				    initial_bytes_moved;
 	amdgpu_cs_report_moved_bytes(adev, bytes_moved, bytes_moved);
 
+	if (unlikely(r != 0))
+		goto out;
+
 	if (bo->tbo.mem.mem_type != old_mem)
 		bo->last_cs_move_jiffies = jiffies;
 
 out:
+	if (r && mem.mm_node)
+		ttm_bo_mem_put(&bo->tbo, &mem);
 	amdgpu_bo_unreserve(bo);
 }
 
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index a6d7fcb..f783aa3 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -268,10 +268,10 @@ static int ttm_bo_add_ttm(struct ttm_buffer_object *bo, bool zero_alloc)
 	return ret;
 }
 
-static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
-				  struct ttm_mem_reg *mem,
-				  bool evict, bool interruptible,
-				  bool no_wait_gpu)
+int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
+			   struct ttm_mem_reg *mem,
+			   bool evict, bool interruptible,
+			   bool no_wait_gpu)
 {
 	struct ttm_bo_device *bdev = bo->bdev;
 	bool old_is_pci = ttm_mem_reg_is_pci(bdev, &bo->mem);
@@ -373,6 +373,7 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
 
 	return ret;
 }
+EXPORT_SYMBOL(ttm_bo_handle_move_mem);
 
 /**
  * Call bo::reserved.
@@ -695,11 +696,11 @@ bool ttm_bo_eviction_valuable(struct ttm_buffer_object *bo,
 }
 EXPORT_SYMBOL(ttm_bo_eviction_valuable);
 
-static int ttm_mem_evict_first(struct ttm_bo_device *bdev,
-				uint32_t mem_type,
-				const struct ttm_place *place,
-				bool interruptible,
-				bool no_wait_gpu)
+int ttm_mem_evict_first(struct ttm_bo_device *bdev,
+			uint32_t mem_type,
+			const struct ttm_place *place,
+			bool interruptible,
+			bool no_wait_gpu)
 {
 	struct ttm_bo_global *glob = bdev->glob;
 	struct ttm_mem_type_manager *man = &bdev->man[mem_type];
@@ -753,6 +754,7 @@ static int ttm_mem_evict_first(struct ttm_bo_device *bdev,
 	kref_put(&bo->list_kref, ttm_bo_release_list);
 	return ret;
 }
+EXPORT_SYMBOL(ttm_mem_evict_first);
 
 void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct ttm_mem_reg *mem)
 {
@@ -766,9 +768,9 @@ EXPORT_SYMBOL(ttm_bo_mem_put);
 /**
  * Add the last move fence to the BO and reserve a new shared slot.
  */
-static int ttm_bo_add_move_fence(struct ttm_buffer_object *bo,
-				 struct ttm_mem_type_manager *man,
-				 struct ttm_mem_reg *mem)
+int ttm_bo_add_move_fence(struct ttm_buffer_object *bo,
+			  struct ttm_mem_type_manager *man,
+			  struct ttm_mem_reg *mem)
 {
 	struct dma_fence *fence;
 	int ret;
@@ -790,6 +792,7 @@ static int ttm_bo_add_move_fence(struct ttm_buffer_object *bo,
 
 	return 0;
 }
+EXPORT_SYMBOL(ttm_bo_add_move_fence);
 
 /**
  * Repeatedly evict memory from the LRU for @mem_type until we create enough
@@ -821,9 +824,9 @@ static int ttm_bo_mem_force_space(struct ttm_buffer_object *bo,
 	return ttm_bo_add_move_fence(bo, man, mem);
 }
 
-static uint32_t ttm_bo_select_caching(struct ttm_mem_type_manager *man,
-				      uint32_t cur_placement,
-				      uint32_t proposed_placement)
+uint32_t ttm_bo_select_caching(struct ttm_mem_type_manager *man,
+			       uint32_t cur_placement,
+			       uint32_t proposed_placement)
 {
 	uint32_t caching = proposed_placement & TTM_PL_MASK_CACHING;
 	uint32_t result = proposed_placement & ~TTM_PL_MASK_CACHING;
@@ -845,6 +848,7 @@ static uint32_t ttm_bo_select_caching(struct ttm_mem_type_manager *man,
 
 	return result;
 }
+EXPORT_SYMBOL(ttm_bo_select_caching);
 
 static bool ttm_bo_mt_compatible(struct ttm_mem_type_manager *man,
 				 uint32_t mem_type,
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index 990d529..3bf267c 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -730,6 +730,11 @@ extern int ttm_bo_mem_space(struct ttm_buffer_object *bo,
 				bool interruptible,
 				bool no_wait_gpu);
 
+extern int ttm_mem_evict_first(struct ttm_bo_device *bdev,
+			       uint32_t mem_type,
+			       const struct ttm_place *place,
+			       bool interruptible, bool no_wait_gpu);
+
 extern void ttm_bo_mem_put(struct ttm_buffer_object *bo,
 			   struct ttm_mem_reg *mem);
 extern void ttm_bo_mem_put_locked(struct ttm_buffer_object *bo,
@@ -740,6 +745,14 @@ extern int ttm_bo_global_init(struct drm_global_reference *ref);
 
 extern int ttm_bo_device_release(struct ttm_bo_device *bdev);
 
+extern int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
+				  struct ttm_mem_reg *mem,
+				  bool evict, bool interruptible,
+				  bool no_wait_gpu);
+extern int ttm_bo_add_move_fence(struct ttm_buffer_object *bo,
+				 struct ttm_mem_type_manager *man,
+				 struct ttm_mem_reg *mem);
+
 /**
  * ttm_bo_device_init
  *
diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
index e88a8e3..6455214 100644
--- a/include/drm/ttm/ttm_placement.h
+++ b/include/drm/ttm/ttm_placement.h
@@ -105,4 +105,10 @@ struct ttm_placement {
 	const struct ttm_place	*busy_placement;
 };
 
+struct ttm_mem_type_manager;
+
+extern uint32_t ttm_bo_select_caching(struct ttm_mem_type_manager *man,
+				      uint32_t cur_placement,
+				      uint32_t proposed_placement);
+
 #endif
-- 
2.7.4

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/9] Visible VRAM Management Improvements
       [not found] ` <1498239580-17360-1-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
                     ` (5 preceding siblings ...)
  2017-06-23 17:39   ` [PATCH 8/9] drm/amdgpu: Asynchronously move BOs to visible VRAM John Brooks
@ 2017-06-23 21:02   ` Felix Kuehling
       [not found]     ` <82339d2d-481c-ab3f-1590-ab22f0eac371-5C7GfCeVMHo@public.gmane.org>
  2017-06-24 18:07   ` Christian König
  7 siblings, 1 reply; 31+ messages in thread
From: Felix Kuehling @ 2017-06-23 21:02 UTC (permalink / raw)
  To: John Brooks, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Michel Dänzer, Marek Olšák
  Cc: David Airlie, Christian König,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Hi John,

I haven't read your patches. Just a question based on the cover letter.

I understand that visible VRAM is the biggest pain point. But could the
same reasoning make sense for invisible VRAM? That is, doing all the
migrations to VRAM in a workqueue?

Regards,
  Felix


On 17-06-23 01:39 PM, John Brooks wrote:
> This patch series is intended to improve performance when limited CPU-visible
> VRAM is under pressure.
>
> Moving BOs into visible VRAM is essentially a housekeeping task. It's faster to
> access them in VRAM than GTT, but it isn't a hard requirement for them to be in
> VRAM. As such, it is unnecessary to spend valuable time blocking on this in the
> page fault handler or during command submission. Doing so translates directly
> into a longer frame time (ergo stalls and stuttering).
>
> The problem worsens when attempting to move BOs into visible VRAM when it is
> full. This takes much longer than a simple move because other BOs have to be
> evicted, which involves finding and then moving potentially hundreds of other
> BOs, which is very time consuming. In the case of limited visible VRAM, it's
> important to do this sometime to keep the contents of visible VRAM fresh, but
> it does not need to be a blocking operation. If visible VRAM is full, the BO
> can be read from GTT in the meantime and the BO can be moved to VRAM later.
>
> Thus, I have made it so that neither the command submission code nor page fault
> handler spends time evicting BOs from visible VRAM, and instead this is
> deferred to a workqueue function that's queued when CS requests BOs flagged
> AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED.
>
> Speaking of CPU_ACCESS_REQUIRED, I've changed the handling of that flag so that
> the kernel driver can clear it later even if it was set by userspace. This is
> because the userspace graphics library can't know whether the application
> really needs it to be CPU_ACCESS_REQUIRED forever. The kernel driver can't know
> either, but it does know when page faults occur, and if a BO doesn't appear to
> have any page faults when it's moved somewhere inaccessible, the flag can be
> removed and it doesn't have to take up space in CPU-visible memory anymore.
> This change was based on IRC discussions with Michel.
>
> Patch 7 fixes a problem with BO moverate throttling that causes visible VRAM
> moves to not be throttled if total VRAM isn't full enough.
>
> I've also added a vis_vramlimit module parameter for debugging purposes. It's
> similar to the vramlimit parameter except it limits only visible VRAM.
>
> I have tested this patch set with the two games I know to be affected by
> visible VRAM pressure: DiRT Rally and Dying Light. It practically eliminates
> eviction-related stuttering in DiRT Rally as well as very low performance if
> visible VRAM is limited to 64MB. It also fixes severely low framerates that
> occurred in some areas of Dying Light. All my testing was done with an R9 290
> with 4GB of visible VRAM with an Intel i7 4790.
>
> --
> John Brooks (Frogging101)
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/9] Visible VRAM Management Improvements
       [not found]     ` <82339d2d-481c-ab3f-1590-ab22f0eac371-5C7GfCeVMHo@public.gmane.org>
@ 2017-06-23 23:16       ` John Brooks
  2017-06-24 18:20         ` Christian König
  0 siblings, 1 reply; 31+ messages in thread
From: John Brooks @ 2017-06-23 23:16 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Marek Olšák, David Airlie, Michel Dänzer,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Christian König,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On Fri, Jun 23, 2017 at 05:02:58PM -0400, Felix Kuehling wrote:
> Hi John,
> 
> I haven't read your patches. Just a question based on the cover letter.
> 
> I understand that visible VRAM is the biggest pain point. But could the
> same reasoning make sense for invisible VRAM? That is, doing all the
> migrations to VRAM in a workqueue?
> 
> Regards,
>   Felix
> 

I don't see why not. In theory, all non-essential buffer moves could be done
this way, and it would be relatively trivial to extend it to that.

But I wanted to limit the scope of my changes, at least for this series.
Testing takes a long time and I wanted to focus those testing efforts as much
as possible, produce something well-tested (I hope), and get feedback on this
limited application of the concept before expanding its reach.

John

> 
> On 17-06-23 01:39 PM, John Brooks wrote:
> > This patch series is intended to improve performance when limited CPU-visible
> > VRAM is under pressure.
> >
> > Moving BOs into visible VRAM is essentially a housekeeping task. It's faster to
> > access them in VRAM than GTT, but it isn't a hard requirement for them to be in
> > VRAM. As such, it is unnecessary to spend valuable time blocking on this in the
> > page fault handler or during command submission. Doing so translates directly
> > into a longer frame time (ergo stalls and stuttering).
> >
> > The problem worsens when attempting to move BOs into visible VRAM when it is
> > full. This takes much longer than a simple move because other BOs have to be
> > evicted, which involves finding and then moving potentially hundreds of other
> > BOs, which is very time consuming. In the case of limited visible VRAM, it's
> > important to do this sometime to keep the contents of visible VRAM fresh, but
> > it does not need to be a blocking operation. If visible VRAM is full, the BO
> > can be read from GTT in the meantime and the BO can be moved to VRAM later.
> >
> > Thus, I have made it so that neither the command submission code nor page fault
> > handler spends time evicting BOs from visible VRAM, and instead this is
> > deferred to a workqueue function that's queued when CS requests BOs flagged
> > AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED.
> >
> > Speaking of CPU_ACCESS_REQUIRED, I've changed the handling of that flag so that
> > the kernel driver can clear it later even if it was set by userspace. This is
> > because the userspace graphics library can't know whether the application
> > really needs it to be CPU_ACCESS_REQUIRED forever. The kernel driver can't know
> > either, but it does know when page faults occur, and if a BO doesn't appear to
> > have any page faults when it's moved somewhere inaccessible, the flag can be
> > removed and it doesn't have to take up space in CPU-visible memory anymore.
> > This change was based on IRC discussions with Michel.
> >
> > Patch 7 fixes a problem with BO moverate throttling that causes visible VRAM
> > moves to not be throttled if total VRAM isn't full enough.
> >
> > I've also added a vis_vramlimit module parameter for debugging purposes. It's
> > similar to the vramlimit parameter except it limits only visible VRAM.
> >
> > I have tested this patch set with the two games I know to be affected by
> > visible VRAM pressure: DiRT Rally and Dying Light. It practically eliminates
> > eviction-related stuttering in DiRT Rally as well as very low performance if
> > visible VRAM is limited to 64MB. It also fixes severely low framerates that
> > occurred in some areas of Dying Light. All my testing was done with an R9 290
> > with 4GB of visible VRAM with an Intel i7 4790.
> >
> > --
> > John Brooks (Frogging101)
> >
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/9] drm/amdgpu: Set/clear CPU_ACCESS_REQUIRED flag on page fault and CS
       [not found]     ` <1498239580-17360-7-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
@ 2017-06-24 18:00       ` Christian König
  2017-06-25  1:57         ` John Brooks
       [not found]         ` <55ea5e84-0791-5a70-6278-ade83c343a3b-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 2 replies; 31+ messages in thread
From: Christian König @ 2017-06-24 18:00 UTC (permalink / raw)
  To: John Brooks, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Michel Dänzer, Marek Olšák
  Cc: David Airlie, dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 23.06.2017 um 19:39 schrieb John Brooks:
> When the AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED flag is given by userspace,
> it should only be treated as a hint to initially place a BO somewhere CPU
> accessible, rather than having a permanent effect on BO placement.

And that is a clear NAK from my side.

CPU_ACCESS_REQUIRED is a permanent limitation to where the buffer should 
be placed.

Ignoring that and changing the IOCTL interface like this is clearly not 
acceptable.

Christian.

>
> Instead of the flag being set in stone at BO creation, set the flag when a
> page fault occurs so that it goes somewhere CPU-visible, and clear it when
> the BO is requested by the GPU.
>
> However, clearing the CPU_ACCESS_REQUIRED flag may move a BO to invisible
> VRAM, which is likely to cause a page fault that moves it right back to
> GTT. When this happens too much, it is highly detrimental to performance.
> Only clear the flag on CS if:
>
> - The BO wasn't page faulted for a certain amount of time (currently 10
>    seconds, measured with jiffies), and
> - its last page fault didn't occur too soon (currently 500ms) after its
>    last CS request, or vice versa.
>
> Setting the flag in amdgpu_fault_reserve_notify() also means that we can
> remove the loop to restrict lpfn to the end of visible VRAM, because
> amdgpu_ttm_placement_init() will do it for us.
>
> Signed-off-by: John Brooks <john@fastquake.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 46 ++++++++++++++++++++++--------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
>   3 files changed, 36 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 2fad8bd..73d6882 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -320,6 +320,7 @@ static int amdgpu_cs_bo_validate(struct amdgpu_cs_parser *p,
>   	else
>   		domain = bo->allowed_domains;
>   
> +	amdgpu_bo_clear_cpu_access_required(bo);
>   retry:
>   	amdgpu_ttm_placement_from_domain(bo, domain);
>   	initial_bytes_moved = atomic64_read(&adev->num_bytes_moved);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 31d1f21..a7d48a7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -967,8 +967,8 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
>   {
>   	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->bdev);
>   	struct amdgpu_bo *abo;
> -	unsigned long offset, size, lpfn;
> -	int i, r;
> +	unsigned long offset, size;
> +	int r;
>   
>   	if (!amdgpu_ttm_bo_is_amdgpu_bo(bo))
>   		return 0;
> @@ -991,18 +991,9 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
>   
>   	/* hurrah the memory is not visible ! */
>   	atomic64_inc(&adev->num_vram_cpu_page_faults);
> +	abo->flags |= AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
>   	amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM |
>   					 AMDGPU_GEM_DOMAIN_GTT);
> -	lpfn =	adev->mc.visible_vram_size >> PAGE_SHIFT;
> -	for (i = 0; i < abo->placement.num_placement; i++) {
> -		/* Try to move the BO into visible VRAM */
> -		if ((abo->placements[i].flags & TTM_PL_FLAG_VRAM) &&
> -		    (!abo->placements[i].lpfn ||
> -		     abo->placements[i].lpfn > lpfn))
> -			abo->placements[i].lpfn = lpfn;
> -	}
> -	abo->placement.busy_placement = abo->placement.placement;
> -	abo->placement.num_busy_placement = abo->placement.num_placement;
>   	r = ttm_bo_validate(bo, &abo->placement, false, false);
>   	if (unlikely(r != 0))
>   		return r;
> @@ -1057,3 +1048,34 @@ u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo)
>   
>   	return bo->tbo.offset;
>   }
> +
> +/**
> + * amdgpu_bo_clear_cpu_access_required
> + * @bo: BO to update
> + *
> + * Clears CPU_ACCESS_REQUIRED flag if the BO hasn't had a page fault in a while
> + * and it didn't have a page fault too soon after the last time it was moved to
> + * VRAM.
> + *
> + * Caller should have bo reserved.
> + *
> + */
> +void amdgpu_bo_clear_cpu_access_required(struct amdgpu_bo *bo)
> +{
> +	const unsigned int page_fault_timeout_ms = 10000;
> +	const unsigned int min_period_ms = 500;
> +	unsigned int ms_since_pf, period_ms;
> +
> +	ms_since_pf = jiffies_to_msecs(jiffies - bo->last_page_fault_jiffies);
> +	period_ms = jiffies_to_msecs(abs(bo->last_page_fault_jiffies -
> +					 bo->last_cs_move_jiffies));
> +
> +	/*
> +	 * Try to avoid a revolving door between GTT and VRAM. Clearing the
> +	 * flag may move this BO back to VRAM, so don't clear it if it's likely
> +	 * to page fault and go right back to GTT.
> +	 */
> +	if ((!bo->last_page_fault_jiffies || !bo->last_cs_move_jiffies) ||
> +	    (ms_since_pf > page_fault_timeout_ms && period_ms > min_period_ms))
> +		bo->flags &= ~AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> index 3824851..b0cb137 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> @@ -182,6 +182,7 @@ int amdgpu_bo_restore_from_shadow(struct amdgpu_device *adev,
>   				  struct reservation_object *resv,
>   				  struct dma_fence **fence,
>   				  bool direct);
> +void amdgpu_bo_clear_cpu_access_required(struct amdgpu_bo *bo);
>   
>   
>   /*


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/9] Visible VRAM Management Improvements
       [not found] ` <1498239580-17360-1-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
                     ` (6 preceding siblings ...)
  2017-06-23 21:02   ` [PATCH 0/9] Visible VRAM Management Improvements Felix Kuehling
@ 2017-06-24 18:07   ` Christian König
       [not found]     ` <3cd916a7-6734-5eff-b645-66f3ee83f13a-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  7 siblings, 1 reply; 31+ messages in thread
From: Christian König @ 2017-06-24 18:07 UTC (permalink / raw)
  To: John Brooks, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Michel Dänzer, Marek Olšák
  Cc: David Airlie, dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 23.06.2017 um 19:39 schrieb John Brooks:
> This patch series is intended to improve performance when limited CPU-visible
> VRAM is under pressure.
>
> Moving BOs into visible VRAM is essentially a housekeeping task. It's faster to
> access them in VRAM than GTT, but it isn't a hard requirement for them to be in
> VRAM. As such, it is unnecessary to spend valuable time blocking on this in the
> page fault handler or during command submission. Doing so translates directly
> into a longer frame time (ergo stalls and stuttering).

Sorry, but that strongly sounds like you are messing with things you 
don't fully understand.

Blocking in the page fault handler is mandatory to handle the page fault 
correctly. So that is not something we can change easily.

Regards,
Christian.

>
> The problem worsens when attempting to move BOs into visible VRAM when it is
> full. This takes much longer than a simple move because other BOs have to be
> evicted, which involves finding and then moving potentially hundreds of other
> BOs, which is very time consuming. In the case of limited visible VRAM, it's
> important to do this sometime to keep the contents of visible VRAM fresh, but
> it does not need to be a blocking operation. If visible VRAM is full, the BO
> can be read from GTT in the meantime and the BO can be moved to VRAM later.
>
> Thus, I have made it so that neither the command submission code nor page fault
> handler spends time evicting BOs from visible VRAM, and instead this is
> deferred to a workqueue function that's queued when CS requests BOs flagged
> AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED.
>
> Speaking of CPU_ACCESS_REQUIRED, I've changed the handling of that flag so that
> the kernel driver can clear it later even if it was set by userspace. This is
> because the userspace graphics library can't know whether the application
> really needs it to be CPU_ACCESS_REQUIRED forever. The kernel driver can't know
> either, but it does know when page faults occur, and if a BO doesn't appear to
> have any page faults when it's moved somewhere inaccessible, the flag can be
> removed and it doesn't have to take up space in CPU-visible memory anymore.
> This change was based on IRC discussions with Michel.
>
> Patch 7 fixes a problem with BO moverate throttling that causes visible VRAM
> moves to not be throttled if total VRAM isn't full enough.
>
> I've also added a vis_vramlimit module parameter for debugging purposes. It's
> similar to the vramlimit parameter except it limits only visible VRAM.
>
> I have tested this patch set with the two games I know to be affected by
> visible VRAM pressure: DiRT Rally and Dying Light. It practically eliminates
> eviction-related stuttering in DiRT Rally as well as very low performance if
> visible VRAM is limited to 64MB. It also fixes severely low framerates that
> occurred in some areas of Dying Light. All my testing was done with an R9 290
> with 4GB of visible VRAM with an Intel i7 4790.
>
> --
> John Brooks (Frogging101)
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] drm/amdgpu: Don't force BOs into visible VRAM if they can go to GTT instead
       [not found]     ` <1498239580-17360-5-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
@ 2017-06-24 18:09       ` Christian König
       [not found]         ` <0c5064f9-5b84-8833-b410-055b5e2064bf-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Christian König @ 2017-06-24 18:09 UTC (permalink / raw)
  To: John Brooks, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Michel Dänzer, Marek Olšák
  Cc: David Airlie, dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 23.06.2017 um 19:39 schrieb John Brooks:
> amdgpu_ttm_placement_init() callers that are using both VRAM and GTT as
> domains usually don't want visible VRAM as a busy placement.
>
> Signed-off-by: John Brooks <john@fastquake.com>

NAK to this as well. Some callers of amdgpu_ttm_placement_init() have 
hard placement limitations that BOs *MUST* be in VRAM (VM page tables 
and old UVD hardware).

So changing that here will just break those under memory pressure.

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 751bc05..0ff555a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -138,7 +138,15 @@ static void amdgpu_ttm_placement_init(struct amdgpu_device *adev,
>   		if (flags & AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS)
>   			places[c].flags |= TTM_PL_FLAG_CONTIGUOUS;
>   
> -		busy_places[bc++] = places[c++];
> +		/* Don't set limited visible VRAM as a busy placement if we can
> +		 * use GTT instead
> +		 */
> +		if (!((flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) &&
> +		      adev->mc.visible_vram_size < adev->mc.real_vram_size &&
> +		      (domain & AMDGPU_GEM_DOMAIN_GTT)))
> +			busy_places[bc++] = places[c];
> +
> +		c++;
>   	}
>   
>   	if (domain & AMDGPU_GEM_DOMAIN_GTT) {


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/9] Visible VRAM Management Improvements
  2017-06-23 23:16       ` John Brooks
@ 2017-06-24 18:20         ` Christian König
       [not found]           ` <644cf9b4-e22b-eab1-a505-b0e1f9850f82-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Christian König @ 2017-06-24 18:20 UTC (permalink / raw)
  To: John Brooks, Felix Kuehling; +Cc: Michel Dänzer, dri-devel, amd-gfx

Am 24.06.2017 um 01:16 schrieb John Brooks:
> On Fri, Jun 23, 2017 at 05:02:58PM -0400, Felix Kuehling wrote:
>> Hi John,
>>
>> I haven't read your patches. Just a question based on the cover letter.
>>
>> I understand that visible VRAM is the biggest pain point. But could the
>> same reasoning make sense for invisible VRAM? That is, doing all the
>> migrations to VRAM in a workqueue?
>>
>> Regards,
>>    Felix
>>
> I don't see why not. In theory, all non-essential buffer moves could be done
> this way, and it would be relatively trivial to extend it to that.
>
> But I wanted to limit the scope of my changes, at least for this series.
> Testing takes a long time and I wanted to focus those testing efforts as much
> as possible, produce something well-tested (I hope), and get feedback on this
> limited application of the concept before expanding its reach.

Yeah, sorry I have to say that but the whole approach is utterly nonsense.

What happens is that the delayed BO can only be moved AFTER the command 
submission which wants it to be in VRAM.

So you use the BO in a CS and *then* move it to where the CS wants it to 
be, no matter if the BO is then needed there or not.

Regards,
Christian.


>
> John
>
>> On 17-06-23 01:39 PM, John Brooks wrote:
>>> This patch series is intended to improve performance when limited CPU-visible
>>> VRAM is under pressure.
>>>
>>> Moving BOs into visible VRAM is essentially a housekeeping task. It's faster to
>>> access them in VRAM than GTT, but it isn't a hard requirement for them to be in
>>> VRAM. As such, it is unnecessary to spend valuable time blocking on this in the
>>> page fault handler or during command submission. Doing so translates directly
>>> into a longer frame time (ergo stalls and stuttering).
>>>
>>> The problem worsens when attempting to move BOs into visible VRAM when it is
>>> full. This takes much longer than a simple move because other BOs have to be
>>> evicted, which involves finding and then moving potentially hundreds of other
>>> BOs, which is very time consuming. In the case of limited visible VRAM, it's
>>> important to do this sometime to keep the contents of visible VRAM fresh, but
>>> it does not need to be a blocking operation. If visible VRAM is full, the BO
>>> can be read from GTT in the meantime and the BO can be moved to VRAM later.
>>>
>>> Thus, I have made it so that neither the command submission code nor page fault
>>> handler spends time evicting BOs from visible VRAM, and instead this is
>>> deferred to a workqueue function that's queued when CS requests BOs flagged
>>> AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED.
>>>
>>> Speaking of CPU_ACCESS_REQUIRED, I've changed the handling of that flag so that
>>> the kernel driver can clear it later even if it was set by userspace. This is
>>> because the userspace graphics library can't know whether the application
>>> really needs it to be CPU_ACCESS_REQUIRED forever. The kernel driver can't know
>>> either, but it does know when page faults occur, and if a BO doesn't appear to
>>> have any page faults when it's moved somewhere inaccessible, the flag can be
>>> removed and it doesn't have to take up space in CPU-visible memory anymore.
>>> This change was based on IRC discussions with Michel.
>>>
>>> Patch 7 fixes a problem with BO moverate throttling that causes visible VRAM
>>> moves to not be throttled if total VRAM isn't full enough.
>>>
>>> I've also added a vis_vramlimit module parameter for debugging purposes. It's
>>> similar to the vramlimit parameter except it limits only visible VRAM.
>>>
>>> I have tested this patch set with the two games I know to be affected by
>>> visible VRAM pressure: DiRT Rally and Dying Light. It practically eliminates
>>> eviction-related stuttering in DiRT Rally as well as very low performance if
>>> visible VRAM is limited to 64MB. It also fixes severely low framerates that
>>> occurred in some areas of Dying Light. All my testing was done with an R9 290
>>> with 4GB of visible VRAM with an Intel i7 4790.
>>>
>>> --
>>> John Brooks (Frogging101)
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/9] Visible VRAM Management Improvements
       [not found]     ` <3cd916a7-6734-5eff-b645-66f3ee83f13a-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-06-24 18:36       ` John Brooks
  2017-06-25 11:31         ` Christian König
  0 siblings, 1 reply; 31+ messages in thread
From: John Brooks @ 2017-06-24 18:36 UTC (permalink / raw)
  To: Christian König
  Cc: David Airlie, Michel Dänzer,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Marek Olšák,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On Sat, Jun 24, 2017 at 08:07:15PM +0200, Christian König wrote:
> Am 23.06.2017 um 19:39 schrieb John Brooks:
> >This patch series is intended to improve performance when limited CPU-visible
> >VRAM is under pressure.
> >
> >Moving BOs into visible VRAM is essentially a housekeeping task. It's faster to
> >access them in VRAM than GTT, but it isn't a hard requirement for them to be in
> >VRAM. As such, it is unnecessary to spend valuable time blocking on this in the
> >page fault handler or during command submission. Doing so translates directly
> >into a longer frame time (ergo stalls and stuttering).
> 
> Sorry, but that strongly sounds like you are messing with things you don't
> fully understand.
> 
> Blocking in the page fault handler is mandatory to handle the page fault
> correctly. So that is not something we can change easily.

I do understand that. Indeed, the page fault handler must block until the
memory is accessible. But the memory doesn't have to be in visible VRAM to be
accessible; it could also be in GTT. So, what I meant was that it does not have
to block for the excessively long time it takes to move BOs into visible VRAM
when it is already full. It could spend less time blocking by just moving it to
GTT in that situation. I apologize for the poor wording.

John

> 
> Regards,
> Christian.
> 
> >
> >The problem worsens when attempting to move BOs into visible VRAM when it is
> >full. This takes much longer than a simple move because other BOs have to be
> >evicted, which involves finding and then moving potentially hundreds of other
> >BOs, which is very time consuming. In the case of limited visible VRAM, it's
> >important to do this sometime to keep the contents of visible VRAM fresh, but
> >it does not need to be a blocking operation. If visible VRAM is full, the BO
> >can be read from GTT in the meantime and the BO can be moved to VRAM later.
> >
> >Thus, I have made it so that neither the command submission code nor page fault
> >handler spends time evicting BOs from visible VRAM, and instead this is
> >deferred to a workqueue function that's queued when CS requests BOs flagged
> >AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED.
> >
> >Speaking of CPU_ACCESS_REQUIRED, I've changed the handling of that flag so that
> >the kernel driver can clear it later even if it was set by userspace. This is
> >because the userspace graphics library can't know whether the application
> >really needs it to be CPU_ACCESS_REQUIRED forever. The kernel driver can't know
> >either, but it does know when page faults occur, and if a BO doesn't appear to
> >have any page faults when it's moved somewhere inaccessible, the flag can be
> >removed and it doesn't have to take up space in CPU-visible memory anymore.
> >This change was based on IRC discussions with Michel.
> >
> >Patch 7 fixes a problem with BO moverate throttling that causes visible VRAM
> >moves to not be throttled if total VRAM isn't full enough.
> >
> >I've also added a vis_vramlimit module parameter for debugging purposes. It's
> >similar to the vramlimit parameter except it limits only visible VRAM.
> >
> >I have tested this patch set with the two games I know to be affected by
> >visible VRAM pressure: DiRT Rally and Dying Light. It practically eliminates
> >eviction-related stuttering in DiRT Rally as well as very low performance if
> >visible VRAM is limited to 64MB. It also fixes severely low framerates that
> >occurred in some areas of Dying Light. All my testing was done with an R9 290
> >with 4GB of visible VRAM with an Intel i7 4790.
> >
> >--
> >John Brooks (Frogging101)
> >
> >_______________________________________________
> >amd-gfx mailing list
> >amd-gfx@lists.freedesktop.org
> >https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 
> 
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] drm/amdgpu: Don't force BOs into visible VRAM if they can go to GTT instead
       [not found]         ` <0c5064f9-5b84-8833-b410-055b5e2064bf-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-06-24 18:37           ` John Brooks
  0 siblings, 0 replies; 31+ messages in thread
From: John Brooks @ 2017-06-24 18:37 UTC (permalink / raw)
  To: Christian König
  Cc: David Airlie, Michel Dänzer,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Marek Olšák,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On Sat, Jun 24, 2017 at 08:09:56PM +0200, Christian König wrote:
> Am 23.06.2017 um 19:39 schrieb John Brooks:
> >amdgpu_ttm_placement_init() callers that are using both VRAM and GTT as
> >domains usually don't want visible VRAM as a busy placement.
> >
> >Signed-off-by: John Brooks <john@fastquake.com>
> 
> NAK to this as well. Some callers of amdgpu_ttm_placement_init() have hard
> placement limitations that BOs *MUST* be in VRAM (VM page tables and old UVD
> hardware).
> 
> So changing that here will just break those under memory pressure.
> 
> Regards,
> Christian.

Right, that was just some silly attempt at deduplication. I can rework it.

Thanks,
John

> 
> >---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> >diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> >index 751bc05..0ff555a 100644
> >--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> >+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> >@@ -138,7 +138,15 @@ static void amdgpu_ttm_placement_init(struct amdgpu_device *adev,
> >  		if (flags & AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS)
> >  			places[c].flags |= TTM_PL_FLAG_CONTIGUOUS;
> >-		busy_places[bc++] = places[c++];
> >+		/* Don't set limited visible VRAM as a busy placement if we can
> >+		 * use GTT instead
> >+		 */
> >+		if (!((flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) &&
> >+		      adev->mc.visible_vram_size < adev->mc.real_vram_size &&
> >+		      (domain & AMDGPU_GEM_DOMAIN_GTT)))
> >+			busy_places[bc++] = places[c];
> >+
> >+		c++;
> >  	}
> >  	if (domain & AMDGPU_GEM_DOMAIN_GTT) {
> 
> 
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/9] Visible VRAM Management Improvements
       [not found]           ` <644cf9b4-e22b-eab1-a505-b0e1f9850f82-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-06-24 21:50             ` John Brooks
  2017-06-25 11:54               ` Christian König
  0 siblings, 1 reply; 31+ messages in thread
From: John Brooks @ 2017-06-24 21:50 UTC (permalink / raw)
  To: Christian König
  Cc: Marek Olšák, David Airlie, Felix Kuehling,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Michel Dänzer

On Sat, Jun 24, 2017 at 08:20:22PM +0200, Christian König wrote:
> Am 24.06.2017 um 01:16 schrieb John Brooks:
> >On Fri, Jun 23, 2017 at 05:02:58PM -0400, Felix Kuehling wrote:
> >>Hi John,
> >>
> >>I haven't read your patches. Just a question based on the cover letter.
> >>
> >>I understand that visible VRAM is the biggest pain point. But could the
> >>same reasoning make sense for invisible VRAM? That is, doing all the
> >>migrations to VRAM in a workqueue?
> >>
> >>Regards,
> >>   Felix
> >>
> >I don't see why not. In theory, all non-essential buffer moves could be done
> >this way, and it would be relatively trivial to extend it to that.
> >
> >But I wanted to limit the scope of my changes, at least for this series.
> >Testing takes a long time and I wanted to focus those testing efforts as much
> >as possible, produce something well-tested (I hope), and get feedback on this
> >limited application of the concept before expanding its reach.
> 
> Yeah, sorry I have to say that but the whole approach is utterly nonsense.
> 
> What happens is that the delayed BO can only be moved AFTER the command
> submission which wants it to be in VRAM.
> 
> So you use the BO in a CS and *then* move it to where the CS wants it to be,
> no matter if the BO is then needed there or not.
> 
> Regards,
> Christian.
> 

I'm aware of the effect it has. The BO won't be in VRAM for the current command
submission, but it'll be there for a future one. If a BO is used at a given
time, then it's likely it'll be used again soon. In which case you'll come out
ahead on latency even if the GPU has to read it from GTT a few times. In any
case, it's never going to hurt as much as full-stop waiting for a worst-case BO
move that needs a lot of evictions.

Feel free to correct my understanding; you'd certainly know any of this better
than I do. But my tests indicate that immediate forced moves during CS cause
stalls, and the average framerate with delayed moves is the almost (~2%) the
same as with immediate ones, which is about 9% higher than with no forced moves
during CS at all.

DiRT Rally average framerates:
    With the whole patch set (n=3):
        89.56
    Without it (drm-next-4.13 5ac55629d6b3fcde69f46aa772c6e83be0bdcbbf)
    (n=3):
        91.16 (+stalls)
    Patches 1 and 3 only, and with GTT set as the only busy placement for
    CPU_ACCESS_REQUIRED BOs in amdgpu_cs_bo_validate (n=3):
        82.15

John

> 
> >
> >John
> >
> >>On 17-06-23 01:39 PM, John Brooks wrote:
> >>>This patch series is intended to improve performance when limited CPU-visible
> >>>VRAM is under pressure.
> >>>
> >>>Moving BOs into visible VRAM is essentially a housekeeping task. It's faster to
> >>>access them in VRAM than GTT, but it isn't a hard requirement for them to be in
> >>>VRAM. As such, it is unnecessary to spend valuable time blocking on this in the
> >>>page fault handler or during command submission. Doing so translates directly
> >>>into a longer frame time (ergo stalls and stuttering).
> >>>
> >>>The problem worsens when attempting to move BOs into visible VRAM when it is
> >>>full. This takes much longer than a simple move because other BOs have to be
> >>>evicted, which involves finding and then moving potentially hundreds of other
> >>>BOs, which is very time consuming. In the case of limited visible VRAM, it's
> >>>important to do this sometime to keep the contents of visible VRAM fresh, but
> >>>it does not need to be a blocking operation. If visible VRAM is full, the BO
> >>>can be read from GTT in the meantime and the BO can be moved to VRAM later.
> >>>
> >>>Thus, I have made it so that neither the command submission code nor page fault
> >>>handler spends time evicting BOs from visible VRAM, and instead this is
> >>>deferred to a workqueue function that's queued when CS requests BOs flagged
> >>>AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED.
> >>>
> >>>Speaking of CPU_ACCESS_REQUIRED, I've changed the handling of that flag so that
> >>>the kernel driver can clear it later even if it was set by userspace. This is
> >>>because the userspace graphics library can't know whether the application
> >>>really needs it to be CPU_ACCESS_REQUIRED forever. The kernel driver can't know
> >>>either, but it does know when page faults occur, and if a BO doesn't appear to
> >>>have any page faults when it's moved somewhere inaccessible, the flag can be
> >>>removed and it doesn't have to take up space in CPU-visible memory anymore.
> >>>This change was based on IRC discussions with Michel.
> >>>
> >>>Patch 7 fixes a problem with BO moverate throttling that causes visible VRAM
> >>>moves to not be throttled if total VRAM isn't full enough.
> >>>
> >>>I've also added a vis_vramlimit module parameter for debugging purposes. It's
> >>>similar to the vramlimit parameter except it limits only visible VRAM.
> >>>
> >>>I have tested this patch set with the two games I know to be affected by
> >>>visible VRAM pressure: DiRT Rally and Dying Light. It practically eliminates
> >>>eviction-related stuttering in DiRT Rally as well as very low performance if
> >>>visible VRAM is limited to 64MB. It also fixes severely low framerates that
> >>>occurred in some areas of Dying Light. All my testing was done with an R9 290
> >>>with 4GB of visible VRAM with an Intel i7 4790.
> >>>
> >>>--
> >>>John Brooks (Frogging101)
> >>>
> >>>_______________________________________________
> >>>amd-gfx mailing list
> >>>amd-gfx@lists.freedesktop.org
> >>>https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 
> 
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/9] drm/amdgpu: Set/clear CPU_ACCESS_REQUIRED flag on page fault and CS
  2017-06-24 18:00       ` Christian König
@ 2017-06-25  1:57         ` John Brooks
       [not found]         ` <55ea5e84-0791-5a70-6278-ade83c343a3b-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  1 sibling, 0 replies; 31+ messages in thread
From: John Brooks @ 2017-06-25  1:57 UTC (permalink / raw)
  To: Christian König; +Cc: Michel Dänzer, dri-devel, amd-gfx

On Sat, Jun 24, 2017 at 08:00:05PM +0200, Christian König wrote:
> Am 23.06.2017 um 19:39 schrieb John Brooks:
> >When the AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED flag is given by userspace,
> >it should only be treated as a hint to initially place a BO somewhere CPU
> >accessible, rather than having a permanent effect on BO placement.
> 
> And that is a clear NAK from my side.
> 
> CPU_ACCESS_REQUIRED is a permanent limitation to where the buffer should be
> placed.
> 
> Ignoring that and changing the IOCTL interface like this is clearly not
> acceptable.
> 
> Christian.
> 

In that case, I will remove this patch. But I will keep in some form the
safeguard against BOs oscillating between VRAM and GTT, as patch 3 allows page
faults to move BOs to GTT (more frequently than before, anyway). If userspace
has *not* marked said BOs with CPU_ACCESS_REQUIRED, amdgpu_cs_bo_validate will
move them to invisible VRAM, where they may promptly generate another page
fault. Cue framerate death. The same problem was seen last month in patch 2 of
the "Tweaks for high pressure on CPU visible VRAM" series.

John

> >
> >Instead of the flag being set in stone at BO creation, set the flag when a
> >page fault occurs so that it goes somewhere CPU-visible, and clear it when
> >the BO is requested by the GPU.
> >
> >However, clearing the CPU_ACCESS_REQUIRED flag may move a BO to invisible
> >VRAM, which is likely to cause a page fault that moves it right back to
> >GTT. When this happens too much, it is highly detrimental to performance.
> >Only clear the flag on CS if:
> >
> >- The BO wasn't page faulted for a certain amount of time (currently 10
> >   seconds, measured with jiffies), and
> >- its last page fault didn't occur too soon (currently 500ms) after its
> >   last CS request, or vice versa.
> >
> >Setting the flag in amdgpu_fault_reserve_notify() also means that we can
> >remove the loop to restrict lpfn to the end of visible VRAM, because
> >amdgpu_ttm_placement_init() will do it for us.
> >
> >Signed-off-by: John Brooks <john@fastquake.com>
> >---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     |  1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 46 ++++++++++++++++++++++--------
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
> >  3 files changed, 36 insertions(+), 12 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >index 2fad8bd..73d6882 100644
> >--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >@@ -320,6 +320,7 @@ static int amdgpu_cs_bo_validate(struct amdgpu_cs_parser *p,
> >  	else
> >  		domain = bo->allowed_domains;
> >+	amdgpu_bo_clear_cpu_access_required(bo);
> >  retry:
> >  	amdgpu_ttm_placement_from_domain(bo, domain);
> >  	initial_bytes_moved = atomic64_read(&adev->num_bytes_moved);
> >diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> >index 31d1f21..a7d48a7 100644
> >--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> >+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> >@@ -967,8 +967,8 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
> >  {
> >  	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->bdev);
> >  	struct amdgpu_bo *abo;
> >-	unsigned long offset, size, lpfn;
> >-	int i, r;
> >+	unsigned long offset, size;
> >+	int r;
> >  	if (!amdgpu_ttm_bo_is_amdgpu_bo(bo))
> >  		return 0;
> >@@ -991,18 +991,9 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
> >  	/* hurrah the memory is not visible ! */
> >  	atomic64_inc(&adev->num_vram_cpu_page_faults);
> >+	abo->flags |= AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
> >  	amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM |
> >  					 AMDGPU_GEM_DOMAIN_GTT);
> >-	lpfn =	adev->mc.visible_vram_size >> PAGE_SHIFT;
> >-	for (i = 0; i < abo->placement.num_placement; i++) {
> >-		/* Try to move the BO into visible VRAM */
> >-		if ((abo->placements[i].flags & TTM_PL_FLAG_VRAM) &&
> >-		    (!abo->placements[i].lpfn ||
> >-		     abo->placements[i].lpfn > lpfn))
> >-			abo->placements[i].lpfn = lpfn;
> >-	}
> >-	abo->placement.busy_placement = abo->placement.placement;
> >-	abo->placement.num_busy_placement = abo->placement.num_placement;
> >  	r = ttm_bo_validate(bo, &abo->placement, false, false);
> >  	if (unlikely(r != 0))
> >  		return r;
> >@@ -1057,3 +1048,34 @@ u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo)
> >  	return bo->tbo.offset;
> >  }
> >+
> >+/**
> >+ * amdgpu_bo_clear_cpu_access_required
> >+ * @bo: BO to update
> >+ *
> >+ * Clears CPU_ACCESS_REQUIRED flag if the BO hasn't had a page fault in a while
> >+ * and it didn't have a page fault too soon after the last time it was moved to
> >+ * VRAM.
> >+ *
> >+ * Caller should have bo reserved.
> >+ *
> >+ */
> >+void amdgpu_bo_clear_cpu_access_required(struct amdgpu_bo *bo)
> >+{
> >+	const unsigned int page_fault_timeout_ms = 10000;
> >+	const unsigned int min_period_ms = 500;
> >+	unsigned int ms_since_pf, period_ms;
> >+
> >+	ms_since_pf = jiffies_to_msecs(jiffies - bo->last_page_fault_jiffies);
> >+	period_ms = jiffies_to_msecs(abs(bo->last_page_fault_jiffies -
> >+					 bo->last_cs_move_jiffies));
> >+
> >+	/*
> >+	 * Try to avoid a revolving door between GTT and VRAM. Clearing the
> >+	 * flag may move this BO back to VRAM, so don't clear it if it's likely
> >+	 * to page fault and go right back to GTT.
> >+	 */
> >+	if ((!bo->last_page_fault_jiffies || !bo->last_cs_move_jiffies) ||
> >+	    (ms_since_pf > page_fault_timeout_ms && period_ms > min_period_ms))
> >+		bo->flags &= ~AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
> >+}
> >diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> >index 3824851..b0cb137 100644
> >--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> >+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> >@@ -182,6 +182,7 @@ int amdgpu_bo_restore_from_shadow(struct amdgpu_device *adev,
> >  				  struct reservation_object *resv,
> >  				  struct dma_fence **fence,
> >  				  bool direct);
> >+void amdgpu_bo_clear_cpu_access_required(struct amdgpu_bo *bo);
> >  /*
> 
> 
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/9] Visible VRAM Management Improvements
  2017-06-24 18:36       ` John Brooks
@ 2017-06-25 11:31         ` Christian König
  0 siblings, 0 replies; 31+ messages in thread
From: Christian König @ 2017-06-25 11:31 UTC (permalink / raw)
  To: John Brooks; +Cc: Michel Dänzer, dri-devel, amd-gfx

Am 24.06.2017 um 20:36 schrieb John Brooks:
> On Sat, Jun 24, 2017 at 08:07:15PM +0200, Christian König wrote:
>> Am 23.06.2017 um 19:39 schrieb John Brooks:
>>> This patch series is intended to improve performance when limited CPU-visible
>>> VRAM is under pressure.
>>>
>>> Moving BOs into visible VRAM is essentially a housekeeping task. It's faster to
>>> access them in VRAM than GTT, but it isn't a hard requirement for them to be in
>>> VRAM. As such, it is unnecessary to spend valuable time blocking on this in the
>>> page fault handler or during command submission. Doing so translates directly
>>> into a longer frame time (ergo stalls and stuttering).
>> Sorry, but that strongly sounds like you are messing with things you don't
>> fully understand.
>>
>> Blocking in the page fault handler is mandatory to handle the page fault
>> correctly. So that is not something we can change easily.
> I do understand that. Indeed, the page fault handler must block until the
> memory is accessible. But the memory doesn't have to be in visible VRAM to be
> accessible; it could also be in GTT. So, what I meant was that it does not have
> to block for the excessively long time it takes to move BOs into visible VRAM
> when it is already full. It could spend less time blocking by just moving it to
> GTT in that situation. I apologize for the poor wording.

Ah! Yeah that makes more sense, I could have read the source first 
before writing the mail.

Sorry just got goosebumps from the idea that you tried to avoid the 
blocking in the page fault handler :)

Christian.

>
> John
>
>> Regards,
>> Christian.
>>
>>> The problem worsens when attempting to move BOs into visible VRAM when it is
>>> full. This takes much longer than a simple move because other BOs have to be
>>> evicted, which involves finding and then moving potentially hundreds of other
>>> BOs, which is very time consuming. In the case of limited visible VRAM, it's
>>> important to do this sometime to keep the contents of visible VRAM fresh, but
>>> it does not need to be a blocking operation. If visible VRAM is full, the BO
>>> can be read from GTT in the meantime and the BO can be moved to VRAM later.
>>>
>>> Thus, I have made it so that neither the command submission code nor page fault
>>> handler spends time evicting BOs from visible VRAM, and instead this is
>>> deferred to a workqueue function that's queued when CS requests BOs flagged
>>> AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED.
>>>
>>> Speaking of CPU_ACCESS_REQUIRED, I've changed the handling of that flag so that
>>> the kernel driver can clear it later even if it was set by userspace. This is
>>> because the userspace graphics library can't know whether the application
>>> really needs it to be CPU_ACCESS_REQUIRED forever. The kernel driver can't know
>>> either, but it does know when page faults occur, and if a BO doesn't appear to
>>> have any page faults when it's moved somewhere inaccessible, the flag can be
>>> removed and it doesn't have to take up space in CPU-visible memory anymore.
>>> This change was based on IRC discussions with Michel.
>>>
>>> Patch 7 fixes a problem with BO moverate throttling that causes visible VRAM
>>> moves to not be throttled if total VRAM isn't full enough.
>>>
>>> I've also added a vis_vramlimit module parameter for debugging purposes. It's
>>> similar to the vramlimit parameter except it limits only visible VRAM.
>>>
>>> I have tested this patch set with the two games I know to be affected by
>>> visible VRAM pressure: DiRT Rally and Dying Light. It practically eliminates
>>> eviction-related stuttering in DiRT Rally as well as very low performance if
>>> visible VRAM is limited to 64MB. It also fixes severely low framerates that
>>> occurred in some areas of Dying Light. All my testing was done with an R9 290
>>> with 4GB of visible VRAM with an Intel i7 4790.
>>>
>>> --
>>> John Brooks (Frogging101)
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/9] Visible VRAM Management Improvements
  2017-06-24 21:50             ` John Brooks
@ 2017-06-25 11:54               ` Christian König
  0 siblings, 0 replies; 31+ messages in thread
From: Christian König @ 2017-06-25 11:54 UTC (permalink / raw)
  To: John Brooks; +Cc: Felix Kuehling, dri-devel, amd-gfx, Michel Dänzer

Am 24.06.2017 um 23:50 schrieb John Brooks:
> On Sat, Jun 24, 2017 at 08:20:22PM +0200, Christian König wrote:
>> Am 24.06.2017 um 01:16 schrieb John Brooks:
>>> On Fri, Jun 23, 2017 at 05:02:58PM -0400, Felix Kuehling wrote:
>>>> Hi John,
>>>>
>>>> I haven't read your patches. Just a question based on the cover letter.
>>>>
>>>> I understand that visible VRAM is the biggest pain point. But could the
>>>> same reasoning make sense for invisible VRAM? That is, doing all the
>>>> migrations to VRAM in a workqueue?
>>>>
>>>> Regards,
>>>>    Felix
>>>>
>>> I don't see why not. In theory, all non-essential buffer moves could be done
>>> this way, and it would be relatively trivial to extend it to that.
>>>
>>> But I wanted to limit the scope of my changes, at least for this series.
>>> Testing takes a long time and I wanted to focus those testing efforts as much
>>> as possible, produce something well-tested (I hope), and get feedback on this
>>> limited application of the concept before expanding its reach.
>> Yeah, sorry I have to say that but the whole approach is utterly nonsense.
>>
>> What happens is that the delayed BO can only be moved AFTER the command
>> submission which wants it to be in VRAM.
>>
>> So you use the BO in a CS and *then* move it to where the CS wants it to be,
>> no matter if the BO is then needed there or not.
>>
>> Regards,
>> Christian.
>>
> I'm aware of the effect it has. The BO won't be in VRAM for the current command
> submission, but it'll be there for a future one. If a BO is used at a given
> time, then it's likely it'll be used again soon.

Exactly that's the problem here. Keep in mind that BOs can only move in 
between command submissions.

So instead of moving the BO on the first command submission which needs 
it you most likely move it directly after it.

It is actually quite unlikely that a BO which was swapped out will be 
used for multiple command submissions in a row (well it was swapped out 
because it was unused for a while).

> In which case you'll come out
> ahead on latency even if the GPU has to read it from GTT a few times. In any
> case, it's never going to hurt as much as full-stop waiting for a worst-case BO
> move that needs a lot of evictions.
>
> Feel free to correct my understanding; you'd certainly know any of this better
> than I do. But my tests indicate that immediate forced moves during CS cause
> stalls, and the average framerate with delayed moves is the almost (~2%) the
> same as with immediate ones, which is about 9% higher than with no forced moves
> during CS at all.

Your understanding is incorrect.

The stalls are not caused by the BO moves itself, but rather by the 
ping/pong they result in and the intermediate waits this causes.

See memory bandwidth is usually not the problem, but swapping a BO out 
needs to be reflected in all VMs the BO is using. This adds extra 
dependencies in between command submissions, so instead of rendering and 
copying at the same time you get everything serialized.

I'm pretty sure that this is actually the root cause of the issues you 
run into here.

Beside limiting those ping/pongs to something reasonable per second 
another possible solution would be to make more room for CPU accessible 
VRAM BOs. I will take a look into this next week, Felix is kicking me 
for this for a while anyway.

Regards,
Christian.

>
> DiRT Rally average framerates:
>      With the whole patch set (n=3):
>          89.56
>      Without it (drm-next-4.13 5ac55629d6b3fcde69f46aa772c6e83be0bdcbbf)
>      (n=3):
>          91.16 (+stalls)
>      Patches 1 and 3 only, and with GTT set as the only busy placement for
>      CPU_ACCESS_REQUIRED BOs in amdgpu_cs_bo_validate (n=3):
>          82.15
>
> John
>
>>> John
>>>
>>>> On 17-06-23 01:39 PM, John Brooks wrote:
>>>>> This patch series is intended to improve performance when limited CPU-visible
>>>>> VRAM is under pressure.
>>>>>
>>>>> Moving BOs into visible VRAM is essentially a housekeeping task. It's faster to
>>>>> access them in VRAM than GTT, but it isn't a hard requirement for them to be in
>>>>> VRAM. As such, it is unnecessary to spend valuable time blocking on this in the
>>>>> page fault handler or during command submission. Doing so translates directly
>>>>> into a longer frame time (ergo stalls and stuttering).
>>>>>
>>>>> The problem worsens when attempting to move BOs into visible VRAM when it is
>>>>> full. This takes much longer than a simple move because other BOs have to be
>>>>> evicted, which involves finding and then moving potentially hundreds of other
>>>>> BOs, which is very time consuming. In the case of limited visible VRAM, it's
>>>>> important to do this sometime to keep the contents of visible VRAM fresh, but
>>>>> it does not need to be a blocking operation. If visible VRAM is full, the BO
>>>>> can be read from GTT in the meantime and the BO can be moved to VRAM later.
>>>>>
>>>>> Thus, I have made it so that neither the command submission code nor page fault
>>>>> handler spends time evicting BOs from visible VRAM, and instead this is
>>>>> deferred to a workqueue function that's queued when CS requests BOs flagged
>>>>> AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED.
>>>>>
>>>>> Speaking of CPU_ACCESS_REQUIRED, I've changed the handling of that flag so that
>>>>> the kernel driver can clear it later even if it was set by userspace. This is
>>>>> because the userspace graphics library can't know whether the application
>>>>> really needs it to be CPU_ACCESS_REQUIRED forever. The kernel driver can't know
>>>>> either, but it does know when page faults occur, and if a BO doesn't appear to
>>>>> have any page faults when it's moved somewhere inaccessible, the flag can be
>>>>> removed and it doesn't have to take up space in CPU-visible memory anymore.
>>>>> This change was based on IRC discussions with Michel.
>>>>>
>>>>> Patch 7 fixes a problem with BO moverate throttling that causes visible VRAM
>>>>> moves to not be throttled if total VRAM isn't full enough.
>>>>>
>>>>> I've also added a vis_vramlimit module parameter for debugging purposes. It's
>>>>> similar to the vramlimit parameter except it limits only visible VRAM.
>>>>>
>>>>> I have tested this patch set with the two games I know to be affected by
>>>>> visible VRAM pressure: DiRT Rally and Dying Light. It practically eliminates
>>>>> eviction-related stuttering in DiRT Rally as well as very low performance if
>>>>> visible VRAM is limited to 64MB. It also fixes severely low framerates that
>>>>> occurred in some areas of Dying Light. All my testing was done with an R9 290
>>>>> with 4GB of visible VRAM with an Intel i7 4790.
>>>>>
>>>>> --
>>>>> John Brooks (Frogging101)
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/9] drm/amdgpu: Set/clear CPU_ACCESS_REQUIRED flag on page fault and CS
       [not found]         ` <55ea5e84-0791-5a70-6278-ade83c343a3b-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-06-26  9:27           ` Michel Dänzer
       [not found]             ` <6c6fca21-df95-a413-d5eb-c05f1913787b-otUistvHUpPR7s880joybQ@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Michel Dänzer @ 2017-06-26  9:27 UTC (permalink / raw)
  To: Christian König, John Brooks, Marek Olšák
  Cc: David Airlie, dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 25/06/17 03:00 AM, Christian König wrote:
> Am 23.06.2017 um 19:39 schrieb John Brooks:
>> When the AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED flag is given by
>> userspace,
>> it should only be treated as a hint to initially place a BO somewhere CPU
>> accessible, rather than having a permanent effect on BO placement.
> 
> And that is a clear NAK from my side.
> 
> CPU_ACCESS_REQUIRED is a permanent limitation to where the buffer should
> be placed.

It really can't be more than a hint. The userspace driver cannot
reliably know ahead of time whether a BO will be accessed by the CPU at
all, let alone how often. A BO which incorrectly has this flag set
creates artificial pressure on CPU visible VRAM.


> Ignoring that and changing the IOCTL interface like this is clearly not
> acceptable.

I'd say it's more adapting the semantics of the flag to reality. :)


>> Instead of the flag being set in stone at BO creation, set the flag
>> when a page fault occurs so that it goes somewhere CPU-visible, and clear it
>> when the BO is requested by the GPU.

My idea was to only clear the flag when the BO is moved from GTT to
VRAM. That way, any BO which was ever accessed by the CPU since it was
last moved to VRAM will be preferably put in the CPU visible part of VRAM.

Not sure which way is better though.


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] drm/amdgpu: Don't force BOs into visible VRAM for page faults
       [not found]   ` <1498239580-17360-4-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
@ 2017-06-26  9:38     ` Michel Dänzer
       [not found]       ` <f399c192-d90d-9f43-9b8a-820fa51a7715-otUistvHUpPR7s880joybQ@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Michel Dänzer @ 2017-06-26  9:38 UTC (permalink / raw)
  To: John Brooks, Marek Olšák
  Cc: David Airlie, Christian König,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 24/06/17 02:39 AM, John Brooks wrote:
> There is no need for page faults to force BOs into visible VRAM if it's
> full, and the time it takes to do so is great enough to cause noticeable
> stuttering. Add GTT as a possible placement so that if visible VRAM is
> full, page faults move BOs to GTT instead of evicting other BOs from VRAM.
> 
> Signed-off-by: John Brooks <john@fastquake.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 244a7d3..751bc05 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -981,10 +981,11 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
>  
>  	/* hurrah the memory is not visible ! */
>  	atomic64_inc(&adev->num_vram_cpu_page_faults);
> -	amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM);
> +	amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM |
> +					 AMDGPU_GEM_DOMAIN_GTT);
>  	lpfn =	adev->mc.visible_vram_size >> PAGE_SHIFT;
>  	for (i = 0; i < abo->placement.num_placement; i++) {
> -		/* Force into visible VRAM */
> +		/* Try to move the BO into visible VRAM */
>  		if ((abo->placements[i].flags & TTM_PL_FLAG_VRAM) &&
>  		    (!abo->placements[i].lpfn ||
>  		     abo->placements[i].lpfn > lpfn))
> @@ -993,16 +994,13 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
>  	abo->placement.busy_placement = abo->placement.placement;
>  	abo->placement.num_busy_placement = abo->placement.num_placement;
>  	r = ttm_bo_validate(bo, &abo->placement, false, false);
> -	if (unlikely(r == -ENOMEM)) {
> -		amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
> -		return ttm_bo_validate(bo, &abo->placement, false, false);
> -	} else if (unlikely(r != 0)) {
> +	if (unlikely(r != 0))
>  		return r;
> -	}
>  
>  	offset = bo->mem.start << PAGE_SHIFT;
>  	/* this should never happen */
> -	if ((offset + size) > adev->mc.visible_vram_size)
> +	if (bo->mem.mem_type == TTM_PL_VRAM &&
> +	    (offset + size) > adev->mc.visible_vram_size)
>  		return -EINVAL;
>  
>  	return 0;
> 

AFAICT this is essentially the same as
https://patchwork.freedesktop.org/patch/156993/ . I retracted that patch
due to it causing Rally Dirt performance degradation for Marek.
Presumably other patches in this series compensate for that, but at the
least this patch should be moved to the end of the series.


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/9] drm/amdgpu: Throttle visible VRAM moves separately
       [not found]     ` <1498239580-17360-8-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
@ 2017-06-26  9:44       ` Michel Dänzer
       [not found]         ` <c132d211-bb7c-1e7d-617a-6f128343a581-otUistvHUpPR7s880joybQ@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Michel Dänzer @ 2017-06-26  9:44 UTC (permalink / raw)
  To: John Brooks, Marek Olšák
  Cc: David Airlie, Christian König,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 24/06/17 02:39 AM, John Brooks wrote:
> The BO move throttling code is designed to allow VRAM to fill quickly if it
> is relatively empty. However, this does not take into account situations
> where the visible VRAM is smaller than total VRAM, and total VRAM may not
> be close to full but the visible VRAM segment is under pressure. In such
> situations, visible VRAM would experience unrestricted swapping and
> performance would drop.
> 
> Add a separate counter specifically for moves involving visible VRAM, and
> check it before moving BOs there.
> 
> Fixes: 95844d20ae02 (drm/amdgpu: throttle buffer migrations at CS using a fixed MBps limit (v2))
> Signed-off-by: John Brooks <john@fastquake.com>

Something like this patch is definitely needed, good catch.

However, as is one issue is that it incurs CPU overhead even when all of
VRAM is CPU visible. Can that be avoided somehow?


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/9] drm/amdgpu: Add vis_vramlimit module parameter
  2017-06-23 17:39   ` [PATCH 2/9] drm/amdgpu: Add vis_vramlimit module parameter John Brooks
@ 2017-06-26  9:48     ` Michel Dänzer
  2017-06-26  9:57     ` Christian König
  1 sibling, 0 replies; 31+ messages in thread
From: Michel Dänzer @ 2017-06-26  9:48 UTC (permalink / raw)
  To: John Brooks, Marek Olšák; +Cc: dri-devel, amd-gfx

On 24/06/17 02:39 AM, John Brooks wrote:
> Allow specifying a limit on visible VRAM via a module parameter. This is
> helpful for testing performance under visible VRAM pressure.
> 
> Signed-off-by: John Brooks <john@fastquake.com>

Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/9] drm/amdgpu: Add vis_vramlimit module parameter
  2017-06-23 17:39   ` [PATCH 2/9] drm/amdgpu: Add vis_vramlimit module parameter John Brooks
  2017-06-26  9:48     ` Michel Dänzer
@ 2017-06-26  9:57     ` Christian König
  1 sibling, 0 replies; 31+ messages in thread
From: Christian König @ 2017-06-26  9:57 UTC (permalink / raw)
  To: John Brooks, amd-gfx, Michel Dänzer, Marek Olšák; +Cc: dri-devel

Am 23.06.2017 um 19:39 schrieb John Brooks:
> Allow specifying a limit on visible VRAM via a module parameter. This is
> helpful for testing performance under visible VRAM pressure.
>
> Signed-off-by: John Brooks <john@fastquake.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 ++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 8 ++++++++
>   3 files changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index c407d2d..fe73633 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -74,6 +74,7 @@
>    */
>   extern int amdgpu_modeset;
>   extern int amdgpu_vram_limit;
> +extern int amdgpu_vis_vram_limit;
>   extern int amdgpu_gart_size;
>   extern int amdgpu_moverate;
>   extern int amdgpu_benchmarking;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 4c7c262..a14f458 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -73,6 +73,7 @@
>   #define KMS_DRIVER_PATCHLEVEL	0
>   
>   int amdgpu_vram_limit = 0;
> +int amdgpu_vis_vram_limit = 0;
>   int amdgpu_gart_size = -1; /* auto */
>   int amdgpu_moverate = -1; /* auto */
>   int amdgpu_benchmarking = 0;
> @@ -119,6 +120,9 @@ int amdgpu_lbpw = -1;
>   MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");
>   module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
>   
> +MODULE_PARM_DESC(vis_vramlimit, "Restrict visible VRAM for testing, in megabytes");
> +module_param_named(vis_vramlimit, amdgpu_vis_vram_limit, int, 0444);
> +
>   MODULE_PARM_DESC(gartsize, "Size of PCIE/IGP gart to setup in megabytes (32, 64, etc., -1 = auto)");
>   module_param_named(gartsize, amdgpu_gart_size, int, 0600);
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index c9b131b..0676a78 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1095,6 +1095,7 @@ static struct ttm_bo_driver amdgpu_bo_driver = {
>   int amdgpu_ttm_init(struct amdgpu_device *adev)
>   {
>   	int r;
> +	u64 vis_vram_limit;
>   
>   	r = amdgpu_ttm_global_init(adev);
>   	if (r) {
> @@ -1118,6 +1119,13 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>   		DRM_ERROR("Failed initializing VRAM heap.\n");
>   		return r;
>   	}
> +
> +	/* Reduce size of CPU-visible VRAM if requested */
> +	vis_vram_limit = amdgpu_vis_vram_limit * 1024 * 1024;

You need a 64bit cast here, otherwise you can get an overrun.

Apart from that the patch is Reviewed-by: Christian König 
<christian.koenig@amd.com>.

Regards,
Christian.

> +	if (amdgpu_vis_vram_limit > 0 &&
> +	    vis_vram_limit <= adev->mc.visible_vram_size)
> +		adev->mc.visible_vram_size = vis_vram_limit;
> +
>   	/* Change the size here instead of the init above so only lpfn is affected */
>   	amdgpu_ttm_set_active_vram_size(adev, adev->mc.visible_vram_size);
>   


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/9] drm/amdgpu: Throttle visible VRAM moves separately
       [not found]         ` <c132d211-bb7c-1e7d-617a-6f128343a581-otUistvHUpPR7s880joybQ@public.gmane.org>
@ 2017-06-26 22:29           ` John Brooks
  2017-06-27  8:25             ` Michel Dänzer
  0 siblings, 1 reply; 31+ messages in thread
From: John Brooks @ 2017-06-26 22:29 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: David Airlie, Christian König,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Marek Olšák

On Mon, Jun 26, 2017 at 06:44:30PM +0900, Michel Dänzer wrote:
> On 24/06/17 02:39 AM, John Brooks wrote:
> > The BO move throttling code is designed to allow VRAM to fill quickly if it
> > is relatively empty. However, this does not take into account situations
> > where the visible VRAM is smaller than total VRAM, and total VRAM may not
> > be close to full but the visible VRAM segment is under pressure. In such
> > situations, visible VRAM would experience unrestricted swapping and
> > performance would drop.
> > 
> > Add a separate counter specifically for moves involving visible VRAM, and
> > check it before moving BOs there.
> > 
> > Fixes: 95844d20ae02 (drm/amdgpu: throttle buffer migrations at CS using a fixed MBps limit (v2))
> > Signed-off-by: John Brooks <john@fastquake.com>
> 
> Something like this patch is definitely needed, good catch.
> 
> However, as is one issue is that it incurs CPU overhead even when all of
> VRAM is CPU visible. Can that be avoided somehow?
> 

I guess that depends which part of it you consider to be expensive.

bytes_moved_vis_threshold isn't used unless visible VRAM is smaller than total
VRAM, so any work/instructions that go into computing it could be skipped in
that case (at the cost of checking that visible_vram_size < real_vram_size)
Would that help?

John

> 
> -- 
> Earthling Michel Dänzer               |               http://www.amd.com
> Libre software enthusiast             |             Mesa and X developer
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/9] drm/amdgpu: Set/clear CPU_ACCESS_REQUIRED flag on page fault and CS
       [not found]             ` <6c6fca21-df95-a413-d5eb-c05f1913787b-otUistvHUpPR7s880joybQ@public.gmane.org>
@ 2017-06-26 23:25               ` Marek Olšák
  0 siblings, 0 replies; 31+ messages in thread
From: Marek Olšák @ 2017-06-26 23:25 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: David Airlie, John Brooks, Christian König, dri-devel,
	amd-gfx mailing list

On Mon, Jun 26, 2017 at 11:27 AM, Michel Dänzer <michel@daenzer.net> wrote:
> On 25/06/17 03:00 AM, Christian König wrote:
>> Am 23.06.2017 um 19:39 schrieb John Brooks:
>>> When the AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED flag is given by
>>> userspace,
>>> it should only be treated as a hint to initially place a BO somewhere CPU
>>> accessible, rather than having a permanent effect on BO placement.
>>
>> And that is a clear NAK from my side.
>>
>> CPU_ACCESS_REQUIRED is a permanent limitation to where the buffer should
>> be placed.
>
> It really can't be more than a hint. The userspace driver cannot
> reliably know ahead of time whether a BO will be accessed by the CPU at
> all, let alone how often. A BO which incorrectly has this flag set
> creates artificial pressure on CPU visible VRAM.

I also think the flag is only a hint and shouldn't be taken too seriously.

Only AMDGPU_GEM_CREATE_NO_CPU_ACCESS has a strict behavior.

Marek
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] drm/amdgpu: Don't force BOs into visible VRAM for page faults
       [not found]       ` <f399c192-d90d-9f43-9b8a-820fa51a7715-otUistvHUpPR7s880joybQ@public.gmane.org>
@ 2017-06-27  3:25         ` John Brooks
  0 siblings, 0 replies; 31+ messages in thread
From: John Brooks @ 2017-06-27  3:25 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: David Airlie, Christian König,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Marek Olšák

On Mon, Jun 26, 2017 at 06:38:29PM +0900, Michel Dänzer wrote:
> On 24/06/17 02:39 AM, John Brooks wrote:
> > There is no need for page faults to force BOs into visible VRAM if it's
> > full, and the time it takes to do so is great enough to cause noticeable
> > stuttering. Add GTT as a possible placement so that if visible VRAM is
> > full, page faults move BOs to GTT instead of evicting other BOs from VRAM.
> > 
> > Signed-off-by: John Brooks <john@fastquake.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 14 ++++++--------
> >  1 file changed, 6 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> > index 244a7d3..751bc05 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> > @@ -981,10 +981,11 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
> >  
> >  	/* hurrah the memory is not visible ! */
> >  	atomic64_inc(&adev->num_vram_cpu_page_faults);
> > -	amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM);
> > +	amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM |
> > +					 AMDGPU_GEM_DOMAIN_GTT);
> >  	lpfn =	adev->mc.visible_vram_size >> PAGE_SHIFT;
> >  	for (i = 0; i < abo->placement.num_placement; i++) {
> > -		/* Force into visible VRAM */
> > +		/* Try to move the BO into visible VRAM */
> >  		if ((abo->placements[i].flags & TTM_PL_FLAG_VRAM) &&
> >  		    (!abo->placements[i].lpfn ||
> >  		     abo->placements[i].lpfn > lpfn))
> > @@ -993,16 +994,13 @@ int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
> >  	abo->placement.busy_placement = abo->placement.placement;
> >  	abo->placement.num_busy_placement = abo->placement.num_placement;
> >  	r = ttm_bo_validate(bo, &abo->placement, false, false);
> > -	if (unlikely(r == -ENOMEM)) {
> > -		amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
> > -		return ttm_bo_validate(bo, &abo->placement, false, false);
> > -	} else if (unlikely(r != 0)) {
> > +	if (unlikely(r != 0))
> >  		return r;
> > -	}
> >  
> >  	offset = bo->mem.start << PAGE_SHIFT;
> >  	/* this should never happen */
> > -	if ((offset + size) > adev->mc.visible_vram_size)
> > +	if (bo->mem.mem_type == TTM_PL_VRAM &&
> > +	    (offset + size) > adev->mc.visible_vram_size)
> >  		return -EINVAL;
> >  
> >  	return 0;
> > 
> 
> AFAICT this is essentially the same as
> https://patchwork.freedesktop.org/patch/156993/ . I retracted that patch
> due to it causing Rally Dirt performance degradation for Marek.
> Presumably other patches in this series compensate for that, but at the
> least this patch should be moved to the end of the series.

Yeah, it did end up pretty much the same as your patch :p

The reason this causes the performance degredation is that if a page fault
moves a BO to GTT, and it's *not* marked CPU_ACCESS_REQUIRED, then CS will move
it to invisible VRAM. And then another page fault happens. And this repeats
fast enough that the BO is constantly moving back and forth between GTT and
VRAM.

Patch 6 (Set/clear CPU_ACCESS_REQUIRED flag on page fault and CS) takes care of
it by not allowing BOs to go to invisible VRAM until after a timeout and only
if it didn't have a page fault too soon after the last time it was moved to
VRAM. This was implemented by setting the CPU_ACCESS_REQUIRED flag on page
faults and not clearing it unless those conditions are met. It doesn't
necessarily need to involve the flag, depending on where we decide to go with
that.

John

> 
> 
> -- 
> Earthling Michel Dänzer               |               http://www.amd.com
> Libre software enthusiast             |             Mesa and X developer
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/9] drm/amdgpu: Throttle visible VRAM moves separately
  2017-06-26 22:29           ` John Brooks
@ 2017-06-27  8:25             ` Michel Dänzer
  0 siblings, 0 replies; 31+ messages in thread
From: Michel Dänzer @ 2017-06-27  8:25 UTC (permalink / raw)
  To: John Brooks; +Cc: amd-gfx, dri-devel

On 27/06/17 07:29 AM, John Brooks wrote:
> On Mon, Jun 26, 2017 at 06:44:30PM +0900, Michel Dänzer wrote:
>> On 24/06/17 02:39 AM, John Brooks wrote:
>>> The BO move throttling code is designed to allow VRAM to fill quickly if it
>>> is relatively empty. However, this does not take into account situations
>>> where the visible VRAM is smaller than total VRAM, and total VRAM may not
>>> be close to full but the visible VRAM segment is under pressure. In such
>>> situations, visible VRAM would experience unrestricted swapping and
>>> performance would drop.
>>>
>>> Add a separate counter specifically for moves involving visible VRAM, and
>>> check it before moving BOs there.
>>>
>>> Fixes: 95844d20ae02 (drm/amdgpu: throttle buffer migrations at CS using a fixed MBps limit (v2))
>>> Signed-off-by: John Brooks <john@fastquake.com>
>>
>> Something like this patch is definitely needed, good catch.
>>
>> However, as is one issue is that it incurs CPU overhead even when all of
>> VRAM is CPU visible. Can that be avoided somehow?
>>
> 
> I guess that depends which part of it you consider to be expensive.
> 
> bytes_moved_vis_threshold isn't used unless visible VRAM is smaller than total
> VRAM, so any work/instructions that go into computing it could be skipped in
> that case (at the cost of checking that visible_vram_size < real_vram_size)
> Would that help?

I think so. E.g. in amdgpu_cs_get_threshold_for_moves, it should be
possible to handle most of this in a single if (visible_vram_size <
real_vram_size) block, which should be mostly free in the
visible_vram_size == real_vram_size case thanks to branch prediction.

Not sure there's much that can be done elsewhere though.


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-06-27  8:25 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-06-23 17:39 [PATCH 0/9] Visible VRAM Management Improvements John Brooks
2017-06-23 17:39 ` [PATCH 3/9] drm/amdgpu: Don't force BOs into visible VRAM for page faults John Brooks
     [not found]   ` <1498239580-17360-4-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
2017-06-26  9:38     ` Michel Dänzer
     [not found]       ` <f399c192-d90d-9f43-9b8a-820fa51a7715-otUistvHUpPR7s880joybQ@public.gmane.org>
2017-06-27  3:25         ` John Brooks
2017-06-23 17:39 ` [PATCH 5/9] drm/amdgpu: Track time of last page fault and last CS move in struct amdgpu_bo John Brooks
     [not found] ` <1498239580-17360-1-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
2017-06-23 17:39   ` [PATCH 1/9] drm/amdgpu: Separate placements and busy placements John Brooks
2017-06-23 17:39   ` [PATCH 2/9] drm/amdgpu: Add vis_vramlimit module parameter John Brooks
2017-06-26  9:48     ` Michel Dänzer
2017-06-26  9:57     ` Christian König
2017-06-23 17:39   ` [PATCH 4/9] drm/amdgpu: Don't force BOs into visible VRAM if they can go to GTT instead John Brooks
     [not found]     ` <1498239580-17360-5-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
2017-06-24 18:09       ` Christian König
     [not found]         ` <0c5064f9-5b84-8833-b410-055b5e2064bf-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-06-24 18:37           ` John Brooks
2017-06-23 17:39   ` [PATCH 6/9] drm/amdgpu: Set/clear CPU_ACCESS_REQUIRED flag on page fault and CS John Brooks
     [not found]     ` <1498239580-17360-7-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
2017-06-24 18:00       ` Christian König
2017-06-25  1:57         ` John Brooks
     [not found]         ` <55ea5e84-0791-5a70-6278-ade83c343a3b-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-06-26  9:27           ` Michel Dänzer
     [not found]             ` <6c6fca21-df95-a413-d5eb-c05f1913787b-otUistvHUpPR7s880joybQ@public.gmane.org>
2017-06-26 23:25               ` Marek Olšák
2017-06-23 17:39   ` [PATCH 7/9] drm/amdgpu: Throttle visible VRAM moves separately John Brooks
     [not found]     ` <1498239580-17360-8-git-send-email-john-xq/Ko7C6e2Bl57MIdRCFDg@public.gmane.org>
2017-06-26  9:44       ` Michel Dänzer
     [not found]         ` <c132d211-bb7c-1e7d-617a-6f128343a581-otUistvHUpPR7s880joybQ@public.gmane.org>
2017-06-26 22:29           ` John Brooks
2017-06-27  8:25             ` Michel Dänzer
2017-06-23 17:39   ` [PATCH 8/9] drm/amdgpu: Asynchronously move BOs to visible VRAM John Brooks
2017-06-23 21:02   ` [PATCH 0/9] Visible VRAM Management Improvements Felix Kuehling
     [not found]     ` <82339d2d-481c-ab3f-1590-ab22f0eac371-5C7GfCeVMHo@public.gmane.org>
2017-06-23 23:16       ` John Brooks
2017-06-24 18:20         ` Christian König
     [not found]           ` <644cf9b4-e22b-eab1-a505-b0e1f9850f82-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-06-24 21:50             ` John Brooks
2017-06-25 11:54               ` Christian König
2017-06-24 18:07   ` Christian König
     [not found]     ` <3cd916a7-6734-5eff-b645-66f3ee83f13a-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-06-24 18:36       ` John Brooks
2017-06-25 11:31         ` Christian König
2017-06-23 17:39 ` [PATCH 9/9] drm/amdgpu: Reduce lock contention when evicting from visible VRAM John Brooks

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).