Re: [PATCH v4 4/5] drm/amdgpu: use bulk moves for efficient VM LRU handling (v4)

From: Huang Rui <ray.huang-5C7GfCeVMHo@public.gmane.org>
To: "Christian König" <christian.koenig-5C7GfCeVMHo@public.gmane.org>
Cc: "dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Subject: Re: [PATCH v4 4/5] drm/amdgpu: use bulk moves for efficient VM LRU handling (v4)
Date: Wed, 22 Aug 2018 15:29:33 +0800
Message-ID: <20180822072931.GD28364@hr-amur2>
In-Reply-To: <20180822033101.GB13834@hr-amur2>

On Wed, Aug 22, 2018 at 11:31:02AM +0800, Huang Rui wrote:
> On Tue, Aug 21, 2018 at 03:54:28PM +0200, Christian König wrote:
> > Am 21.08.2018 um 15:43 schrieb Huang Rui:
> > >On Mon, Aug 20, 2018 at 09:17:12PM +0800, Christian König wrote:
> > >>Am 20.08.2018 um 08:05 schrieb Huang Rui:
> > >>>On Fri, Aug 17, 2018 at 06:38:16PM +0800, Koenig, Christian wrote:
> > >>>>Am 17.08.2018 um 12:08 schrieb Huang Rui:
> > >>>>>I am continuing the bulk-move work based on Christian's proposal.
> > >>>>>
> > >>>>>Background:
> > >>>>>The amdgpu driver moves all PD/PT and PerVM BOs onto the idle list and then
> > >>>>>moves each of them to the end of the LRU list one by one. This causes a large
> > >>>>>number of individual LRU moves and seriously impacts performance.
> > >>>>>
> > >>>>>Christian then provided a workaround that does not move PD/PT BOs on the LRU,
> > >>>>>with the patch below:
> > >>>>>"drm/amdgpu: band aid validating VM PTs"
> > >>>>>Commit 0bbf32026cf5ba41e9922b30e26e1bed1ecd38ae
> > >>>>>
> > >>>>>However, the final solution should bulk move all PD/PT and PerVM BOs on the LRU
> > >>>>>instead of one by one.
> > >>>>>
> > >>>>>Whenever amdgpu_vm_validate_pt_bos() is called and we have BOs which need to be
> > >>>>>validated we move all BOs together to the end of the LRU without dropping the
> > >>>>>lock for the LRU.
> > >>>>>
> > >>>>>While doing so we note the beginning and end of this block in the LRU list.
> > >>>>>
> > >>>>>Now when amdgpu_vm_validate_pt_bos() is called and we don't have anything to do,
> > >>>>>we don't move every BO one by one, but instead cut the LRU list into pieces so
> > >>>>>that we bulk move everything to the end in just one operation.
> > >>>>>
> > >>>>>Test data:
> > >>>>>+--------------+-----------------+-----------+---------------------------------------+
> > >>>>>|              |The Talos        |Clpeak(OCL)|BusSpeedReadback(OCL)                  |
> > >>>>>|              |Principle(Vulkan)|           |                                       |
> > >>>>>+------------------------------------------------------------------------------------+
> > >>>>>|              |                 |           |0.319 ms(1k) 0.314 ms(2K) 0.308 ms(4K) |
> > >>>>>| Original     |  147.7 FPS      |  76.86 us |0.307 ms(8K) 0.310 ms(16K)             |
> > >>>>>+------------------------------------------------------------------------------------+
> > >>>>>| Original + WA|                 |           |0.254 ms(1K) 0.241 ms(2K)              |
> > >>>>>|(don't move   |  162.1 FPS      |  42.15 us |0.230 ms(4K) 0.223 ms(8K) 0.204 ms(16K)|
> > >>>>>|PT BOs on LRU)|                 |           |                                       |
> > >>>>>+------------------------------------------------------------------------------------+
> > >>>>>| Bulk move    |  163.1 FPS      |  40.52 us |0.244 ms(1K) 0.252 ms(2K) 0.213 ms(4K) |
> > >>>>>|              |                 |           |0.214 ms(8K) 0.225 ms(16K)             |
> > >>>>>+--------------+-----------------+-----------+---------------------------------------+
> > >>>>>
> > >>>>>Tested with the three benchmarks above, covering Vulkan and OpenCL. We can see
> > >>>>>a visible improvement over the original, and results even better than the
> > >>>>>original with the workaround.
> > >>>>>
> > >>>>>v2: move all BOs, including those on the idle, relocated, and moved lists, to
> > >>>>>the end of the LRU and put them together.
> > >>>>>v3: remove the unused parameter and use list_for_each_entry instead of the
> > >>>>>_safe variant.
> > >>>>>v4: move the amdgpu_vm_move_to_lru_tail call after command submission; at that
> > >>>>>point all BOs will be back on the idle list.
> > >>>>>
> > >>>>>Signed-off-by: Christian König <christian.koenig@amd.com>
> > >>>>>Signed-off-by: Huang Rui <ray.huang@amd.com>
> > >>>>>Tested-by: Mike Lothian <mike@fireburn.co.uk>
> > >>>>>Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
> > >>>>>Acked-by: Chunming Zhou <david1.zhou@amd.com>
> > >>>>>---
> > >>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 11 ++++++
> > >>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 71 ++++++++++++++++++++++++++--------
> > >>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 11 +++++-
> > >>>>>    3 files changed, 75 insertions(+), 18 deletions(-)
> > >>>>>
> > >>>>>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > >>>>>index 502b94f..9fbdf02 100644
> > >>>>>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > >>>>>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > >>>>>@@ -1260,6 +1260,16 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
> > >>>>>    	return 0;
> > >>>>>    }
> > >>>>>+static void amdgpu_cs_vm_move_on_lru(struct amdgpu_device *adev,
> > >>>>>+				     struct amdgpu_cs_parser *p)
> > >>>>>+{
> > >>>>>+	struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
> > >>>>>+	struct amdgpu_vm *vm = &fpriv->vm;
> > >>>>>+
> > >>>>>+	if (vm->validated)
> > >>>>That check belongs inside amdgpu_vm_move_to_lru_tail().
> > >>>>
> > >>>>>+		amdgpu_vm_move_to_lru_tail(adev, vm);
> > >>>>>+}
> > >>>>>+
> > >>>>>    int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
> > >>>>>    {
> > >>>>>    	struct amdgpu_device *adev = dev->dev_private;
> > >>>>>@@ -1310,6 +1320,7 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
> > >>>>>    	r = amdgpu_cs_submit(&parser, cs);
> > >>>>>+	amdgpu_cs_vm_move_on_lru(adev, &parser);
> > >>>>>    out:
> > >>>>>    	amdgpu_cs_parser_fini(&parser, r, reserved_buffers);
> > >>>>>    	return r;
> > >>>>>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > >>>>>index 9c84770..037cfbc 100644
> > >>>>>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > >>>>>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > >>>>>@@ -268,6 +268,53 @@ void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm,
> > >>>>>    }
> > >>>>>    /**
> > >>>>>+ * amdgpu_vm_move_to_lru_tail_by_list - move one list of BOs to end of LRU
> > >>>>>+ *
> > >>>>>+ * @vm: vm providing the BOs
> > >>>>>+ * @list: the list that stored BOs
> > >>>>>+ *
> > >>>>>+ * Move one list of BOs to the end of LRU and update the positions.
> > >>>>>+ */
> > >>>>>+static void
> > >>>>>+amdgpu_vm_move_to_lru_tail_by_list(struct amdgpu_vm *vm, struct list_head *list)
> > >>>>I don't see much of a point having a separate function for this any more.
> > >>>>
> > >>>>>+{
> > >>>>>+	struct amdgpu_vm_bo_base *bo_base;
> > >>>>>+
> > >>>>>+	list_for_each_entry(bo_base, list, vm_status) {
> > >>>>>+		struct amdgpu_bo *bo = bo_base->bo;
> > >>>>>+
> > >>>>>+		if (!bo->parent)
> > >>>>>+			continue;
> > >>>>>+
> > >>>>>+		ttm_bo_move_to_lru_tail(&bo->tbo, &vm->lru_bulk_move);
> > >>>>>+		if (bo->shadow)
> > >>>>>+			ttm_bo_move_to_lru_tail(&bo->shadow->tbo,
> > >>>>>+						&vm->lru_bulk_move);
> > >>>>>+	}
> > >>>>>+}
> > >>>>>+
> > >>>>>+/**
> > >>>>>+ * amdgpu_vm_move_to_lru_tail - move all BOs to the end of LRU
> > >>>>>+ *
> > >>>>>+ * @adev: amdgpu device pointer
> > >>>>>+ * @vm: vm providing the BOs
> > >>>>>+ *
> > >>>>>+ * Move all BOs to the end of LRU and remember their positions to put them
> > >>>>>+ * together.
> > >>>>>+ */
> > >>>>>+void amdgpu_vm_move_to_lru_tail(struct amdgpu_device *adev,
> > >>>>>+				struct amdgpu_vm *vm)
> > >>>>>+{
> > >>>>>+	struct ttm_bo_global *glob = adev->mman.bdev.glob;
> > >>>>>+
> > >>>>>+	memset(&vm->lru_bulk_move, 0, sizeof(vm->lru_bulk_move));
> > >>>>>+
> > >>>>>+	spin_lock(&glob->lru_lock);
> > >>>>>+	amdgpu_vm_move_to_lru_tail_by_list(vm, &vm->idle);
> > >>>>>+	spin_unlock(&glob->lru_lock);
> > >>>>>+}
> > >>>>>+
> > >>>>>+/**
> > >>>>>     * amdgpu_vm_validate_pt_bos - validate the page table BOs
> > >>>>>     *
> > >>>>>     * @adev: amdgpu device pointer
> > >>>>>@@ -288,6 +335,7 @@ int amdgpu_vm_validate_pt_bos(struct amdgpu_device *adev, struct amdgpu_vm *vm,
> > >>>>>    	struct amdgpu_vm_bo_base *bo_base, *tmp;
> > >>>>>    	int r = 0;
> > >>>>>+	vm->validated = false;
> > >>>>That won't work like this. It is perfectly possible that CS is aborted
> > >>>>because of a signal and amdgpu_vm_move_to_lru_tail() never called.
> > >>>>
> > >>>>I suggest to do "vm->validated |= !list_empty(&vm->evicted);" here and
> > >>>>then set it to false again in amdgpu_vm_move_to_lru_tail().
> > >>>>
> > >>>>And maybe we need a better name than "validated", maybe "bulk_moveable"
> > >>>>or something like that.
> > >>>>
> > >>>Actually, "validated" is the opposite of "bulk_moveable".
> > >>>So how about using "vm->bulk_moveable = list_empty(&vm->evicted);" here?
> > >>That still won't work correctly. See, the validation can be interrupted.
> > >>
> > >>So we need something like "vm->bulk_moveable &=
> > >>list_empty(&vm->evicted);" here.
> > >Thanks. I found that if we use vm->bulk_moveable, we get list corruption:
> > >
> > >[ 2763.834228] ------------[ cut here ]------------
> > >[ 2763.839221] list_del corruption. prev->next should be ffff9137dae028f8, but was ffffa59b82003ae0
> > >[ 2763.848330] WARNING: CPU: 1 PID: 281 at lib/list_debug.c:53 __list_del_entry_valid+0x7c/0xa0
> > >...
> > >[ 2764.053257] Call Trace:
> > >[ 2764.055748]  ttm_bo_del_from_lru+0x77/0xc0 [ttm]
> > >[ 2764.060437]  ttm_bo_release+0x23a/0x2c0 [ttm]
> > >[ 2764.064924]  amdgpu_bo_destroy+0x93/0x140 [amdgpu]
> > >[ 2764.069769]  ttm_bo_release_list+0x113/0x170 [ttm]
> > >[ 2764.074695]  amdgpu_bo_destroy+0x93/0x140 [amdgpu]
> > >[ 2764.079544]  ttm_bo_release_list+0x113/0x170 [ttm]
> > >
> > >However, if we use vm->bulk_not_moveable (like vm->validated), the issue is gone.
> > >I haven't confirmed the reason yet.
> > 
> > Make sure that you don't try to move the root PD as well. E.g. test
> > for bo->parent and skip if it is NULL.
> > 
> 
> Yes, I have skipped the root PD, but the issue still exists.
> 
> list_for_each_entry(bo_base, &vm->idle, vm_status) {
>         struct amdgpu_bo *bo = bo_base->bo;
> 
>         if (!bo->parent)
>                 continue;
> 
>         ttm_bo_move_to_lru_tail(&bo->tbo, &vm->lru_bulk_move);
>         if (bo->shadow)
>                 ttm_bo_move_to_lru_tail(&bo->shadow->tbo,
>                                         &vm->lru_bulk_move);
> }
> 
> Let me continue to narrow down this issue.
> 

I found the cause of this corruption.
The initial value of vm->bulk_moveable is false. After
"vm->bulk_moveable &= list_empty(&vm->evicted);", bulk_moveable therefore stays
false even when the evicted list is empty at that time, which is not what we
expect.

So if we set vm->bulk_moveable to true in amdgpu_vm_init(), the issue is fixed.
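
To illustrate the flag behaviour described above, here is a minimal userspace
sketch. It is not the driver code: struct vm_sketch, evicted_empty and the
helper functions are hypothetical stand-ins for the amdgpu_vm fields and for
list_empty(&vm->evicted). It only shows that "&=" can never turn a flag that
starts out false back to true.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant amdgpu_vm state. */
struct vm_sketch {
	bool bulk_moveable;	/* flag discussed in this thread */
	bool evicted_empty;	/* stand-in for list_empty(&vm->evicted) */
};

/* Mirrors "vm->bulk_moveable &= list_empty(&vm->evicted);". */
static void vm_sketch_validate(struct vm_sketch *vm)
{
	vm->bulk_moveable &= vm->evicted_empty;
}

/* Return the flag after one validation pass for a given initial value. */
static bool flag_after_validate(bool initial, bool evicted_empty)
{
	struct vm_sketch vm = {
		.bulk_moveable = initial,
		.evicted_empty = evicted_empty,
	};

	vm_sketch_validate(&vm);
	return vm.bulk_moveable;
}

static void vm_sketch_selfcheck(void)
{
	/* Zero-initialized flag: stays false even with nothing evicted. */
	assert(!flag_after_validate(false, true));
	/* Flag set to true at init: tracks the evicted list as intended. */
	assert(flag_after_validate(true, true));
	assert(!flag_after_validate(true, false));
}
```

Calling vm_sketch_selfcheck() from a test program exercises the three cases;
the first one corresponds to the zero-initialized vm in the failing setup, the
other two to a vm whose flag is set to true at init.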

Thanks,
Ray