Intel-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Matthew Auld <matthew.auld@intel.com>
To: "Christian König" <christian.koenig@amd.com>,
	dri-devel@lists.freedesktop.org,
	"Intel Graphics" <intel-gfx@lists.freedesktop.org>,
	"Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Subject: Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
Date: Wed, 31 Aug 2022 11:37:09 +0100	[thread overview]
Message-ID: <55d88b45-6986-de38-d574-d0ce7d06c62e@intel.com> (raw)
In-Reply-To: <9c7cc7ea-dd30-6df5-3f06-97a6c6d254e0@amd.com>

On 31/08/2022 10:38, Christian König wrote:
> Am 31.08.22 um 11:26 schrieb Matthew Auld:
>> On 31/08/2022 09:16, Christian König wrote:
>>> Hi Matthew,
>>>
>>> Am 30.08.22 um 12:45 schrieb Matthew Auld:
>>>> Hi,
>>>>
>>>> On 30/08/2022 08:33, Christian König wrote:
>>>>> Hi guys,
>>>>>
>>>>> can we get an rb/acked-by for this i915 change?
>>>>>
>>>>> Basically we are just making sure that the driver doesn't crash 
>>>>> when bo->resource is NULL and a bo doesn't have any backing store 
>>>>> assigned to it.
>>>>>
>>>>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>>>>> it is correct.
>>>>
>>>> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
>>>> loading the module:
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7Ccaca567b3279492450fd08da8b32e598%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637975347967354305%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=apanfOjzSWD2vduINzr2j6F2DiIBC93hLRnnGJcGQ5o%3D&amp;reserved=0?
>>>>
>>>> According to the logs the firmware is failing to load, so perhaps 
>>>> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
>>>> users. See below.
>>>>
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>>>>> From: Christian König <christian.koenig@amd.com>
>>>>>>
>>>>>> Make sure we can at least move and alloc TT objects without 
>>>>>> backing store.
>>>>>>
>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>>> @@ -271,8 +271,6 @@ static struct ttm_tt 
>>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>>   {
>>>>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>>>>> typeof(*i915),
>>>>>>                                bdev);
>>>>>> -    struct ttm_resource_manager *man =
>>>>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>>>>       unsigned long ccs_pages = 0;
>>>>>>       enum ttm_caching caching;
>>>>>> @@ -286,8 +284,8 @@ static struct ttm_tt 
>>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>>       if (!i915_tt)
>>>>>>           return NULL;
>>>>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>>>>> -        man->use_tt)
>>>>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>>>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>>>>
>>>> AFAICT i915 was massively relying on everything starting out in a 
>>>> "dummy" system memory resource (ttm_tt), where it then later 
>>>> "transitions" to the real resource. And if we need to clear the 
>>>> memory we rely on ZERO_ALLOC being set before calling into the 
>>>> i915_ttm_move() callback (even when allocating local-memory).
>>>>
>>>> For ttm_bo_type_device objects (userspace stuff) it looks like this 
>>>> was previously handled by ttm_bo_validate() always doing:
>>>>
>>>>   ret = ttm_tt_create(bo, true); /* clear = true */
>>>>
>>>> Which we would always hit since the resource was always "compatible" 
>>>> for the dummy case. But it looks like this is no longer even called, 
>>>> since we can now call into ttm_move with bo->resource == NULL, which 
>>>> still calls tt_create eventually, which not always with clear = true.
>>>>
>>>> All other objects are then ttm_bo_type_kernel so we don't care about 
>>>> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
>>>> handled as per above in i915_ttm_tt_create(). But I think here 
>>>> bo->resource is NULL at the start when first creating the object, 
>>>> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>>>>
>>>> The other possible concern (not sure since CI didn't get that far) 
>>>> is around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
>>>> NULL. It looks like i915_ttm_shrink() was relying on that to 
>>>> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
>>>> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
>>>> re-populate it and then potentially move it back to local-memory.
>>>>
>>>> What are your thoughts here? Also sorry if i915 is making a bit of 
>>>> mess here.
>>>
>>> First of all thanks a lot for taking a look. We previously got 
>>> reports about strange crashes with this patch, but couldn't really 
>>> reproduce them (even not by sending them out again). This explains 
>>> that a bit.
>>>
>>> The simplest solution would be to just change the && into a ||, e.g. 
>>> when previously either no resource is allocated or the resource 
>>> requires to use a tt we clear it.
>>>
>>> That should give you the same behavior as before, but I agree that 
>>> this is a bit messy.
>>
>> Yeah, that should do the trick.
>>
>> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
>> shmem ttm_tt and is calling ttm_bo_validate() with empty placements to 
>> force the pipeline-gutting path, which importantly unpopulates the 
>> ttm_tt for us (since ttm_tt_unpopulate is not exported it seems). But 
>> AFAICT it looks like that will now also nuke the bo->resource, instead 
>> of just leaving it in system memory. My assumption is that when later 
>> calling ttm_bo_validate(), it will just do the bo_move_null() in 
>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>> potentially copying it back to local-memory?
> 
> Well you do ttm_bo_validate() with something like GTT domain, don't you? 
> This should result in re-populating the tt object, but I'm not 100% sure 
> if that really works as expected.

AFAIK for domains we either have system memory (which uses ttm_tt and 
might be shmem underneath) or local-memory. But perhaps i915 is doing 
something wrong here, or abusing TTM in some way. I'm not sure tbh.

Anyway, I think we have two cases here:

- We have some system memory only object. After doing i915_ttm_shrink(), 
bo->resource is now NULL. We then call ttm_bo_validate() at some later 
point, but here we don't need to copy anything, but it also looks like 
ttm_bo_handle_move_mem() won't populate the ttm_tt or us either, since 
mem_type == TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care 
of this, but now we just call ttm_bo_move_null().

- We have a local-memory only object, which was evicted to shmem, and 
then swapped out by the shrinker like above. The bo->resource is NULL. 
However this time when calling ttm_bo_validate() we need to actually do 
a copy in i915_ttm_move(), as well as re-populate the ttm_tt. 
i915_ttm_move() was taking care of this, but now we just call 
ttm_bo_move_null().

Perhaps i915 is doing something wrong in the above two cases?

> 
> Thanks,
> Christian.
> 
>>
>>>
>>> I've been considering to replacing the ttm_bo_type with a bunch of 
>>> behavior flags for a bo. I'm hoping that this will clean things up a 
>>> bit.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>> *bo, bool evict,
>>>>>>       bool clear;
>>>>>>       int ret;
>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>           return 0;
>>>>>>       }
>>>>>
>>>
> 

  reply	other threads:[~2022-08-31 10:37 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-24 14:23 [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage Luben Tuikov
2022-08-24 14:23 ` [Intel-gfx] [PATCH 2/3] drm/ttm: stop allocating dummy resources during BO creation Luben Tuikov
2022-08-24 14:23 ` [Intel-gfx] [PATCH 3/3] drm/ttm: stop allocating a dummy resource for pipelined gutting Luben Tuikov
2022-08-24 16:21 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for series starting with [1/3] drm/i915: audit bo->resource usage Patchwork
2022-08-24 16:44 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2022-08-30  7:33 ` [Intel-gfx] [PATCH 1/3] " Christian König
2022-08-30 10:45   ` Matthew Auld
2022-08-31  8:16     ` Christian König
2022-08-31  9:26       ` Matthew Auld
2022-08-31  9:38         ` Christian König
2022-08-31 10:37           ` Matthew Auld [this message]
2022-08-31 11:03             ` Christian König
2022-08-31 12:06               ` Matthew Auld
2022-08-31 12:35                 ` Christian König
2022-08-31 12:50                   ` Matthew Auld
2022-08-31 13:34                     ` Christian König
2022-08-31 14:53                       ` Matthew Auld
2022-08-31 16:32                         ` Matthew Auld
2022-09-01  8:00                           ` Christian König
2022-09-01 12:52                             ` Matthew Auld
2022-09-01 17:48                       ` Thomas Hellström
2022-09-01  8:43 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for series starting with [1/3] drm/i915: audit bo->resource usage (rev2) Patchwork
  -- strict thread matches above, loose matches on Subject: below --
2022-07-12 11:46 [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage Christian König
2022-07-13 10:08 ` Matthew Auld

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55d88b45-6986-de38-d574-d0ce7d06c62e@intel.com \
    --to=matthew.auld@intel.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=luben.tuikov@amd.com \
    --cc=thomas.hellstrom@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox