From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <989f101b-270d-45cb-ace0-475f295c2365@linux.intel.com>
Date: Tue, 27 Aug 2024 11:45:14 +0200
Subject: Re: [RFC PATCH] drm/xe/lnl: Implement clear-on-free for pooled BOs
From: Nirmoy Das
To: Matthew Auld, Thomas Hellström, Nirmoy Das, intel-xe@lists.freedesktop.org
Cc: Matthew Brost
References: <20240822124244.10554-1-nirmoy.das@intel.com> <7645111403a453311b16ff2b11d49cb63a74518f.camel@linux.intel.com>
List-Id: Intel Xe graphics driver

On 8/27/2024 11:04 AM, Matthew Auld wrote:
> On 26/08/2024 09:36, Thomas Hellström wrote:
>> On Mon, 2024-08-26 at 10:26 +0200, Nirmoy Das wrote:
>>> Hi Thomas,
>>>
>>> On 8/23/2024 11:38 AM, Thomas Hellström wrote:
>>>> Hi, Nirmoy,
>>>>
>>>> On Thu, 2024-08-22 at 14:42 +0200, Nirmoy Das wrote:
>>>>> Implement GPU clear-on-free for pooled system pages in Xe.
>>>>>
>>>>> Ensure proper use of TTM_TT_FLAG_CLEARED_ON_FREE by leveraging
>>>>> ttm_device_funcs.release_notify() for GPU clear-on-free. If the GPU
>>>>> clear fails, xe_ttm_tt_unpopulate() will fall back to a CPU clear.
>>>>>
>>>>> Clear-on-free is only relevant for pooled pages, as the driver needs
>>>>> to give back those pages.
>>>>> So do clear-on-free only for such BOs and
>>>>> keep doing clear-on-alloc for ttm_cached type BOs.
>>>>>
>>>>> Cc: Matthew Auld
>>>>> Cc: Matthew Brost
>>>>> Cc: Thomas Hellström
>>>>> Signed-off-by: Nirmoy Das
>>>>
>>>> While this would probably work, I don't immediately see the benefit
>>>> over CPU clearing, since we have no way of combining this with the
>>>> CCS clear, right?
>>>
>>> If Xe/TTM could do clear-on-free (data + CCS) with the GPU all the
>>> time, then I think we could skip CCS clearing on alloc, assuming that
>>> only GPU access modifies a CCS state and that the CCS region is
>>> zeroed on boot. I think that can't be guaranteed, so we have to clear
>>> the CCS on alloc. I agree, there won't be much latency benefit in
>>> doing clear-on-free for CCS devices. I will still try to run some
>>> tests to validate it; I haven't done that for this RFC.
>>
>> OK, yes this would probably work. Do we need to clear all CCS on
>> module load, or can we safely assume that no useful info is left in
>> the CCS memory at that time?
>
> Maybe another thing to check is if we can skip ccs clearing for wb,
> assuming we don't already.

Still on my todo list.

> I think compression is always coh_none (not compatible with wb),

Yes, the HW treats such (1- and 2-way coherent) access as
compression-disabled.

> and I think that to copy the ccs directly would require mapping the
> src with compression enabled.

Yes, for igfx.

Regards,
Nirmoy

> I'm pretty sure that is the case, otherwise userptr/imported external
> dma-buf would be an issue today, assuming the user could manually copy
> the uncleared ccs.
>
>>>
>>> I've discussed this with Ron, and it seems there is an ongoing
>>> conversation about whether there is a way to avoid CCS clearing if
>>> the data is zeroed.
>>>
>>> Let's see how that goes.
>>>
>>>> So the clearing latency will most probably be increased,
>>>> but the bo releasing thread won't see that because the waiting for
>>>> the clear is offloaded to the TTM delayed destroy mechanism.
>>>>
>>>> Also, once we've dropped the gem refcount to zero, the gem members
>>>> of the object, including bo_move, are strictly not valid anymore and
>>>> shouldn't be used.
>>>
>>> Could you please expand on this? I am not seeing the connection
>>> between bo_move and the refcount.
>>>
>>> Are you saying release_notify is not the right place to do this?
>>
>> Yes. At release_notify, the gem refcount has dropped to zero, and we
>> don't allow calling bo_move at that point, as the driver might want to
>> do some cleanup in gem_release before putting the last ttm_bo
>> reference.
>>
>> Thanks,
>> Thomas
>>
>>>> If we want to try to improve freeing latency by offloading the
>>>> clearing on free to a separate CPU thread, though, maybe we could
>>>> discuss with Christian to always (or if a flag in the ttm device
>>>> requests it) take the TTM delayed destruction path for bos with
>>>> pooled pages, rather than freeing them synchronously, something
>>>> along the lines of:
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> index 320592435252..fca69ec1740d 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> @@ -271,7 +271,7 @@ static void ttm_bo_release(struct kref *kref)
>>>>  
>>>>  		if (!dma_resv_test_signaled(bo->base.resv, DMA_RESV_USAGE_BOOKKEEP) ||
>>>> -		    (want_init_on_free() && (bo->ttm != NULL)) ||
>>>> +		    (bo->ttm && (want_init_on_free() || bo->ttm->caching != ttm_cached)) ||
>>>>  		    bo->type == ttm_bo_type_sg ||
>>>>  		    !dma_resv_trylock(bo->base.resv)) {
>>>>  			/* The BO is not idle, resurrect it for delayed destroy */
>>>>
>>>> Would ofc require some substantial proven latency gain, though.
>>>> Overall system cpu usage would probably not improve.
>>>
>>> I will run some tests with the above change and get back.
>>>
>>>
>>> Thanks,
>>>
>>> Nirmoy
>>>
>>>>
>>>> /Thomas
>>>>
>>>>
>>>>> ---
>>>>>  drivers/gpu/drm/xe/xe_bo.c | 101 +++++++++++++++++++++++++++++++++----
>>>>>  1 file changed, 91 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>>>>> index 6ed0e1955215..e7bc74f8ae82 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>> @@ -283,6 +283,8 @@ struct xe_ttm_tt {
>>>>>  	struct device *dev;
>>>>>  	struct sg_table sgt;
>>>>>  	struct sg_table *sg;
>>>>> +	bool sys_clear_on_free;
>>>>> +	bool sys_clear_on_alloc;
>>>>>  };
>>>>>  
>>>>>  static int xe_tt_map_sg(struct ttm_tt *tt)
>>>>> @@ -401,8 +403,23 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
>>>>>  	 * flag. Zeroed pages are only required for ttm_bo_type_device so
>>>>>  	 * unwanted data is not leaked to userspace.
>>>>>  	 */
>>>>> -	if (ttm_bo->type == ttm_bo_type_device && xe->mem.gpu_page_clear_sys)
>>>>> -		page_flags |= TTM_TT_FLAG_CLEARED_ON_FREE;
>>>>> +	if (ttm_bo->type == ttm_bo_type_device && xe->mem.gpu_page_clear_sys) {
>>>>> +		/*
>>>>> +		 * Non-pooled BOs are always clear on alloc when possible.
>>>>> +		 * clear-on-free is not needed as there is no pool to give pages back.
>>>>> +		 */
>>>>> +		if (caching == ttm_cached) {
>>>>> +			tt->sys_clear_on_alloc = true;
>>>>> +			tt->sys_clear_on_free = false;
>>>>> +		} else {
>>>>> +			/*
>>>>> +			 * For pooled BO, clear-on-alloc is done by the CPU for now and
>>>>> +			 * GPU will do clear on free when releasing the BO.
>>>>> +			 */
>>>>> +			tt->sys_clear_on_alloc = false;
>>>>> +			tt->sys_clear_on_free = true;
>>>>> +		}
>>>>> +	}
>>>>>  
>>>>>  	err = ttm_tt_init(&tt->ttm, &bo->ttm, page_flags, caching, extra_pages);
>>>>>  	if (err) {
>>>>> @@ -416,8 +433,10 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
>>>>>  static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
>>>>>  			      struct ttm_operation_ctx *ctx)
>>>>>  {
>>>>> +	struct xe_ttm_tt *xe_tt;
>>>>>  	int err;
>>>>>  
>>>>> +	xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
>>>>>  	/*
>>>>>  	 * dma-bufs are not populated with pages, and the dma-
>>>>>  	 * addresses are set up when moved to XE_PL_TT.
>>>>> @@ -426,7 +445,7 @@ static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
>>>>>  		return 0;
>>>>>  
>>>>>  	/* Clear TTM_TT_FLAG_ZERO_ALLOC when GPU is set to clear system pages */
>>>>> -	if (tt->page_flags & TTM_TT_FLAG_CLEARED_ON_FREE)
>>>>> +	if (xe_tt->sys_clear_on_alloc)
>>>>>  		tt->page_flags &= ~TTM_TT_FLAG_ZERO_ALLOC;
>>>>>  
>>>>>  	err = ttm_pool_alloc(&ttm_dev->pool, tt, ctx);
>>>>> @@ -438,11 +457,19 @@ static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
>>>>>  
>>>>>  static void xe_ttm_tt_unpopulate(struct ttm_device *ttm_dev, struct ttm_tt *tt)
>>>>>  {
>>>>> +	struct xe_ttm_tt *xe_tt;
>>>>> +
>>>>> +	xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
>>>>> +
>>>>>  	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
>>>>>  		return;
>>>>>  
>>>>>  	xe_tt_unmap_sg(tt);
>>>>>  
>>>>> +	/* Hint TTM pool that pages are already cleared */
>>>>> +	if (xe_tt->sys_clear_on_free)
>>>>> +		tt->page_flags |= TTM_TT_FLAG_CLEARED_ON_FREE;
>>>>> +
>>>>>  	return ttm_pool_free(&ttm_dev->pool, tt);
>>>>>  }
>>>>>  
>>>>> @@ -664,6 +691,7 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
>>>>>  	struct ttm_resource *old_mem = ttm_bo->resource;
>>>>>  	u32 old_mem_type = old_mem ? old_mem->mem_type : XE_PL_SYSTEM;
>>>>>  	struct ttm_tt *ttm = ttm_bo->ttm;
>>>>> +	struct xe_ttm_tt *xe_tt;
>>>>>  	struct xe_migrate *migrate = NULL;
>>>>>  	struct dma_fence *fence;
>>>>>  	bool move_lacks_source;
>>>>> @@ -674,12 +702,13 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
>>>>>  	bool clear_system_pages;
>>>>>  	int ret = 0;
>>>>>  
>>>>> +	xe_tt = container_of(ttm_bo->ttm, struct xe_ttm_tt, ttm);
>>>>>  	/*
>>>>>  	 * Clear TTM_TT_FLAG_CLEARED_ON_FREE on bo creation path when
>>>>>  	 * moving to system as the bo doesn't have dma_mapping.
>>>>>  	 */
>>>>>  	if (!old_mem && ttm && !ttm_tt_is_populated(ttm))
>>>>> -		ttm->page_flags &= ~TTM_TT_FLAG_CLEARED_ON_FREE;
>>>>> +		xe_tt->sys_clear_on_alloc = false;
>>>>>  
>>>>>  	/* Bo creation path, moving to system or TT. */
>>>>>  	if ((!old_mem && ttm) && !handle_system_ccs) {
>>>>> @@ -703,10 +732,9 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
>>>>>  	move_lacks_source = handle_system_ccs ? (!bo->ccs_cleared) :
>>>>>  						(!mem_type_is_vram(old_mem_type) && !tt_has_data);
>>>>>  
>>>>> -	clear_system_pages = ttm && (ttm->page_flags & TTM_TT_FLAG_CLEARED_ON_FREE);
>>>>> +	clear_system_pages = ttm && xe_tt->sys_clear_on_alloc;
>>>>>  	needs_clear = (ttm && ttm->page_flags & TTM_TT_FLAG_ZERO_ALLOC) ||
>>>>> -		(!ttm && ttm_bo->type == ttm_bo_type_device) ||
>>>>> -		clear_system_pages;
>>>>> +		(!ttm && ttm_bo->type == ttm_bo_type_device) || clear_system_pages;
>>>>>  
>>>>>  	if (new_mem->mem_type == XE_PL_TT) {
>>>>>  		ret = xe_tt_map_sg(ttm);
>>>>> @@ -1028,10 +1056,47 @@ static bool xe_ttm_bo_lock_in_destructor(struct ttm_buffer_object *ttm_bo)
>>>>>  	return locked;
>>>>>  }
>>>>>  
>>>>> +static struct dma_fence *xe_ttm_bo_clear_on_free(struct ttm_buffer_object *ttm_bo)
>>>>> +{
>>>>> +	struct xe_bo *bo = ttm_to_xe_bo(ttm_bo);
>>>>> +	struct xe_device *xe = xe_bo_device(bo);
>>>>> +	struct xe_migrate *migrate;
>>>>> +	struct xe_ttm_tt *xe_tt;
>>>>> +	struct dma_fence *clear_fence;
>>>>> +
>>>>> +	/* return early if nothing to clear */
>>>>> +	if (!ttm_bo->ttm)
>>>>> +		return NULL;
>>>>> +
>>>>> +	xe_tt = container_of(ttm_bo->ttm, struct xe_ttm_tt, ttm);
>>>>> +	/* return early if nothing to clear */
>>>>> +	if (!xe_tt->sys_clear_on_free || !bo->ttm.resource)
>>>>> +		return NULL;
>>>>> +
>>>>> +	if (XE_WARN_ON(!xe_tt->sg))
>>>>> +		return NULL;
>>>>> +
>>>>> +	if (bo->tile)
>>>>> +		migrate = bo->tile->migrate;
>>>>> +	else
>>>>> +		migrate = xe->tiles[0].migrate;
>>>>> +
>>>>> +	xe_assert(xe, migrate);
>>>>> +
>>>>> +	clear_fence = xe_migrate_clear(migrate, bo, bo->ttm.resource,
>>>>> +				       XE_MIGRATE_CLEAR_FLAG_FULL);
>>>>> +	if (IS_ERR(clear_fence))
>>>>> +		return NULL;
>>>>> +
>>>>> +	xe_tt->sys_clear_on_free = false;
>>>>> +
>>>>> +	return clear_fence;
>>>>> +}
>>>>> +
>>>>>  static void xe_ttm_bo_release_notify(struct ttm_buffer_object *ttm_bo)
>>>>>  {
>>>>>  	struct dma_resv_iter cursor;
>>>>> -	struct dma_fence *fence;
>>>>> +	struct dma_fence *clear_fence, *fence;
>>>>>  	struct dma_fence *replacement = NULL;
>>>>>  	struct xe_bo *bo;
>>>>>  
>>>>> @@ -1041,15 +1106,31 @@ static void xe_ttm_bo_release_notify(struct ttm_buffer_object *ttm_bo)
>>>>>  	bo = ttm_to_xe_bo(ttm_bo);
>>>>>  	xe_assert(xe_bo_device(bo), !(bo->created && kref_read(&ttm_bo->base.refcount)));
>>>>>  
>>>>> +	clear_fence = xe_ttm_bo_clear_on_free(ttm_bo);
>>>>> +
>>>>>  	/*
>>>>>  	 * Corner case where TTM fails to allocate memory and this BOs resv
>>>>>  	 * still points the VMs resv
>>>>>  	 */
>>>>> -	if (ttm_bo->base.resv != &ttm_bo->base._resv)
>>>>> +	if (ttm_bo->base.resv != &ttm_bo->base._resv) {
>>>>> +		if (clear_fence)
>>>>> +			dma_fence_wait(clear_fence, false);
>>>>>  		return;
>>>>> +	}
>>>>>  
>>>>> -	if (!xe_ttm_bo_lock_in_destructor(ttm_bo))
>>>>> +	if (!xe_ttm_bo_lock_in_destructor(ttm_bo)) {
>>>>> +		if (clear_fence)
>>>>> +			dma_fence_wait(clear_fence, false);
>>>>>  		return;
>>>>> +	}
>>>>> +
>>>>> +	if (clear_fence) {
>>>>> +		if (dma_resv_reserve_fences(ttm_bo->base.resv, 1))
>>>>> +			dma_fence_wait(clear_fence, false);
>>>>> +		else
>>>>> +			dma_resv_add_fence(ttm_bo->base.resv, clear_fence,
>>>>> +					   DMA_RESV_USAGE_KERNEL);
>>>>> +	}
>>>>>  
>>>>>  	/*
>>>>>  	 * Scrub the preempt fences if any. The unbind fence is already
>>