From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <5831c125-3b32-05fe-a195-082f6231b4e5@linux.intel.com>
Date: Wed, 13 Dec 2023 18:27:44 +0100
Subject: Re: [Intel-xe] LLC configuration, mmap and bo cache management questions
From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
To: Tvrtko Ursulin, "Hellstrom, Thomas", intel-xe@lists.freedesktop.org
Cc: "Auld, Matthew", "Santa, Carlos", "Vivi, Rodrigo"

On 12/13/23 12:55, Tvrtko Ursulin wrote:
>
> On 07/12/2023 12:01, Thomas Hellström wrote:
>>
>> On 12/7/23 12:11, Tvrtko Ursulin wrote:
>>>
>>> On 06/12/2023 11:46, Hellstrom, Thomas wrote:
>>>> Hi, Tvrtko.
>>>>
>>>> On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote:
>>>>>
>>>>> Hi Thomas,
>>>>>
>>>>> On 06/12/2023 08:26, Hellstrom, Thomas wrote:
>>>>>> On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are working on adding xe support to ChromeOS minigbm and have a
>>>>>>> couple of questions.
>>>>>>>
>>>>>>> If I follow things correctly, with xe the mmap caching mode is fixed
>>>>>>> to the object caching mode set at bo create. For framebuffers it
>>>>>>> will be WC, and for the rest userspace can choose WB or WC via
>>>>>>> drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot be
>>>>>>> used at all.)
>>>>>>>
>>>>>>> AFAICT minigbm basically cares about two transition points. Let's
>>>>>>> call them CPU access begin and end.
>>>>>>>
>>>>>>> 1)
>>>>>>> When a bo is mmapped it wants to invalidate the cache, which looks
>>>>>>> to be about making sure all GPU writes have landed in the backing
>>>>>>> store. In the i915 world that translates to the set_domain ioctl.
>>>>>>>
>>>>>>> What is the uapi for this with xe, or is it somehow guaranteed to
>>>>>>> not be needed?
>>>>>>
>>>>>> Signalling a user-fence or dma-fence obtained as an out-fence from an
>>>>>> exec call will guarantee GPU caches are flushed. Currently I don't
>>>>>> think there is anything like gem wait in the uAPI, although Matt is
>>>>>> just about to add functionality to wait on all outstanding work on an
>>>>>> exec_queue.
>>>>>
>>>>> The problem I see is that there are no execs, and therefore no fences,
>>>>> in the minigbm ABI. It's just buffers, created or imported, CPU access
>>>>> and some other stuff.
>>>>>
>>>>> And it is quite extensively used in the OS, so I assume it has to work
>>>>> (I mean the invalidation/flushing was not put in there for no reason).
>>>>> In other words, where the i915 backend today does
>>>>> DRM_I915_GEM_SET_DOMAIN on "cpu access begin", which is buffer based,
>>>>> I am not clear how to implement that with xe.
>>>>>
>>>>> For the outstanding work you mention, since you say it is about
>>>>> exec_queue, I assume again it will not work purely with buffers? If so
>>>>> it probably won't be useful for minigbm.
>>>>>
>>>>> Also, if I look at all the other minigbm backends, I see a mix of
>>>>> behaviours:
>>>>>
>>>>>    * msm and vc4 appear to not concern themselves with any of this.
>>>>>
>>>>>    * rockchip appears to be doing full bounce buffering via memcpy on
>>>>> CPU access begin/end.
>>>>>
>>>>>    * i915 and amdgpu respectively use
>>>>> DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer based,
>>>>> not execution queue based). Amdgpu curiously does not do any flushing
>>>>> on CPU access end.
>>>>>
>>>>> Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU access
>>>>> begin and clflushing on end were added to fix various CTS test
>>>>> failures. So I guess we could also wait and see what happens there. If
>>>>> those or some others start failing with xe too, then propose adding
>>>>> some new uapi. Or if manual testing starts reporting visual corruption
>>>>> in the UI elements or such.
>>>>
>>>> Indeed it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to
>>>> AMDGPU, for this use-case. I'll bring that up for discussion.
>>>
>>> Thanks!
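[Illustration, not part of the original exchange: a minimal sketch of the
buffer-based "cpu access begin" the i915 minigbm backend performs via the
DRM_I915_GEM_SET_DOMAIN ioctl referenced above. The helper name and the exact
domain selection are assumptions for illustration, not the actual minigbm
code.]

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Wait for outstanding GPU writes on the bo and move it to a CPU-visible
 * domain before userspace reads through its mapping.  A WC mapping uses
 * the GTT domain, a cached mapping the CPU domain. */
static int bo_cpu_access_begin(int drm_fd, uint32_t handle, bool wc_mapping)
{
	struct drm_i915_gem_set_domain args;

	memset(&args, 0, sizeof(args));
	args.handle = handle;
	args.read_domains = wc_mapping ? I915_GEM_DOMAIN_GTT : I915_GEM_DOMAIN_CPU;
	args.write_domain = args.read_domains;

	return ioctl(drm_fd, DRM_IOCTL_I915_GEM_SET_DOMAIN, &args);
}

An xe equivalent would presumably be buffer-handle based in the same way,
which is what the DRM_XE_GEM_WAIT_IDLE idea above is about.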
>>>
>>>>>>> 2)
>>>>>>> When a bo is unmapped, or CPU access is finished, it wants to flush
>>>>>>> the CPU caches. That is /almost/ completely a CPU operation, where
>>>>>>> it just needs to either clflush or invalidate the WC buffer
>>>>>>> respectively, if not for the fact that clflush can be skipped on
>>>>>>> platforms with LLC.
>>>>>>>
>>>>>>> I did not see an equivalent of I915_PARAM_HAS_LLC in xe? Did I miss
>>>>>>> it, or what is the plan for querying this detail?
>>>>>>
>>>>>> XeKMD is generally coherent, except if UMD selects a GPU PAT index
>>>>>> with limited coherency together with WB instead of WC memory. In that
>>>>>> case, UMD is responsible for doing the needed CLFLUSH-ing, whereas
>>>>>> KMD only ensures the initial clearing of the pages is CLFLUSHed for
>>>>>> security reasons.
>>>>>>
>>>>>> I'm not 100% sure if UMD can actually select WB with a limited
>>>>>> coherency PAT index in the initial uAPI revision, but Matthew has
>>>>>> received requests for that, so any additional input here on
>>>>>> performance implications is appreciated.
>>>>>>
>>>>>> The thinking here is otherwise that GPU PAT indices with limited
>>>>>> coherency should be used together with WC memory in the same
>>>>>> situations as VRAM/LMEM is used on DGFX.
>>>>>
>>>>> Hm okay, this would be the VM BIND side of things, which deals with
>>>>> GPU PATs. From the CPU side it is just CPU caching modes implicitly
>>>>> selected via DRM_XE_GEM_CPU_CACHING_WB/WC.
>>>>>
>>>>> The question about the analogue of I915_PARAM_HAS_LLC was AFAIU about
>>>>> a performance optimisation where UMD is deciding whether it is okay to
>>>>> skip issuing clflush for the mapped bo if DRM_XE_GEM_CPU_CACHING_WB
>>>>> was used. (If DRM_XE_GEM_CPU_CACHING_WC was used it obviously only
>>>>> needs to flush the write-combine buffer.)
>>>>>
>>>>> Looking at what Mesa is doing, it appears it is not using
>>>>> I915_PARAM_HAS_LLC but has its own device info tables. So I guess the
>>>>> minigbm xe backend will have to do the same if xe will not provide the
>>>>> analogue query.
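[Illustration, not part of the original exchange: a minimal sketch of the
cpu_caching selection at bo create that the thread refers to, assuming the
drm_xe_gem_create layout from the xe uAPI headers. The helper name is made
up, and the placement/flags a real allocator must fill in are omitted.]

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/xe_drm.h>

/* Pick the CPU mmap caching mode at create time: WC for scanout buffers,
 * WB otherwise (WB is not available on discrete, as noted above). */
static int xe_bo_create_sketch(int drm_fd, uint64_t size, bool scanout,
                               uint32_t *handle)
{
	struct drm_xe_gem_create args;

	memset(&args, 0, sizeof(args));
	args.size = size;
	args.cpu_caching = scanout ? DRM_XE_GEM_CPU_CACHING_WC
	                           : DRM_XE_GEM_CPU_CACHING_WB;
	/* args.placement and args.flags intentionally left out of this sketch. */

	if (ioctl(drm_fd, DRM_IOCTL_XE_GEM_CREATE, &args))
		return -1;

	*handle = args.handle;
	return 0;
}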
>>>>
>>>> IMO clflushing in this case should never be needed (again, similar to
>>>> AMD). Whatever renders from / to those buffers should make sure they
>>>> are clflushed before or while accessing them. How is a rendering API
>>>> made aware of these bos? Are they imported using drm prime?
>>>
>>> Yes, they can be imported via drm prime. It's quite a huge code base
>>> with a few abstractions and quite hard to figure out all possible
>>> paths. There are gbm_bo_map paths in the host and vm compositors (Exo
>>> and Sommelier), the Skia library, the camera pipeline, maybe more.
>>>
>>> You mean specifically on the "cpu access end" part of the story
>>> flushing would never be needed? Regardless of llc and !llc, wb or wc
>>> mappings, imported or native buffer? What mechanism would the next
>>> stage in the pipeline (after CPU access) use to ensure the flush?
>>>
>> Currently we're not supporting any older !LLC hardware, and with newer
>> !LLC hardware we're only allowing at least 1-way coherency with
>> dma-bufs, meaning the GPU will acquire the cache line when accessing
>> the bo. If the bo is not a dma-buf, and the GPU wants to access it in
>> non-coherent mode, we're currently enforcing WC cpu-mappings.
>>
>> So:
>> igfx with shared LLC: Not a problem (always coherent)
>> dgfx: Not a problem (always coherent)
>> newer igfx w/o shared LLC: Not a problem (KMD is enforcing coherent
>> access).
>>
>> Now this may of course, as mentioned, have performance implications
>> compared to previous igfx solutions, depending on how it all is
>> written, but it should have similar performance characteristics to
>> dgfx. If we want to change this, relaxing the coherency enforcement and
>> resorting to CLFLUSHes again, could UMD communicate the need for this
>> in the same way it communicates the format of the gbm_bos?
>
> I think it is probably not needed. If minigbm will know mmap is always
> WC on MTL then it can know that all it needs to do to flush is flush
> the write-combine buffer. So the logic in the minigbm xe backend "cpu
> access end" would be something like:
>
> xe_bo_flush(...)
> {
>   if (dgfx || (igfx && !llc))
>     __builtin_ia32_sfence();
> }
>
> Does that sound right to you?
>
> So far on the i915 backend this explicit sfence was not even there in
> the mmap_wc case, which I guess possibly works because something is
> likely to trigger the flush implicitly inside the complexities of the
> call stack anyway.

IIRC WC buffers are also flushed on uncached register writes (does that
include updating the ring tail, perhaps?). It might also be that these
gbm_bos are created WB and the igfx PAT is set to 1-way coherency. What
code creates those gbm_bos?

I figure for dgfx at least we don't want any Intel platform-specific code
in the stack if possible....

>
> The only opens are 1) how to determine if LLC is present, i.e. would xe
> provide a query or would minigbm implement device id tables like Mesa,
> and 2) how to implement the "cpu access begin" side of things, i.e.
> making sure all GPU writes are CPU visible before starting to use the
> mmap-ed bo.

2) Yes, AR on me.

/Thomas

>
> Regards,
>
> Tvrtko
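[Illustration, not part of the original exchange: a slightly fuller sketch of
the "cpu access end" flush logic outlined above, under the assumptions
discussed in the thread -- a WC mapping only needs the write-combine buffer
drained, while a WB mapping on a part without shared LLC would need its cache
lines flushed. The helper name and the wc/llc flags are made up; how the
backend learns them (an xe query vs. Mesa-style device id tables) is exactly
open 1) above.]

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* _mm_sfence(), _mm_clflush() */

#define CACHELINE 64

static void xe_bo_flush_sketch(void *ptr, size_t size, bool wc_mapping,
                               bool has_llc)
{
	if (wc_mapping) {
		/* dgfx, and igfx without shared LLC, map WC: drain the WC buffer. */
		_mm_sfence();
		return;
	}

	if (!has_llc) {
		/* WB mapping without shared LLC: write back the dirty cache lines. */
		uint8_t *p = (uint8_t *)((uintptr_t)ptr & ~(uintptr_t)(CACHELINE - 1));
		uint8_t *end = (uint8_t *)ptr + size;

		for (; p < end; p += CACHELINE)
			_mm_clflush(p);
		_mm_sfence();
	}
	/* WB mapping with shared LLC: nothing to do, as discussed above. */
}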