From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID:
 <07922208-1e3c-6bea-1e2d-5215bd33b38a@linux.intel.com>
Date: Wed, 13 Dec 2023 21:11:50 +0100
Subject: Re: [Intel-xe] LLC configurating, mmap and bo cache management questions
To: Tvrtko Ursulin, "Hellstrom, Thomas", intel-xe@lists.freedesktop.org
References: <02c74379-6513-40cc-a195-c4eeb7a7fd79@linux.intel.com>
 <36e0e059e0f74ca7c55b2647658456fabe26420d.camel@intel.com>
 <2990cd86-7238-4ea9-a9aa-5ab624bea572@linux.intel.com>
 <7fe7d21d-77fc-6163-47d2-944906974cdb@linux.intel.com>
 <5831c125-3b32-05fe-a195-082f6231b4e5@linux.intel.com>
 <7637f53e-8886-4d3f-9e08-07b0f117b882@linux.intel.com>
From: Thomas Hellström
In-Reply-To: <7637f53e-8886-4d3f-9e08-07b0f117b882@linux.intel.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
List-Id: Intel Xe graphics driver
Cc: "Auld, Matthew", "Santa, Carlos", "Vivi, Rodrigo"

On 12/13/23 18:50, Tvrtko Ursulin wrote:
>
> On 13/12/2023 17:27, Thomas Hellström wrote:
>>
>> On 12/13/23 12:55, Tvrtko Ursulin wrote:
>>>
>>> On 07/12/2023 12:01, Thomas Hellström wrote:
>>>>
>>>> On 12/7/23 12:11, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 06/12/2023 11:46, Hellstrom, Thomas wrote:
>>>>>> Hi, Tvrtko.
>>>>>>
>>>>>> On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> Hi Thomas,
>>>>>>>
>>>>>>> On 06/12/2023 08:26, Hellstrom, Thomas wrote:
>>>>>>>> On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We are working on adding xe support to ChromeOS minigbm and have
>>>>>>>>> a couple of questions.
>>>>>>>>>
>>>>>>>>> If I follow things correctly, with xe the mmap caching mode is
>>>>>>>>> fixed to the object caching mode set at bo create. For
>>>>>>>>> framebuffers it will be WC, and for the rest userspace can choose
>>>>>>>>> WB or WC via drm_xe_gem_create->cpu_caching. (Unless discrete,
>>>>>>>>> when WB cannot be used at all.)
>>>>>>>>>
>>>>>>>>> AFAICT minigbm basically cares about two transition points. Let's
>>>>>>>>> call them CPU access begin and end.
>>>>>>>>>
>>>>>>>>> 1)
>>>>>>>>> When a bo is mmapped it wants to invalidate the cache, which
>>>>>>>>> looks to be about making sure all GPU writes have landed in the
>>>>>>>>> backing store. In the i915 world that translates to the
>>>>>>>>> set_domain ioctl.
>>>>>>>>>
>>>>>>>>> What is the uapi for this with xe, or is it somehow guaranteed
>>>>>>>>> to not be needed?
>>>>>>>>
>>>>>>>> Signalling a user-fence or dma-fence obtained as an out-fence from
>>>>>>>> an exec call will guarantee GPU caches are flushed. Currently I
>>>>>>>> don't think there is anything like gem wait in the uAPI, although
>>>>>>>> Matt is just about to add functionality to wait on all outstanding
>>>>>>>> work on an exec_queue.
>>>>>>>
>>>>>>> The problem I see is that there are no execs, and therefore no
>>>>>>> fences, in the minigbm ABI. It's just buffers, created or imported,
>>>>>>> CPU access and some other stuff.
>>>>>>>
>>>>>>> And it is quite extensively used in the OS, so I assume it has to
>>>>>>> work (I mean the invalidation/flushing was not put in there for no
>>>>>>> reason). In other words, where the i915 backend today does
>>>>>>> DRM_I915_GEM_SET_DOMAIN on "cpu access begin", which is buffer
>>>>>>> based, I am not clear how to implement that with xe.
>>>>>>>
>>>>>>> For the outstanding work you mention, since you say it is about an
>>>>>>> exec_queue, I assume again it will not work purely with buffers? If
>>>>>>> so it probably won't be useful for minigbm.
>>>>>>>
>>>>>>> Also if I look at all the other minigbm backends, I see a mix of
>>>>>>> behaviours:
>>>>>>>
>>>>>>>    * msm and vc4 appear to not concern themselves with any of this.
>>>>>>>
>>>>>>>    * rockchip appears to be doing full bounce buffering via memcpy
>>>>>>> on CPU access begin/end.
>>>>>>>
>>>>>>>    * i915 and amdgpu respectively use
>>>>>>> DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer
>>>>>>> based, not execution queue based). Amdgpu curiously does not do any
>>>>>>> flushing on CPU access end.
>>>>>>>
>>>>>>> Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU
>>>>>>> access begin and clflushing on end were added to fix various CTS
>>>>>>> test failures. So I guess we could also wait and see what happens
>>>>>>> there. If those or some others start failing with xe too, then
>>>>>>> propose adding some new uapi. Or if manual testing starts reporting
>>>>>>> visual corruption in UI elements or such.
>>>>>>
>>>>>> Indeed, it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to
>>>>>> AMDGPU, for this use-case. I'll bring that up for discussion.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>>>>> 2)
>>>>>>>>> When a bo is unmapped, or CPU access is finished, it wants to
>>>>>>>>> flush the CPU caches. That is /almost/ completely a CPU
>>>>>>>>> operation, where it just needs to either clflush or flush the WC
>>>>>>>>> buffer respectively, if not for the fact that clflush can be
>>>>>>>>> skipped on platforms with LLC.
>>>>>>>>>
>>>>>>>>> I did not see an equivalent of I915_PARAM_HAS_LLC in xe? Did I
>>>>>>>>> miss it, or what is the plan for querying this detail?
>>>>>>>>
>>>>>>>> XeKMD is generally coherent, except if UMD selects a GPU PAT index
>>>>>>>> with limited coherency together with WB instead of WC memory. In
>>>>>>>> that case, UMD is responsible for doing the needed CLFLUSH-ing,
>>>>>>>> whereas KMD only ensures the initial clearing of the pages is
>>>>>>>> CLFLUSHed for security reasons.
>>>>>>>>
>>>>>>>> I'm not 100% sure if UMD can actually select WB with a limited
>>>>>>>> coherency PAT index in the initial uAPI revision, but Matthew has
>>>>>>>> received requests for that, so any additional input here on
>>>>>>>> performance implications is appreciated.
>>>>>>>>
>>>>>>>> The thinking here is otherwise that GPU PAT indices with limited
>>>>>>>> coherency should be used together with WC memory in the same
>>>>>>>> situations as VRAM/LMEM is used on DGFX.
>>>>>>>
>>>>>>> Hm okay, this would be the VM BIND side of things, which deals with
>>>>>>> GPU PATs. From the CPU side it is just CPU caching modes implicitly
>>>>>>> selected via DRM_XE_GEM_CPU_CACHING_WB/WC.
>>>>>>>
>>>>>>> The question about the analogue of I915_PARAM_HAS_LLC was AFAIU
>>>>>>> about a performance optimisation where UMD is deciding whether it
>>>>>>> is okay to skip issuing clflush for the mapped bo if
>>>>>>> DRM_XE_GEM_CPU_CACHING_WB was used. (If DRM_XE_GEM_CPU_CACHING_WC
>>>>>>> was used it obviously only needs to flush the write-combine
>>>>>>> buffer.)
>>>>>>>
>>>>>>> Looking at what Mesa is doing, it appears it is not using
>>>>>>> I915_PARAM_HAS_LLC but has its own device info tables. So I guess
>>>>>>> the minigbm xe backend will have to do the same if xe will not
>>>>>>> provide the analogue query.
>>>>>>
>>>>>> IMO clflushing in this case should never be needed (again, similar
>>>>>> to AMD). Whatever renders from / to those buffers should make sure
>>>>>> they are clflushed before or while accessing them.
>>>>>> How is a rendering API made aware of these bos? Are they imported
>>>>>> using drm prime?
>>>>>
>>>>> Yes, they can be imported via drm prime. It's quite a huge code base
>>>>> with a few abstractions and quite hard to figure out all possible
>>>>> paths. There are gbm_bo_map paths in the host and vm compositors
>>>>> (Exo and Sommelier), the Skia library, the camera pipeline, maybe
>>>>> more.
>>>>>
>>>>> You mean specifically on the "cpu access end" part of the story
>>>>> flushing would never be needed? Regardless of llc and !llc, wb or wc
>>>>> mappings, imported or native buffer? What mechanism would the next
>>>>> stage in the pipeline (after CPU access) use to ensure the flush?
>>>>>
>>>> Currently we're not supporting any older !LLC hardware, and with
>>>> newer !LLC hardware, we're only allowing at least 1-way coherency
>>>> with dma-bufs, meaning the GPU will acquire the cache line when
>>>> accessing the bo. If the bo is not a dma-buf, and the GPU wants to
>>>> access it in non-coherent mode, we're currently enforcing WC
>>>> cpu-mappings.
>>>>
>>>> So:
>>>> igfx with shared LLC: not a problem (always coherent)
>>>> dgfx: not a problem (always coherent)
>>>> newer igfx w/o shared LLC: not a problem (KMD is enforcing coherent
>>>> access).
>>>>
>>>> Now this may of course, as mentioned, have performance implications
>>>> compared to previous igfx solutions, depending on how it all is
>>>> written, but it should have similar performance characteristics as
>>>> dgfx. If we want to change this and relax the coherency enforcement
>>>> and resort to CLFLUSHes again, could UMD communicate the need for
>>>> this in the same way it communicates the format of the gbm_bos?
>>>
>>> I think it is probably not needed. If minigbm will know that mmap is
>>> always WC on MTL, then it can know that all it needs to do to flush is
>>> flush the write-combine buffer. So the logic in the minigbm xe backend
>>> "cpu access end" would be something like:
>>>
>>> xe_bo_flush(...)
>>> {
>>>   if (dgfx || (igfx && !llc))
>>>     __builtin_ia32_sfence();
>>> }
>>>
>>> Does that sound right to you?
>>>
>>> So far in the i915 backend this explicit sfence was not even there in
>>> the mmap_wc case, which I guess possibly works because something is
>>> likely to trigger the flush implicitly inside the complexities of the
>>> call stack anyway.
>>
>> IIRC WC buffers are flushed also on uncached register writes (does
>> that include updating the ring tail, perhaps?).
>
> If the next in the chain is the GPU, I guess so. It would indeed be a
> typical use case, but for correctness it probably needs an sfence,
> since AFAIU the write-combine buffer is per CPU core.
>
>> It might also be that these gbm_bos are created WB and the igfx PAT is
>> set to 1-way coherency. What code creates those gbm_bos?
>
> Minigbm, on behalf of various callers. It is based on the allocation
> flags, such as whether the bo will be used for scanout, protected
> content, modifiers and such.
>
> When scanout is requested we will set DRM_XE_GEM_CREATE_FLAG_SCANOUT.
> Which would mean scanout bos will be WC and the rest WB-1-way when
> mmapped.
>
>> I figure for dgfx at least we don't want any intel platform-specific
>> code in the stack if possible....
>
> True, it would have to be platform specific.

So if we unconditionally add an sfence, I wouldn't expect the overhead
to be significant even if it's not strictly needed? Or have
gbm_bo_create() forward to gbm_bo_map() whether the bo was created WC?

/Thomas
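
[Editor's note: for readers following along, the unconditional-sfence
variant suggested above could be sketched as below. This is only an
illustration, not minigbm code; the function name xe_bo_flush is
hypothetical, following Tvrtko's naming in the thread. The point is
that sfence merely drains the calling core's write-combine buffers, so
it is cheap enough to issue on every "cpu access end" without querying
an LLC analogue of I915_PARAM_HAS_LLC, while WB mappings on xe remain
at least 1-way coherent and need no clflush.]

```c
#if defined(__x86_64__) || defined(__i386__)
#include <emmintrin.h>  /* _mm_sfence(), the intrinsic form of sfence */
#endif

/*
 * Hypothetical "CPU access end" hook for a minigbm xe backend.
 *
 * Unconditionally drain the per-core write-combine buffers: sfence is
 * a store fence, cheap when there is nothing pending, and it covers
 * the WC-mapped case (scanout bos, dgfx). For WB mappings xe enforces
 * coherent (at least 1-way) GPU access, so no clflush is required.
 */
static inline void xe_bo_flush(void)
{
#if defined(__x86_64__) || defined(__i386__)
	_mm_sfence();
#endif
	/* xe only drives Intel GPUs, so no non-x86 barrier is needed. */
}
```

This sidesteps the device-info-table approach Mesa uses, at the cost of
a redundant fence on shared-LLC parts, which matches the "overhead not
significant" reasoning above.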