Intel-XE Archive on lore.kernel.org
From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	"Hellstrom, Thomas" <thomas.hellstrom@intel.com>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Cc: "Auld, Matthew" <matthew.auld@intel.com>,
	"Santa, Carlos" <carlos.santa@intel.com>,
	"Vivi, Rodrigo" <rodrigo.vivi@intel.com>
Subject: Re: [Intel-xe] LLC configurating, mmap and bo cache management questions
Date: Wed, 13 Dec 2023 21:11:50 +0100	[thread overview]
Message-ID: <07922208-1e3c-6bea-1e2d-5215bd33b38a@linux.intel.com> (raw)
In-Reply-To: <7637f53e-8886-4d3f-9e08-07b0f117b882@linux.intel.com>


On 12/13/23 18:50, Tvrtko Ursulin wrote:
>
> On 13/12/2023 17:27, Thomas Hellström wrote:
>>
>> On 12/13/23 12:55, Tvrtko Ursulin wrote:
>>>
>>> On 07/12/2023 12:01, Thomas Hellström wrote:
>>>>
>>>> On 12/7/23 12:11, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 06/12/2023 11:46, Hellstrom, Thomas wrote:
>>>>>> Hi, Tvrtko.
>>>>>>
>>>>>> On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> Hi Thomas,
>>>>>>>
>>>>>>> On 06/12/2023 08:26, Hellstrom, Thomas wrote:
>>>>>>>> On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We are working on adding xe support to ChromeOS minigbm and have
>>>>>>>>> a
>>>>>>>>> couple questions.
>>>>>>>>>
>>>>>>>>> If I follow things correctly, with xe the mmap caching mode is
>>>>>>>>> fixed to the object caching mode set at bo create. For
>>>>>>>>> framebuffers it will
>>>>>>>>> be WC
>>>>>>>>> and for the rest userspace can choose WB or WC via
>>>>>>>>> drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot
>>>>>>>>> be
>>>>>>>>> used
>>>>>>>>> at all.)
>>>>>>>>>
>>>>>>>>> AFAICT minigbm basically cares about two transition points. Let's
>>>>>>>>> call
>>>>>>>>> them CPU access begin and end.
>>>>>>>>>
>>>>>>>>> 1)
>>>>>>>>> When a bo is mmapped it wants to invalidate the cache, which
>>>>>>>>> looks to
>>>>>>>>> be
>>>>>>>>> about making sure all GPU writes have landed to the backing
>>>>>>>>> store. In
>>>>>>>>> the i915 world that translates to the set_domain ioctl.
>>>>>>>>>
>>>>>>>>> What is the uapi for this with xe, or is it somehow guaranteed to
>>>>>>>>> not
>>>>>>>>> be
>>>>>>>>> needed?
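[For reference, the i915 uapi mentioned above is the set_domain ioctl. Below is a minimal sketch of the "CPU access begin" call as a minigbm-style backend would prepare it. The struct layout and domain flags mirror drm/i915_drm.h (copied locally so the sketch is self-contained); the handle and the write/wc parameters are illustrative placeholders.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrors drm/i915_drm.h, copied here so the sketch is self-contained. */
#define I915_GEM_DOMAIN_CPU 0x00000001
#define I915_GEM_DOMAIN_WC  0x00000080

struct drm_i915_gem_set_domain {
	uint32_t handle;       /* GEM handle of the bo */
	uint32_t read_domains; /* domains the CPU is about to read from */
	uint32_t write_domain; /* at most one domain for writing */
};

/* "CPU access begin": move the bo to the CPU (or WC) domain so that all
 * pending GPU writes land in the backing store before the mmap is
 * accessed.  Building the argument struct only; the actual ioctl call
 * is sketched in the usage note below. */
static struct drm_i915_gem_set_domain
i915_cpu_access_begin_args(uint32_t handle, int write, int wc_mapping)
{
	struct drm_i915_gem_set_domain arg;
	uint32_t domain = wc_mapping ? I915_GEM_DOMAIN_WC : I915_GEM_DOMAIN_CPU;

	memset(&arg, 0, sizeof(arg));
	arg.handle = handle;
	arg.read_domains = domain;
	arg.write_domain = write ? domain : 0;
	return arg;
}
```

The real call is then `ioctl(drm_fd, DRM_IOCTL_I915_GEM_SET_DOMAIN, &arg)`, which blocks until outstanding rendering to the bo has completed — this is the per-buffer wait that has no direct xe equivalent, as discussed below.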
>>>>>>>>
>>>>>>>> Signalling a user-fence or dma-fence obtained as an out-fence from
>>>>>>>> an
>>>>>>>> exec call will guarantee GPU caches are flushed. Currently I don't
>>>>>>>> think there is anything like gem wait in the uAPI, although 
>>>>>>>> Matt is
>>>>>>>> just about to add functionality to wait on all outstanding work on
>>>>>>>> an
>>>>>>>> exec_queue.
>>>>>>>
>>>>>>> Problem I see is there are no execs or therefore fences in the
>>>>>>> minigbm
>>>>>>> ABI. It's just buffers, created or imported, CPU access and some
>>>>>>> other
>>>>>>> stuff.
>>>>>>>
>>>>>>> And it is quite extensively used in the OS so I assume it has to 
>>>>>>> work
>>>>>>> (I
>>>>>>> mean the invalidation/flushing was not put in there for no reason),
>>>>>>> in
>>>>>>> other words, where the i915 backend today does
>>>>>>> DRM_I915_GEM_SET_DOMAIN
>>>>>>> on "cpu access begin", which is buffer based, I am not clear how to
>>>>>>> implement that with xe.
>>>>>>>
>>>>>>> For the outstanding work you mention, since you say it is about
>>>>>>> exec_queue, I assume again it will not work purely with buffers? If
>>>>>>> so
>>>>>>> it probably won't be useful for minigbm.
>>>>>>>
>>>>>>> Also if I look at all the other minigbm backends, I see a mix of
>>>>>>> behaviours:
>>>>>>>
>>>>>>>    * msm and vc4 appear to not concern themselves with any of this.
>>>>>>>
>>>>>>>    * rockchip appears to be doing full bounce buffering via 
>>>>>>> memcpy on
>>>>>>> CPU
>>>>>>> access begin/end.
>>>>>>>
>>>>>>>    * i915 and amdgpu respectively use
>>>>>>> DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer 
>>>>>>> based,
>>>>>>> not
>>>>>>> execution queue). Amdgpu curiously does not do any flushing on CPU
>>>>>>> access end.
>>>>>>>
>>>>>>> Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU 
>>>>>>> access
>>>>>>> begin and clflushing on end were added to fix various CTS test
>>>>>>> failures.
>>>>>>> So I guess we could also wait and see what happens there. If those
>>>>>>> or some start failing with xe too, then propose adding some new
>>>>>>> uapi. Or if manual testing starts reporting visual corruption in
>>>>>>> the UI elements or such.
>>>>>>
>>>>>> Indeed it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to
>>>>>> AMDGPU for this use-case. I'll bring that up for discussion.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>>>>> 2)
>>>>>>>>> When a bo is unmapped, or CPU access finished, it wants to flush
>>>>>>>>> the
>>>>>>>>> CPU
>>>>>>>>> caches. That is /almost/ completely a CPU operation, where it
>>>>>>>>> just
>>>>>>>>> needs
>>>>>>>>> to either clflush or invalidate the WC buffer respectively, if
>>>>>>>>> not
>>>>>>>>> the
>>>>>>>>> fact that clflush can be skipped on platforms with LLC.
>>>>>>>>>
>>>>>>>>> I did not see an equivalent of an I915_PARAM_HAS_LLC in xe? Did I
>>>>>>>>> miss
>>>>>>>>> it or what it is the plan for querying this detail?
>>>>>>>>
>>>>>>>> XeKMD is generally coherent, except if UMD selects a GPU PAT index
>>>>>>>> with
>>>>>>>> limited coherency together with WB instead of WC memory. In that
>>>>>>>> case,
>>>>>>>> UMD is responsible for doing the needed CLFLUSH-ing, whereas KMD
>>>>>>>> only
>>>>>>>> ensures initial clearing of the pages is CLFLUSHED for security
>>>>>>>> reasons.
>>>>>>>>
>>>>>>>> I'm not 100% sure if UMD can actually select WB with limited
>>>>>>>> coherency
>>>>>>>> PAT index in the initial uAPI revision, but Matthew has received
>>>>>>>> requests for that so any additional input here on performance
>>>>>>>> implications is appreciated.
>>>>>>>>
>>>>>>>> The thinking here is otherwise that GPU PAT indices with limited
>>>>>>>> coherency should be used together with WC memory in the same
>>>>>>>> situations
>>>>>>>> as VRAM/LMEM is used on DGFX.
>>>>>>>
>>>>>>> Hm okay, this would be the VM BIND side of things which deals with
>>>>>>> GPU
>>>>>>> PATs. From the CPU side it is just CPU caching modes implicitly
>>>>>>> selected
>>>>>>> via DRM_XE_GEM_CPU_CACHING_WB/WC.
>>>>>>>
>>>>>>> The question about the analogue of I915_PARAM_HAS_LLC was AFAIU 
>>>>>>> about
>>>>>>> a
>>>>>>> performance optimisation where UMD is deciding whether it is 
>>>>>>> okay to
>>>>>>> skip issuing clflush for the mapped bo if DRM_XE_GEM_CPU_CACHING_WB
>>>>>>> was
>>>>>>> used. (If DRM_XE_GEM_CPU_CACHING_WC was used it obviously only 
>>>>>>> needs
>>>>>>> to
>>>>>>> flush the write combine buffer.)
>>>>>>>
>>>>>>> Looking at what Mesa is doing it appears it is not using
>>>>>>> I915_PARAM_HAS_LLC but has its own device info tables. So I guess
>>>>>>> minigbm xe backend will have to do the same if xe will not provide
>>>>>>> the
>>>>>>> analogue query.
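[The Mesa-style approach mentioned above could be mirrored with a small device-info table in the xe backend. A hedged sketch follows; the platform enum and table entries are hypothetical stand-ins — a real backend would key the table off PCI device IDs, the way Mesa's device info tables do.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical platform ids; a real backend would derive these from
 * the PCI device id, as Mesa's device info tables do. */
enum xe_platform { PLAT_ADL, PLAT_MTL, PLAT_DG2 };

struct xe_device_info {
	enum xe_platform plat;
	bool is_dgfx;   /* discrete: mmap is always WC */
	bool has_llc;   /* LLC shared between CPU and GPU */
};

static const struct xe_device_info device_info[] = {
	{ PLAT_ADL, false, true  },  /* igfx with shared LLC */
	{ PLAT_MTL, false, false },  /* newer igfx, no shared LLC */
	{ PLAT_DG2, true,  false },  /* dgfx */
};

/* "CPU access end": with a shared LLC, a WB mapping is coherent and no
 * flush is needed at all; WC mappings (dgfx, and igfx without shared
 * LLC) only need the write-combine buffer drained with an sfence. */
static bool xe_needs_sfence(const struct xe_device_info *info)
{
	return info->is_dgfx || !info->has_llc;
}
```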
>>>>>>
>>>>>> IMO clflushing in this case should never be needed, (Again, 
>>>>>> similar to
>>>>>> AMD). Whatever renders from / to those buffers should make sure they
>>>>>> are clflushed before or while accessing them. How is a rendering API
>>>>>> made aware of these bos? Are they imported using drm prime?
>>>>>
>>>>> Yes they can be imported via drm prime. It's quite a large code 
>>>>> base with a few abstractions, and quite hard to figure out all 
>>>>> possible paths. There are gbm_bo_map paths in the host and vm 
>>>>> compositors (Exo and Sommelier), Skia library, camera pipeline, 
>>>>> maybe more.
>>>>>
>>>>> You mean specifically on the "cpu access end" part of the story, 
>>>>> flushing would never be needed? Regardless of llc and !llc, wb or 
>>>>> wc mappings, imported or native buffer? What mechanism would the 
>>>>> next in the pipeline (after CPU access) use to ensure the flush?
>>>>>
>>>> Currently we're not supporting any older !LLC hardware, and with 
>>>> newer !LLC hardware, we're only allowing at least 1-way coherency 
>>>> with dma-bufs, meaning the GPU will acquire the cache line when 
>>>> accessing the bo. If the bo is not a dma-buf, and the GPU wants to 
>>>> access it in non-coherent mode, we're currently enforcing WC 
>>>> cpu-mappings.
>>>>
>>>> So:
>>>> igfx with shared LLC: Not a problem (always coherent)
>>>> dgfx: Not a problem (always coherent)
>>>> newer igfx w/o shared LLC: Not a problem (KMD is enforcing coherent access).
>>>>
>>>> Now this may of course, as mentioned, have performance implications 
>>>> compared to previous igfx solutions, depending on how it all is 
>>>> written, but should have similar performance characteristics as 
>>>> dgfx. If we want to change this and relax the coherency enforcement 
>>>> and resort to CLFLUSHes again, could UMD communicate the need for 
>>>> this in the same way it communicates the format of the gbm_bos?
>>>
>>> I think it is probably not needed. If minigbm knows mmap is always 
>>> WC on MTL, then it knows that all it needs to do to flush is flush 
>>> the write-combine buffer. So the logic in the minigbm xe backend's 
>>> "cpu access end" would be like:
>>>
>>> xe_bo_flush(...)
>>> {
>>>   if (dgfx || (igfx && !llc))
>>>     __builtin_ia32_sfence();
>>> }
>>>
>>> Does that sound right to you?
>>>
>>> So far on the i915 backend this explicit sfence was not even there 
>>> in the mmap_wc case, which I guess possibly works because something 
>>> is likely to trigger the flush implicitly inside the complexities of 
>>> the call stack anyway.
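[Making the xe_bo_flush() sketch above concrete: a compilable version using the SSE intrinsics, with the dgfx/llc decision collapsed into caller-supplied flags rather than detected. The clflush branch covers a hypothetical non-coherent WB mapping and is included only for completeness — per the discussion, xe currently enforces WC for that case.]

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_sfence, _mm_clflush, _mm_mfence (SSE2) */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* "CPU access end" for an mmapped bo.  For WC mappings (dgfx, or igfx
 * without shared LLC) draining the per-core write-combine buffer with
 * an sfence is enough.  For a WB mapping without coherency (not
 * currently produced by xe; kept here for completeness) the cache
 * lines would have to be clflushed instead. */
static void xe_bo_flush(void *ptr, size_t size, bool wc_mapping, bool coherent)
{
	if (wc_mapping) {
		_mm_sfence();
	} else if (!coherent) {
		uint8_t *p = (uint8_t *)ptr;

		for (size_t i = 0; i < size; i += 64)  /* 64-byte cache lines */
			_mm_clflush(p + i);
		_mm_mfence();  /* CLFLUSH is ordered by MFENCE, not SFENCE */
	}
	/* WB + coherent (shared LLC, or a 1-way coherent PAT): no-op. */
}
```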
>>
>> IIRC WC buffers are also flushed on uncached register writes (does 
>> that include updating the ring tail, perhaps?).
>
> If next in the chain is the GPU, I guess so. It would indeed be a 
> typical use case, but for correctness it probably needs an sfence 
> since AFAIU the write-combine buffer is per CPU core.
>
>> It might also be that these gbm_bos are created WB and the igfx PAT 
>> is set to 1-way coherency. What code creates those gbm_bos?
>
> Minigbm on behalf of various callers. It is based on the allocation 
> flags such as will it be for scanout, protected content, modifiers and 
> such.
>
> When scanout is requested we will set DRM_XE_GEM_CREATE_FLAG_SCANOUT, 
> which would mean scanout bos will be WC and the rest WB-1-way when 
> mmapped.
>
>> I figure for dgfx at least we don't want any intel platform-specific 
>> code in the stack if possible....
>
> True, would have to be platform specific.

So if we unconditionally add an sfence, I wouldn't expect the overhead 
to be significant even if it's not strictly needed? Or have the 
gbm_bo_create() forward to gbm_bo_map() whether the bo was created WC?
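[The second option raised here — forwarding the caching mode chosen at create time to the map/flush path — could look roughly like this. The minigbm-side bookkeeping struct and helpers are hypothetical; the DRM_XE_GEM_CPU_CACHING_* values are from the xe uapi header.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from the xe uapi (xe_drm.h). */
#define DRM_XE_GEM_CPU_CACHING_WB 1
#define DRM_XE_GEM_CPU_CACHING_WC 2

/* Hypothetical minigbm-side bookkeeping: remember at create time which
 * cpu_caching mode was passed to drm_xe_gem_create, so the map and
 * flush paths need no platform tables at all. */
struct xe_bo {
	uint32_t handle;
	uint16_t cpu_caching;  /* as passed in drm_xe_gem_create */
};

static struct xe_bo xe_bo_create(uint32_t handle, bool scanout)
{
	struct xe_bo bo = {
		.handle = handle,
		/* Scanout bos must be WC; others can be WB (coherent).
		 * On dgfx, create would have to force WC here too, since
		 * WB cannot be used at all. */
		.cpu_caching = scanout ? DRM_XE_GEM_CPU_CACHING_WC
				       : DRM_XE_GEM_CPU_CACHING_WB,
	};
	return bo;
}

/* "CPU access end": only WC mappings need the sfence; WB mappings are
 * kept coherent by the KMD per the discussion above. */
static bool xe_bo_flush_needs_sfence(const struct xe_bo *bo)
{
	return bo->cpu_caching == DRM_XE_GEM_CPU_CACHING_WC;
}
```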

/Thomas




Thread overview: 13+ messages
2023-12-05 14:19 [Intel-xe] LLC configurating, mmap and bo cache management questions Tvrtko Ursulin
2023-12-06  8:26 ` Hellstrom, Thomas
2023-12-06 10:58   ` Tvrtko Ursulin
2023-12-06 11:46     ` Hellstrom, Thomas
2023-12-07 11:11       ` Tvrtko Ursulin
2023-12-07 12:01         ` Thomas Hellström
2023-12-13 11:55           ` Tvrtko Ursulin
2023-12-13 17:27             ` Thomas Hellström
2023-12-13 17:50               ` Tvrtko Ursulin
2023-12-13 20:11                 ` Thomas Hellström [this message]
2023-12-14  8:10                   ` Tvrtko Ursulin
2023-12-14 10:52                     ` Thomas Hellström
2023-12-14 12:34                       ` Tvrtko Ursulin
