* [Intel-xe] LLC configurating, mmap and bo cache management questions
@ 2023-12-05 14:19 Tvrtko Ursulin
  2023-12-06  8:26 ` Hellstrom, Thomas
  0 siblings, 1 reply; 13+ messages in thread

From: Tvrtko Ursulin @ 2023-12-05 14:19 UTC (permalink / raw)
To: intel-xe; +Cc: Thomas Hellstrom, Matthew Auld, Carlos Santa, Rodrigo Vivi

Hi,

We are working on adding xe support to ChromeOS minigbm and have a couple of questions.

If I follow things correctly, with xe the mmap caching mode is fixed to the caching mode set at bo create. For framebuffers it will be WC, and for the rest userspace can choose WB or WC via drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot be used at all.)

AFAICT minigbm basically cares about two transition points. Let's call them CPU access begin and end.

1)
When a bo is mmapped it wants to invalidate the cache, which looks to be about making sure all GPU writes have landed in the backing store. In the i915 world that translates to the set_domain ioctl.

What is the uapi for this with xe, or is it somehow guaranteed to not be needed?

2)
When a bo is unmapped, or CPU access is finished, it wants to flush the CPU caches. That is /almost/ completely a CPU operation, where it just needs to either clflush or drain the WC buffer respectively, were it not for the fact that clflush can be skipped on platforms with LLC.

I did not see an equivalent of I915_PARAM_HAS_LLC in xe? Did I miss it, or what is the plan for querying this detail?

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 13+ messages in thread
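For reference, the "CPU access end" half of the pattern described above can be sketched in C. This is an illustrative sketch, not actual minigbm code: `for_each_cacheline`, the flush callback, and the 64-byte line size are assumptions for illustration. On x86 the callback would wrap the clflush instruction (or be a no-op on shared-LLC platforms), while the "CPU access begin" half would be the DRM_IOCTL_I915_GEM_SET_DOMAIN ioctl mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64  /* assumed x86 cache-line size */

/* Callback invoked per cache line; on x86 this would execute clflush
 * on the line, and could be NULL (no-op) on shared-LLC platforms. */
typedef void (*flush_line_fn)(void *line);

/* Walk a mapped range cache line by cache line, handing each aligned
 * line to the flush callback. Returns the number of lines visited so
 * the range arithmetic is easy to check. */
static size_t for_each_cacheline(void *ptr, size_t len, flush_line_fn fn)
{
	uintptr_t p = (uintptr_t)ptr & ~(uintptr_t)(CACHELINE - 1);
	uintptr_t end = (uintptr_t)ptr + len;
	size_t n = 0;

	for (; p < end; p += CACHELINE, n++)
		if (fn)
			fn((void *)p);
	return n;
}
```

Note the start address is aligned down and the walk runs to the unaligned end, so a range straddling a line boundary flushes both lines.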
* Re: [Intel-xe] LLC configurating, mmap and bo cache management questions 2023-12-05 14:19 [Intel-xe] LLC configurating, mmap and bo cache management questions Tvrtko Ursulin @ 2023-12-06 8:26 ` Hellstrom, Thomas 2023-12-06 10:58 ` Tvrtko Ursulin 0 siblings, 1 reply; 13+ messages in thread From: Hellstrom, Thomas @ 2023-12-06 8:26 UTC (permalink / raw) To: intel-xe@lists.freedesktop.org, tvrtko.ursulin@linux.intel.com Cc: Auld, Matthew, Santa, Carlos, Vivi, Rodrigo On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote: > > Hi, > > We are working on adding xe support to ChromeOS minigbm and have a > couple questions. > > If I follow things correctly with xe mmap caching mode is fixed to > object caching modes set at bo create. For framebuffers it will be WC > and for the rest userspace can choose WB or WC via > drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot be > used > at all.) > > AFAICT minigbm basically cares about two transition points. Lets call > them CPU access begin and end. > > 1) > When a bo is mmapped it wants to invalidate the cache, which looks to > be > about making sure all GPU writes have landed to the backing store. In > the i915 world that translates to the set_domain ioctl. > > What is the uapi for this with xe, or it is somehow guaranteed to not > be > needed? Signalling a user-fence or dma-fence obtained as an out-fence from an exec call will guarantee GPU caches are flushed. Currently I don't think there is anything like gem wait in the uAPI, although Matt is just about to add functionality to wait on all outstanding work on an exec_queue. > > 2) > When a bo is unmapped, or CPU access finished, it wants to flush the > CPU > caches. That is /almost/ completely a CPU operation, where it just > needs > to either clflush or invalidate the WC buffer respectively, if not > the > fact that clflush can be skipped on platforms with LLC. > > I did not see an equivalent of an I915_PARAM_HAS_LLC in xe? 
Did I > miss > it or what it is the plan for querying this detail? XeKMD is generally coherent, except if UMD selects a GPU PAT index with limited coherency together with WB instead of WC memory. In that case, UMD is responsible for doing the needed CLFLUSH-ing, whereas KMD only ensures initial clearing of the pages is CLFLUSHED for security reasons. I'm not 100% sure if UMD can actually select WB with limited coherency PAT index in the initial uAPI revision, but Matthew has received requests for that so any additional input here on performance implications is appreciated. The thinking here is otherwise that GPU PAT indices with limited coherency should be used together with WC memory in the same situations as VRAM/LMEM is used on DGFX. /Thomas > > Regards, > > Tvrtko ^ permalink raw reply [flat|nested] 13+ messages in thread
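As an aside, the create-time choice discussed in the opening mail (framebuffers get WC, and on discrete WB cannot be used at all) could be condensed into a helper along these lines. This is a sketch: the helper name is hypothetical, and the WB/WC values are defined locally to keep it self-contained; a real backend would use the DRM_XE_GEM_CPU_CACHING_* defines from the xe uAPI header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the drm_xe_gem_create.cpu_caching values. */
#define XE_CPU_CACHING_WB 1
#define XE_CPU_CACHING_WC 2

/* Hypothetical helper: pick the CPU caching mode at bo create time.
 * Scanout buffers get WC, and on discrete WB is not available, so
 * only a non-scanout bo on integrated can use WB. */
static uint16_t choose_cpu_caching(bool scanout, bool is_dgfx)
{
	if (scanout || is_dgfx)
		return XE_CPU_CACHING_WC;
	return XE_CPU_CACHING_WB;
}
```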
* Re: [Intel-xe] LLC configurating, mmap and bo cache management questions 2023-12-06 8:26 ` Hellstrom, Thomas @ 2023-12-06 10:58 ` Tvrtko Ursulin 2023-12-06 11:46 ` Hellstrom, Thomas 0 siblings, 1 reply; 13+ messages in thread From: Tvrtko Ursulin @ 2023-12-06 10:58 UTC (permalink / raw) To: Hellstrom, Thomas, intel-xe@lists.freedesktop.org Cc: Auld, Matthew, Santa, Carlos, Vivi, Rodrigo Hi Thomas, On 06/12/2023 08:26, Hellstrom, Thomas wrote: > On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote: >> >> Hi, >> >> We are working on adding xe support to ChromeOS minigbm and have a >> couple questions. >> >> If I follow things correctly with xe mmap caching mode is fixed to >> object caching modes set at bo create. For framebuffers it will be WC >> and for the rest userspace can choose WB or WC via >> drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot be >> used >> at all.) >> >> AFAICT minigbm basically cares about two transition points. Lets call >> them CPU access begin and end. >> >> 1) >> When a bo is mmapped it wants to invalidate the cache, which looks to >> be >> about making sure all GPU writes have landed to the backing store. In >> the i915 world that translates to the set_domain ioctl. >> >> What is the uapi for this with xe, or it is somehow guaranteed to not >> be >> needed? > > Signalling a user-fence or dma-fence obtained as an out-fence from an > exec call will guarantee GPU caches are flushed. Currently I don't > think there is anything like gem wait in the uAPI, although Matt is > just about to add functionality to wait on all outstanding work on an > exec_queue. Problem I see is there are no execs or therefore fences in the minigbm ABI. It's just buffers, created or imported, CPU access and some other stuff. 
And it is quite extensively used in the OS, so I assume it has to work (I mean the invalidation/flushing was not put in there for no reason). In other words, where the i915 backend today does DRM_I915_GEM_SET_DOMAIN on "cpu access begin", which is buffer based, I am not clear how to implement that with xe.

For the outstanding work you mention, since you say it is about exec_queue, I assume again it will not work purely with buffers? If so it probably won't be useful for minigbm.

Also, if I look at all the other minigbm backends, I see a mix of behaviours:

* msm and vc4 appear not to concern themselves with any of this.

* rockchip appears to be doing full bounce buffering via memcpy on CPU access begin/end.

* i915 and amdgpu respectively use DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer based, not execution queue based). Amdgpu curiously does not do any flushing on CPU access end.

Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU access begin and clflushing on end were added to fix various CTS test failures. So I guess we could also wait and see what happens there. If those or some of them start failing with xe too, we can propose adding some new uapi. Or if manual testing starts reporting visual corruption in UI elements or such.

>> 2)
>> When a bo is unmapped, or CPU access finished, it wants to flush the
>> CPU caches. That is /almost/ completely a CPU operation, where it just
>> needs to either clflush or invalidate the WC buffer respectively, if not
>> the fact that clflush can be skipped on platforms with LLC.
>>
>> I did not see an equivalent of an I915_PARAM_HAS_LLC in xe?
>
> XeKMD is generally coherent, except if UMD selects a GPU PAT index with
> limited coherency together with WB instead of WC memory.
In that case, > UMD is responsible for doing the needed CLFLUSH-ing, whereas KMD only > ensures initial clearing of the pages is CLFLUSHED for security > reasons. > > I'm not 100% sure if UMD can actually select WB with limited coherency > PAT index in the initial uAPI revision, but Matthew has received > requests for that so any additional input here on performance > implications is appreciated. > > The thinking here is otherwise that GPU PAT indices with limited > coherency should be used together with WC memory in the same situations > as VRAM/LMEM is used on DGFX. Hm okay, this would be the VM BIND side of things which deals with GPU PATs. From the CPU side it is just CPU caching modes implicitly selected via DRM_XE_GEM_CPU_CACHING_WB/WC. The question about the analogue of I915_PARAM_HAS_LLC was AFAIU about a performance optimisation where UMD is deciding whether it is okay to skip issuing clflush for the mapped bo if DRM_XE_GEM_CPU_CACHING_WB was used. (If DRM_XE_GEM_CPU_CACHING_WC was used it obviously only needs to flush the write combine buffer.) Looking at what Mesa is doing it appears it is not using I915_PARAM_HAS_LLC but has its own device info tables. So I guess minigbm xe backend will have to do the same if xe will not provide the analogue query. Regards, Tvrtko ^ permalink raw reply [flat|nested] 13+ messages in thread
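The Mesa-style approach mentioned above (the driver's own device info tables instead of an I915_PARAM_HAS_LLC query) might look roughly like this sketch. The PCI IDs and table entries here are made up purely for illustration; a real table would be generated from the platform list, as Mesa's intel_device_info is.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entries for illustration only. */
struct dev_info {
	uint16_t pci_id;
	bool     is_dgfx;
	bool     has_llc;
};

static const struct dev_info dev_info_table[] = {
	{ 0x1111, false, true  },   /* made-up integrated part, shared LLC  */
	{ 0x2222, true,  false },   /* made-up discrete part, no shared LLC */
};

/* Whether clflush can be skipped after CPU writes: true only when the
 * device is known to share its LLC with the CPU. Unknown devices fall
 * back to flushing, which is the safe default. */
static bool can_skip_clflush(uint16_t pci_id)
{
	for (size_t i = 0; i < sizeof(dev_info_table) / sizeof(dev_info_table[0]); i++)
		if (dev_info_table[i].pci_id == pci_id)
			return dev_info_table[i].has_llc;
	return false; /* unknown device: flush to be safe */
}
```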
* Re: [Intel-xe] LLC configurating, mmap and bo cache management questions 2023-12-06 10:58 ` Tvrtko Ursulin @ 2023-12-06 11:46 ` Hellstrom, Thomas 2023-12-07 11:11 ` Tvrtko Ursulin 0 siblings, 1 reply; 13+ messages in thread From: Hellstrom, Thomas @ 2023-12-06 11:46 UTC (permalink / raw) To: intel-xe@lists.freedesktop.org, tvrtko.ursulin@linux.intel.com Cc: Auld, Matthew, Santa, Carlos, Vivi, Rodrigo Hi, Tvrtko. On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote: > > Hi Thomas, > > On 06/12/2023 08:26, Hellstrom, Thomas wrote: > > On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote: > > > > > > Hi, > > > > > > We are working on adding xe support to ChromeOS minigbm and have > > > a > > > couple questions. > > > > > > If I follow things correctly with xe mmap caching mode is fixed > > > to > > > object caching modes set at bo create. For framebuffers it will > > > be WC > > > and for the rest userspace can choose WB or WC via > > > drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot > > > be > > > used > > > at all.) > > > > > > AFAICT minigbm basically cares about two transition points. Lets > > > call > > > them CPU access begin and end. > > > > > > 1) > > > When a bo is mmapped it wants to invalidate the cache, which > > > looks to > > > be > > > about making sure all GPU writes have landed to the backing > > > store. In > > > the i915 world that translates to the set_domain ioctl. > > > > > > What is the uapi for this with xe, or it is somehow guaranteed to > > > not > > > be > > > needed? > > > > Signalling a user-fence or dma-fence obtained as an out-fence from > > an > > exec call will guarantee GPU caches are flushed. Currently I don't > > think there is anything like gem wait in the uAPI, although Matt is > > just about to add functionality to wait on all outstanding work on > > an > > exec_queue. > > Problem I see is there are no execs or therefore fences in the > minigbm > ABI. 
It's just buffers, created or imported, CPU access and some > other > stuff. > > And it is quite extensively used in the OS so I assume it has to work > (I > mean the invalidation/flushing was not put in there for no reason), > in > other words, where the i915 backend today does > DRM_I915_GEM_SET_DOMAIN > on "cpu access begin", which is buffer based, I am not clear how to > implement that with xe. > > For the outstanding work you mention, since you say it is about > exec_queue, I assume again it will not work purely with buffers? If > so > it probably won't be useful for minigbm. > > Also if I look at all the other minigbm backends, I see a mix of > behaviours: > > * msm and vc4 appear to not concern themselves with any of this. > > * rockchip appears to be doing full bounce buffering via memcpy on > CPU > access begin/end. > > * i915 and amdgpu respectively use > DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer based, > not > execution queue). Andgpu curiously does not do any flushing on CPU > access end. > > Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU access > begin and clflushing on end were added to fix various CTS test > failures. > So I guess we could also wait and see what happens there. If those or > some will be failing with xe too then propose adding some new uapi. > Or > if manual testing will start reporting visual corruption in the UI > elements or such. Indeed it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to AMDGPU for this use-case. I'll bring that up for discussion. > > > > 2) > > > When a bo is unmapped, or CPU access finished, it wants to flush > > > the > > > CPU > > > caches. That is /almost/ completely a CPU operation, where it > > > just > > > needs > > > to either clflush or invalidate the WC buffer respectively, if > > > not > > > the > > > fact that clflush can be skipped on platforms with LLC. > > > > > > I did not see an equivalent of an I915_PARAM_HAS_LLC in xe? 
Did I > > > miss > > > it or what it is the plan for querying this detail? > > > > XeKMD is generally coherent, except if UMD selects a GPU PAT index > > with > > limited coherency together with WB instead of WC memory. In that > > case, > > UMD is responsible for doing the needed CLFLUSH-ing, whereas KMD > > only > > ensures initial clearing of the pages is CLFLUSHED for security > > reasons. > > > > I'm not 100% sure if UMD can actually select WB with limited > > coherency > > PAT index in the initial uAPI revision, but Matthew has received > > requests for that so any additional input here on performance > > implications is appreciated. > > > > The thinking here is otherwise that GPU PAT indices with limited > > coherency should be used together with WC memory in the same > > situations > > as VRAM/LMEM is used on DGFX. > > Hm okay, this would be the VM BIND side of things which deals with > GPU > PATs. From the CPU side it is just CPU caching modes implicitly > selected > via DRM_XE_GEM_CPU_CACHING_WB/WC. > > The question about the analogue of I915_PARAM_HAS_LLC was AFAIU about > a > performance optimisation where UMD is deciding whether it is okay to > skip issuing clflush for the mapped bo if DRM_XE_GEM_CPU_CACHING_WB > was > used. (If DRM_XE_GEM_CPU_CACHING_WC was used it obviously only needs > to > flush the write combine buffer.) > > Looking at what Mesa is doing it appears it is not using > I915_PARAM_HAS_LLC but has its own device info tables. So I guess > minigbm xe backend will have to do the same if xe will not provide > the > analogue query. IMO clflushing in this case should never be needed, (Again, similar to AMD). Whatever renders from / to those buffers should make sure they are clflushed before or while accessing them. How is a rendering API made aware of these bos? Are they imported using drm prime? /Thomas > > Regards, > > Tvrtko ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Intel-xe] LLC configurating, mmap and bo cache management questions 2023-12-06 11:46 ` Hellstrom, Thomas @ 2023-12-07 11:11 ` Tvrtko Ursulin 2023-12-07 12:01 ` Thomas Hellström 0 siblings, 1 reply; 13+ messages in thread From: Tvrtko Ursulin @ 2023-12-07 11:11 UTC (permalink / raw) To: Hellstrom, Thomas, intel-xe@lists.freedesktop.org Cc: Auld, Matthew, Santa, Carlos, Vivi, Rodrigo On 06/12/2023 11:46, Hellstrom, Thomas wrote: > Hi, Tvrtko. > > On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote: >> >> Hi Thomas, >> >> On 06/12/2023 08:26, Hellstrom, Thomas wrote: >>> On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote: >>>> >>>> Hi, >>>> >>>> We are working on adding xe support to ChromeOS minigbm and have >>>> a >>>> couple questions. >>>> >>>> If I follow things correctly with xe mmap caching mode is fixed >>>> to >>>> object caching modes set at bo create. For framebuffers it will >>>> be WC >>>> and for the rest userspace can choose WB or WC via >>>> drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot >>>> be >>>> used >>>> at all.) >>>> >>>> AFAICT minigbm basically cares about two transition points. Lets >>>> call >>>> them CPU access begin and end. >>>> >>>> 1) >>>> When a bo is mmapped it wants to invalidate the cache, which >>>> looks to >>>> be >>>> about making sure all GPU writes have landed to the backing >>>> store. In >>>> the i915 world that translates to the set_domain ioctl. >>>> >>>> What is the uapi for this with xe, or it is somehow guaranteed to >>>> not >>>> be >>>> needed? >>> >>> Signalling a user-fence or dma-fence obtained as an out-fence from >>> an >>> exec call will guarantee GPU caches are flushed. Currently I don't >>> think there is anything like gem wait in the uAPI, although Matt is >>> just about to add functionality to wait on all outstanding work on >>> an >>> exec_queue. >> >> Problem I see is there are no execs or therefore fences in the >> minigbm >> ABI. 
It's just buffers, created or imported, CPU access and some >> other >> stuff. >> >> And it is quite extensively used in the OS so I assume it has to work >> (I >> mean the invalidation/flushing was not put in there for no reason), >> in >> other words, where the i915 backend today does >> DRM_I915_GEM_SET_DOMAIN >> on "cpu access begin", which is buffer based, I am not clear how to >> implement that with xe. >> >> For the outstanding work you mention, since you say it is about >> exec_queue, I assume again it will not work purely with buffers? If >> so >> it probably won't be useful for minigbm. >> >> Also if I look at all the other minigbm backends, I see a mix of >> behaviours: >> >> * msm and vc4 appear to not concern themselves with any of this. >> >> * rockchip appears to be doing full bounce buffering via memcpy on >> CPU >> access begin/end. >> >> * i915 and amdgpu respectively use >> DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer based, >> not >> execution queue). Andgpu curiously does not do any flushing on CPU >> access end. >> >> Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU access >> begin and clflushing on end were added to fix various CTS test >> failures. >> So I guess we could also wait and see what happens there. If those or >> some will be failing with xe too then propose adding some new uapi. >> Or >> if manual testing will start reporting visual corruption in the UI >> elements or such. > > Indeed it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to > AMDGPU for this use-case. I'll bring that up for discussion. Thanks! >>>> 2) >>>> When a bo is unmapped, or CPU access finished, it wants to flush >>>> the >>>> CPU >>>> caches. That is /almost/ completely a CPU operation, where it >>>> just >>>> needs >>>> to either clflush or invalidate the WC buffer respectively, if >>>> not >>>> the >>>> fact that clflush can be skipped on platforms with LLC. 
>>>> >>>> I did not see an equivalent of an I915_PARAM_HAS_LLC in xe? Did I >>>> miss >>>> it or what it is the plan for querying this detail? >>> >>> XeKMD is generally coherent, except if UMD selects a GPU PAT index >>> with >>> limited coherency together with WB instead of WC memory. In that >>> case, >>> UMD is responsible for doing the needed CLFLUSH-ing, whereas KMD >>> only >>> ensures initial clearing of the pages is CLFLUSHED for security >>> reasons. >>> >>> I'm not 100% sure if UMD can actually select WB with limited >>> coherency >>> PAT index in the initial uAPI revision, but Matthew has received >>> requests for that so any additional input here on performance >>> implications is appreciated. >>> >>> The thinking here is otherwise that GPU PAT indices with limited >>> coherency should be used together with WC memory in the same >>> situations >>> as VRAM/LMEM is used on DGFX. >> >> Hm okay, this would be the VM BIND side of things which deals with >> GPU >> PATs. From the CPU side it is just CPU caching modes implicitly >> selected >> via DRM_XE_GEM_CPU_CACHING_WB/WC. >> >> The question about the analogue of I915_PARAM_HAS_LLC was AFAIU about >> a >> performance optimisation where UMD is deciding whether it is okay to >> skip issuing clflush for the mapped bo if DRM_XE_GEM_CPU_CACHING_WB >> was >> used. (If DRM_XE_GEM_CPU_CACHING_WC was used it obviously only needs >> to >> flush the write combine buffer.) >> >> Looking at what Mesa is doing it appears it is not using >> I915_PARAM_HAS_LLC but has its own device info tables. So I guess >> minigbm xe backend will have to do the same if xe will not provide >> the >> analogue query. > > IMO clflushing in this case should never be needed, (Again, similar to > AMD). Whatever renders from / to those buffers should make sure they > are clflushed before or while accessing them. How is a rendering API > made aware of these bos? Are they imported using drm prime? Yes they can be imported via drm prime. 
It's quite a huge code base with a few abstractions, and quite hard to figure out all possible paths. There are gbm_bo_map paths in the host and vm compositors (Exo and Sommelier), the Skia library, the camera pipeline, maybe more.

You mean that specifically on the "cpu access end" part of the story flushing would never be needed? Regardless of llc or !llc, wb or wc mappings, imported or native buffer? What mechanism would the next stage in the pipeline (after CPU access) use to ensure the flush?

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 13+ messages in thread
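For context, the "cpu access end" cases being asked about here can be summarised as a small decision helper. This is a sketch of the i915-era heuristic discussed earlier in the thread, with hypothetical names: a WC mapping only needs the CPU's write-combining buffers drained (a store fence on x86), while a WB mapping needs clflush unless the platform has a shared LLC.

```c
#include <stdbool.h>

enum cpu_caching { CACHING_WB, CACHING_WC };
enum end_action  { END_NONE, END_SFENCE, END_CLFLUSH };

/* i915-era heuristic (sketch): what, if anything, must happen on
 * "CPU access end" before the GPU or display consumes the buffer. */
static enum end_action cpu_access_end_action(enum cpu_caching c, bool has_llc)
{
	if (c == CACHING_WC)
		return END_SFENCE;              /* drain WC buffers only */
	return has_llc ? END_NONE : END_CLFLUSH; /* WB: flush unless LLC */
}
```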
* Re: [Intel-xe] LLC configurating, mmap and bo cache management questions 2023-12-07 11:11 ` Tvrtko Ursulin @ 2023-12-07 12:01 ` Thomas Hellström 2023-12-13 11:55 ` Tvrtko Ursulin 0 siblings, 1 reply; 13+ messages in thread From: Thomas Hellström @ 2023-12-07 12:01 UTC (permalink / raw) To: Tvrtko Ursulin, Hellstrom, Thomas, intel-xe@lists.freedesktop.org Cc: Auld, Matthew, Santa, Carlos, Vivi, Rodrigo On 12/7/23 12:11, Tvrtko Ursulin wrote: > > On 06/12/2023 11:46, Hellstrom, Thomas wrote: >> Hi, Tvrtko. >> >> On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote: >>> >>> Hi Thomas, >>> >>> On 06/12/2023 08:26, Hellstrom, Thomas wrote: >>>> On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote: >>>>> >>>>> Hi, >>>>> >>>>> We are working on adding xe support to ChromeOS minigbm and have >>>>> a >>>>> couple questions. >>>>> >>>>> If I follow things correctly with xe mmap caching mode is fixed >>>>> to >>>>> object caching modes set at bo create. For framebuffers it will >>>>> be WC >>>>> and for the rest userspace can choose WB or WC via >>>>> drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot >>>>> be >>>>> used >>>>> at all.) >>>>> >>>>> AFAICT minigbm basically cares about two transition points. Lets >>>>> call >>>>> them CPU access begin and end. >>>>> >>>>> 1) >>>>> When a bo is mmapped it wants to invalidate the cache, which >>>>> looks to >>>>> be >>>>> about making sure all GPU writes have landed to the backing >>>>> store. In >>>>> the i915 world that translates to the set_domain ioctl. >>>>> >>>>> What is the uapi for this with xe, or it is somehow guaranteed to >>>>> not >>>>> be >>>>> needed? >>>> >>>> Signalling a user-fence or dma-fence obtained as an out-fence from >>>> an >>>> exec call will guarantee GPU caches are flushed. Currently I don't >>>> think there is anything like gem wait in the uAPI, although Matt is >>>> just about to add functionality to wait on all outstanding work on >>>> an >>>> exec_queue. 
>>> >>> Problem I see is there are no execs or therefore fences in the >>> minigbm >>> ABI. It's just buffers, created or imported, CPU access and some >>> other >>> stuff. >>> >>> And it is quite extensively used in the OS so I assume it has to work >>> (I >>> mean the invalidation/flushing was not put in there for no reason), >>> in >>> other words, where the i915 backend today does >>> DRM_I915_GEM_SET_DOMAIN >>> on "cpu access begin", which is buffer based, I am not clear how to >>> implement that with xe. >>> >>> For the outstanding work you mention, since you say it is about >>> exec_queue, I assume again it will not work purely with buffers? If >>> so >>> it probably won't be useful for minigbm. >>> >>> Also if I look at all the other minigbm backends, I see a mix of >>> behaviours: >>> >>> * msm and vc4 appear to not concern themselves with any of this. >>> >>> * rockchip appears to be doing full bounce buffering via memcpy on >>> CPU >>> access begin/end. >>> >>> * i915 and amdgpu respectively use >>> DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer based, >>> not >>> execution queue). Andgpu curiously does not do any flushing on CPU >>> access end. >>> >>> Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU access >>> begin and clflushing on end were added to fix various CTS test >>> failures. >>> So I guess we could also wait and see what happens there. If those or >>> some will be failing with xe too then propose adding some new uapi. >>> Or >>> if manual testing will start reporting visual corruption in the UI >>> elements or such. >> >> Indeed it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to >> AMDGPU for this use-case. I'll bring that up for discussion. > > Thanks! > >>>>> 2) >>>>> When a bo is unmapped, or CPU access finished, it wants to flush >>>>> the >>>>> CPU >>>>> caches. 
That is /almost/ completely a CPU operation, where it >>>>> just >>>>> needs >>>>> to either clflush or invalidate the WC buffer respectively, if >>>>> not >>>>> the >>>>> fact that clflush can be skipped on platforms with LLC. >>>>> >>>>> I did not see an equivalent of an I915_PARAM_HAS_LLC in xe? Did I >>>>> miss >>>>> it or what it is the plan for querying this detail? >>>> >>>> XeKMD is generally coherent, except if UMD selects a GPU PAT index >>>> with >>>> limited coherency together with WB instead of WC memory. In that >>>> case, >>>> UMD is responsible for doing the needed CLFLUSH-ing, whereas KMD >>>> only >>>> ensures initial clearing of the pages is CLFLUSHED for security >>>> reasons. >>>> >>>> I'm not 100% sure if UMD can actually select WB with limited >>>> coherency >>>> PAT index in the initial uAPI revision, but Matthew has received >>>> requests for that so any additional input here on performance >>>> implications is appreciated. >>>> >>>> The thinking here is otherwise that GPU PAT indices with limited >>>> coherency should be used together with WC memory in the same >>>> situations >>>> as VRAM/LMEM is used on DGFX. >>> >>> Hm okay, this would be the VM BIND side of things which deals with >>> GPU >>> PATs. From the CPU side it is just CPU caching modes implicitly >>> selected >>> via DRM_XE_GEM_CPU_CACHING_WB/WC. >>> >>> The question about the analogue of I915_PARAM_HAS_LLC was AFAIU about >>> a >>> performance optimisation where UMD is deciding whether it is okay to >>> skip issuing clflush for the mapped bo if DRM_XE_GEM_CPU_CACHING_WB >>> was >>> used. (If DRM_XE_GEM_CPU_CACHING_WC was used it obviously only needs >>> to >>> flush the write combine buffer.) >>> >>> Looking at what Mesa is doing it appears it is not using >>> I915_PARAM_HAS_LLC but has its own device info tables. So I guess >>> minigbm xe backend will have to do the same if xe will not provide >>> the >>> analogue query. 
>>
>> IMO clflushing in this case should never be needed, (Again, similar to
>> AMD). Whatever renders from / to those buffers should make sure they
>> are clflushed before or while accessing them. How is a rendering API
>> made aware of these bos? Are they imported using drm prime?
>
> Yes they can be imported via drm prime. It's a quite a huge code base
> with a few abstractions and quite hard to figure out all possible
> paths. There are gbm_bo_map paths in the host and vm compositors (Exo
> and Sommelier), Skia library, camera pipeline, maybe more.
>
> You mean specifically on the "cpu access end" part of the story
> flushing would never be needed? Regardless of llc and !llc, wb or wc
> mappings, imported or native buffer? What mechanism would next in the
> pipeline (after CPU access) use to ensure the flush?
>
Currently we're not supporting any older !LLC hardware, and with newer !LLC hardware we're only allowing at least 1-way coherency with dma-bufs, meaning the GPU will acquire the cache line when accessing the bo. If the bo is not a dma-buf, and the GPU wants to access it in non-coherent mode, we're currently enforcing WC cpu-mappings.

So:
igfx with shared LLC: Not a problem (always coherent)
dgfx: Not a problem (always coherent)
newer igfx without shared LLC: Not a problem (KMD is enforcing coherent access)

Now this may of course, as mentioned, have performance implications compared to previous igfx solutions, depending on how it all is written, but it should have performance characteristics similar to dgfx.

If we want to change this and relax the coherency enforcement and resort to CLFLUSHes again, could UMD communicate the need for this in the same way it communicates the format of the gbm_bos?

/Thomas

> Regards,
>
> Tvrtko

^ permalink raw reply	[flat|nested] 13+ messages in thread
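One way to read the rules above is as a predicate over the bo configurations the KMD currently allows. The sketch below is an interpretation, with parameter names invented for illustration, not uAPI: dma-bufs require an at least 1-way coherent PAT, and non-coherent GPU access to a non-dma-buf bo is only allowed together with a WC CPU mapping, which is why UMD clflushing is never required under the current policy.

```c
#include <stdbool.h>

/* Sketch of the currently enforced policy, as summarised above:
 * a combination is allowed if the GPU PAT is coherent, or if the bo
 * is a plain (non-dma-buf) bo mapped WC on the CPU side. */
static bool combination_allowed(bool coherent_pat, bool wb_mapping,
				bool is_dmabuf)
{
	if (coherent_pat)
		return true;                    /* always coherent */
	/* non-coherent PAT: dma-bufs are rejected outright, and plain
	 * bos must use WC CPU mappings. */
	return !is_dmabuf && !wb_mapping;
}
```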
* Re: [Intel-xe] LLC configurating, mmap and bo cache management questions 2023-12-07 12:01 ` Thomas Hellström @ 2023-12-13 11:55 ` Tvrtko Ursulin 2023-12-13 17:27 ` Thomas Hellström 0 siblings, 1 reply; 13+ messages in thread From: Tvrtko Ursulin @ 2023-12-13 11:55 UTC (permalink / raw) To: Thomas Hellström, Hellstrom, Thomas, intel-xe@lists.freedesktop.org Cc: Auld, Matthew, Santa, Carlos, Vivi, Rodrigo On 07/12/2023 12:01, Thomas Hellström wrote: > > On 12/7/23 12:11, Tvrtko Ursulin wrote: >> >> On 06/12/2023 11:46, Hellstrom, Thomas wrote: >>> Hi, Tvrtko. >>> >>> On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote: >>>> >>>> Hi Thomas, >>>> >>>> On 06/12/2023 08:26, Hellstrom, Thomas wrote: >>>>> On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> We are working on adding xe support to ChromeOS minigbm and have >>>>>> a >>>>>> couple questions. >>>>>> >>>>>> If I follow things correctly with xe mmap caching mode is fixed >>>>>> to >>>>>> object caching modes set at bo create. For framebuffers it will >>>>>> be WC >>>>>> and for the rest userspace can choose WB or WC via >>>>>> drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot >>>>>> be >>>>>> used >>>>>> at all.) >>>>>> >>>>>> AFAICT minigbm basically cares about two transition points. Lets >>>>>> call >>>>>> them CPU access begin and end. >>>>>> >>>>>> 1) >>>>>> When a bo is mmapped it wants to invalidate the cache, which >>>>>> looks to >>>>>> be >>>>>> about making sure all GPU writes have landed to the backing >>>>>> store. In >>>>>> the i915 world that translates to the set_domain ioctl. >>>>>> >>>>>> What is the uapi for this with xe, or it is somehow guaranteed to >>>>>> not >>>>>> be >>>>>> needed? >>>>> >>>>> Signalling a user-fence or dma-fence obtained as an out-fence from >>>>> an >>>>> exec call will guarantee GPU caches are flushed. 
Currently I don't >>>>> think there is anything like gem wait in the uAPI, although Matt is >>>>> just about to add functionality to wait on all outstanding work on >>>>> an >>>>> exec_queue. >>>> >>>> Problem I see is there are no execs or therefore fences in the >>>> minigbm >>>> ABI. It's just buffers, created or imported, CPU access and some >>>> other >>>> stuff. >>>> >>>> And it is quite extensively used in the OS so I assume it has to work >>>> (I >>>> mean the invalidation/flushing was not put in there for no reason), >>>> in >>>> other words, where the i915 backend today does >>>> DRM_I915_GEM_SET_DOMAIN >>>> on "cpu access begin", which is buffer based, I am not clear how to >>>> implement that with xe. >>>> >>>> For the outstanding work you mention, since you say it is about >>>> exec_queue, I assume again it will not work purely with buffers? If >>>> so >>>> it probably won't be useful for minigbm. >>>> >>>> Also if I look at all the other minigbm backends, I see a mix of >>>> behaviours: >>>> >>>> * msm and vc4 appear to not concern themselves with any of this. >>>> >>>> * rockchip appears to be doing full bounce buffering via memcpy on >>>> CPU >>>> access begin/end. >>>> >>>> * i915 and amdgpu respectively use >>>> DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer based, >>>> not >>>> execution queue). Andgpu curiously does not do any flushing on CPU >>>> access end. >>>> >>>> Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU access >>>> begin and clflushing on end were added to fix various CTS test >>>> failures. >>>> So I guess we could also wait and see what happens there. If those or >>>> some will be failing with xe too then propose adding some new uapi. >>>> Or >>>> if manual testing will start reporting visual corruption in the UI >>>> elements or such. >>> >>> Indeed it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to >>> AMDGPU for this use-case. I'll bring that up for discussion. >> >> Thanks! 
>> >>>>>> 2) >>>>>> When a bo is unmapped, or CPU access finished, it wants to flush >>>>>> the >>>>>> CPU >>>>>> caches. That is /almost/ completely a CPU operation, where it >>>>>> just >>>>>> needs >>>>>> to either clflush or invalidate the WC buffer respectively, if >>>>>> not >>>>>> the >>>>>> fact that clflush can be skipped on platforms with LLC. >>>>>> >>>>>> I did not see an equivalent of an I915_PARAM_HAS_LLC in xe? Did I >>>>>> miss >>>>>> it or what it is the plan for querying this detail? >>>>> >>>>> XeKMD is generally coherent, except if UMD selects a GPU PAT index >>>>> with >>>>> limited coherency together with WB instead of WC memory. In that >>>>> case, >>>>> UMD is responsible for doing the needed CLFLUSH-ing, whereas KMD >>>>> only >>>>> ensures initial clearing of the pages is CLFLUSHED for security >>>>> reasons. >>>>> >>>>> I'm not 100% sure if UMD can actually select WB with limited >>>>> coherency >>>>> PAT index in the initial uAPI revision, but Matthew has received >>>>> requests for that so any additional input here on performance >>>>> implications is appreciated. >>>>> >>>>> The thinking here is otherwise that GPU PAT indices with limited >>>>> coherency should be used together with WC memory in the same >>>>> situations >>>>> as VRAM/LMEM is used on DGFX. >>>> >>>> Hm okay, this would be the VM BIND side of things which deals with >>>> GPU >>>> PATs. From the CPU side it is just CPU caching modes implicitly >>>> selected >>>> via DRM_XE_GEM_CPU_CACHING_WB/WC. >>>> >>>> The question about the analogue of I915_PARAM_HAS_LLC was AFAIU about >>>> a >>>> performance optimisation where UMD is deciding whether it is okay to >>>> skip issuing clflush for the mapped bo if DRM_XE_GEM_CPU_CACHING_WB >>>> was >>>> used. (If DRM_XE_GEM_CPU_CACHING_WC was used it obviously only needs >>>> to >>>> flush the write combine buffer.) 
>>>> >>>> Looking at what Mesa is doing it appears it is not using >>>> I915_PARAM_HAS_LLC but has its own device info tables. So I guess >>>> minigbm xe backend will have to do the same if xe will not provide >>>> the >>>> analogue query. >>> >>> IMO clflushing in this case should never be needed, (Again, similar to >>> AMD). Whatever renders from / to those buffers should make sure they >>> are clflushed before or while accessing them. How is a rendering API >>> made aware of these bos? Are they imported using drm prime? >> >> Yes they can be imported via drm prime. It's a quite a huge code base >> with a few abstractions and quite hard to figure out all possible >> paths. There are gbm_bo_map paths in the host and vm compositors (Exo >> and Sommelier), Skia library, camera pipeline, maybe more. >> >> You mean specifically on the "cpu access end" part of the story >> flushing would never be needed? Regardless of llc and !llc, wb or wc >> mappings, imported or native buffer? What mechanism would next in the >> pipeline (after CPU access) use to ensure the flush? >> > Currently we're not supporting any older !LLC hardware, and with newer > !LLC hardware, we're only allowing at least 1-way coherency with > dma-bufs, meaning the GPU will acquire the cache line when accessing the > bo. If the bo is not a dma-buf, and the GPU wants to access it in > non-coherent mode, we're currently enforcing WC cpu-mappings. > > So: > igfx with shared LLC: Not a problem (always coherent) > dgfx: Not a problem (always coherent) > newer igfx WO shared LLC (KMD is enforcing coherent access). > > Now this may of course, as mentions, have performance implications > compared to previous igfx solutions, depending on how it all is written, > but should have similar performance characteristics as dgfx. 
> If we want to change this and relax the coherency enforcement and
> resort to CLFLUSHes again, could UMD communicate the need for this
> in the same way it communicates the format of the gbm_bos?

I think it is probably not needed. If minigbm will know mmap is always
WC on MTL then it can know that all it needs to do to flush is flush
the write-combine buffer. So the logic in the minigbm xe backend "cpu
access end" would be like:

xe_bo_flush(...)
{
	if (dgfx || (igfx && !llc))
		__builtin_ia32_sfence();
}

Does that sound right to you?

So far on the i915 backend this explicit sfence was not even there in
the mmap_wc case, which I guess possibly works because something is
likely to trigger the flush implicitly inside the complexities of the
call stack anyway.

The only opens are 1) how to determine if LLC is present, ie. would xe
provide a query or minigbm would implement device id tables like Mesa,
and 2) how to implement the "cpu access begin" side of things, ie.
making sure all GPU writes are CPU visible before starting to use the
mmap-ed bo.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 13+ messages in thread
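For reference, the `xe_bo_flush()` sketch in the message above can be fleshed out into a compilable form. The helper names and the dgfx/llc/mapping plumbing are assumptions for illustration only, not minigbm or xe uAPI; the non-x86 fallback fence is likewise a sketch:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical minigbm-style policy for "CPU access end": WC mappings
 * need their write-combine buffer drained, WB mappings on shared-LLC
 * igfx need nothing, and WB without a coherency guarantee would need a
 * clflush pass (not currently reachable with xe's enforced WC). */
enum flush_kind { FLUSH_NONE, FLUSH_SFENCE, FLUSH_CLFLUSH };

enum flush_kind xe_bo_flush_kind(bool dgfx, bool llc, bool mapped_wc)
{
	if (mapped_wc)
		return FLUSH_SFENCE;	/* drain per-core WC buffer */
	if (!dgfx && llc)
		return FLUSH_NONE;	/* shared LLC keeps WB coherent */
	return FLUSH_CLFLUSH;		/* WB, no coherency guarantee */
}

void xe_bo_flush(bool dgfx, bool llc, bool mapped_wc)
{
	switch (xe_bo_flush_kind(dgfx, llc, mapped_wc)) {
	case FLUSH_SFENCE:
#if defined(__x86_64__) || defined(__i386__)
		__builtin_ia32_sfence();
#else
		__sync_synchronize();	/* generic fallback for the sketch */
#endif
		break;
	case FLUSH_CLFLUSH:
		/* a clflush loop over the mapping would go here */
		break;
	case FLUSH_NONE:
		break;
	}
}
```

On shared-LLC igfx with a WB mapping the whole call collapses to a no-op, matching the `dgfx || (igfx && !llc)` condition in the message; the clflush branch only matters if xe ever exposes WB mappings with limited-coherency PAT indices.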
* Re: [Intel-xe] LLC configurating, mmap and bo cache management questions 2023-12-13 11:55 ` Tvrtko Ursulin @ 2023-12-13 17:27 ` Thomas Hellström 2023-12-13 17:50 ` Tvrtko Ursulin 0 siblings, 1 reply; 13+ messages in thread From: Thomas Hellström @ 2023-12-13 17:27 UTC (permalink / raw) To: Tvrtko Ursulin, Hellstrom, Thomas, intel-xe@lists.freedesktop.org Cc: Auld, Matthew, Santa, Carlos, Vivi, Rodrigo On 12/13/23 12:55, Tvrtko Ursulin wrote: > > On 07/12/2023 12:01, Thomas Hellström wrote: >> >> On 12/7/23 12:11, Tvrtko Ursulin wrote: >>> >>> On 06/12/2023 11:46, Hellstrom, Thomas wrote: >>>> Hi, Tvrtko. >>>> >>>> On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote: >>>>> >>>>> Hi Thomas, >>>>> >>>>> On 06/12/2023 08:26, Hellstrom, Thomas wrote: >>>>>> On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> We are working on adding xe support to ChromeOS minigbm and have >>>>>>> a >>>>>>> couple questions. >>>>>>> >>>>>>> If I follow things correctly with xe mmap caching mode is fixed >>>>>>> to >>>>>>> object caching modes set at bo create. For framebuffers it will >>>>>>> be WC >>>>>>> and for the rest userspace can choose WB or WC via >>>>>>> drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot >>>>>>> be >>>>>>> used >>>>>>> at all.) >>>>>>> >>>>>>> AFAICT minigbm basically cares about two transition points. Lets >>>>>>> call >>>>>>> them CPU access begin and end. >>>>>>> >>>>>>> 1) >>>>>>> When a bo is mmapped it wants to invalidate the cache, which >>>>>>> looks to >>>>>>> be >>>>>>> about making sure all GPU writes have landed to the backing >>>>>>> store. In >>>>>>> the i915 world that translates to the set_domain ioctl. >>>>>>> >>>>>>> What is the uapi for this with xe, or it is somehow guaranteed to >>>>>>> not >>>>>>> be >>>>>>> needed? >>>>>> >>>>>> Signalling a user-fence or dma-fence obtained as an out-fence from >>>>>> an >>>>>> exec call will guarantee GPU caches are flushed. 
Currently I don't >>>>>> think there is anything like gem wait in the uAPI, although Matt is >>>>>> just about to add functionality to wait on all outstanding work on >>>>>> an >>>>>> exec_queue. >>>>> >>>>> Problem I see is there are no execs or therefore fences in the >>>>> minigbm >>>>> ABI. It's just buffers, created or imported, CPU access and some >>>>> other >>>>> stuff. >>>>> >>>>> And it is quite extensively used in the OS so I assume it has to work >>>>> (I >>>>> mean the invalidation/flushing was not put in there for no reason), >>>>> in >>>>> other words, where the i915 backend today does >>>>> DRM_I915_GEM_SET_DOMAIN >>>>> on "cpu access begin", which is buffer based, I am not clear how to >>>>> implement that with xe. >>>>> >>>>> For the outstanding work you mention, since you say it is about >>>>> exec_queue, I assume again it will not work purely with buffers? If >>>>> so >>>>> it probably won't be useful for minigbm. >>>>> >>>>> Also if I look at all the other minigbm backends, I see a mix of >>>>> behaviours: >>>>> >>>>> * msm and vc4 appear to not concern themselves with any of this. >>>>> >>>>> * rockchip appears to be doing full bounce buffering via memcpy on >>>>> CPU >>>>> access begin/end. >>>>> >>>>> * i915 and amdgpu respectively use >>>>> DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer based, >>>>> not >>>>> execution queue). Andgpu curiously does not do any flushing on CPU >>>>> access end. >>>>> >>>>> Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU access >>>>> begin and clflushing on end were added to fix various CTS test >>>>> failures. >>>>> So I guess we could also wait and see what happens there. If those or >>>>> some will be failing with xe too then propose adding some new uapi. >>>>> Or >>>>> if manual testing will start reporting visual corruption in the UI >>>>> elements or such. >>>> >>>> Indeed it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to >>>> AMDGPU for this use-case. 
I'll bring that up for discussion. >>> >>> Thanks! >>> >>>>>>> 2) >>>>>>> When a bo is unmapped, or CPU access finished, it wants to flush >>>>>>> the >>>>>>> CPU >>>>>>> caches. That is /almost/ completely a CPU operation, where it >>>>>>> just >>>>>>> needs >>>>>>> to either clflush or invalidate the WC buffer respectively, if >>>>>>> not >>>>>>> the >>>>>>> fact that clflush can be skipped on platforms with LLC. >>>>>>> >>>>>>> I did not see an equivalent of an I915_PARAM_HAS_LLC in xe? Did I >>>>>>> miss >>>>>>> it or what it is the plan for querying this detail? >>>>>> >>>>>> XeKMD is generally coherent, except if UMD selects a GPU PAT index >>>>>> with >>>>>> limited coherency together with WB instead of WC memory. In that >>>>>> case, >>>>>> UMD is responsible for doing the needed CLFLUSH-ing, whereas KMD >>>>>> only >>>>>> ensures initial clearing of the pages is CLFLUSHED for security >>>>>> reasons. >>>>>> >>>>>> I'm not 100% sure if UMD can actually select WB with limited >>>>>> coherency >>>>>> PAT index in the initial uAPI revision, but Matthew has received >>>>>> requests for that so any additional input here on performance >>>>>> implications is appreciated. >>>>>> >>>>>> The thinking here is otherwise that GPU PAT indices with limited >>>>>> coherency should be used together with WC memory in the same >>>>>> situations >>>>>> as VRAM/LMEM is used on DGFX. >>>>> >>>>> Hm okay, this would be the VM BIND side of things which deals with >>>>> GPU >>>>> PATs. From the CPU side it is just CPU caching modes implicitly >>>>> selected >>>>> via DRM_XE_GEM_CPU_CACHING_WB/WC. >>>>> >>>>> The question about the analogue of I915_PARAM_HAS_LLC was AFAIU about >>>>> a >>>>> performance optimisation where UMD is deciding whether it is okay to >>>>> skip issuing clflush for the mapped bo if DRM_XE_GEM_CPU_CACHING_WB >>>>> was >>>>> used. (If DRM_XE_GEM_CPU_CACHING_WC was used it obviously only needs >>>>> to >>>>> flush the write combine buffer.) 
>>>>> >>>>> Looking at what Mesa is doing it appears it is not using >>>>> I915_PARAM_HAS_LLC but has its own device info tables. So I guess >>>>> minigbm xe backend will have to do the same if xe will not provide >>>>> the >>>>> analogue query. >>>> >>>> IMO clflushing in this case should never be needed, (Again, similar to >>>> AMD). Whatever renders from / to those buffers should make sure they >>>> are clflushed before or while accessing them. How is a rendering API >>>> made aware of these bos? Are they imported using drm prime? >>> >>> Yes they can be imported via drm prime. It's a quite a huge code >>> base with a few abstractions and quite hard to figure out all >>> possible paths. There are gbm_bo_map paths in the host and vm >>> compositors (Exo and Sommelier), Skia library, camera pipeline, >>> maybe more. >>> >>> You mean specifically on the "cpu access end" part of the story >>> flushing would never be needed? Regardless of llc and !llc, wb or wc >>> mappings, imported or native buffer? What mechanism would next in >>> the pipeline (after CPU access) use to ensure the flush? >>> >> Currently we're not supporting any older !LLC hardware, and with >> newer !LLC hardware, we're only allowing at least 1-way coherency >> with dma-bufs, meaning the GPU will acquire the cache line when >> accessing the bo. If the bo is not a dma-buf, and the GPU wants to >> access it in non-coherent mode, we're currently enforcing WC >> cpu-mappings. >> >> So: >> igfx with shared LLC: Not a problem (always coherent) >> dgfx: Not a problem (always coherent) >> newer igfx WO shared LLC (KMD is enforcing coherent access). >> >> Now this may of course, as mentions, have performance implications >> compared to previous igfx solutions, depending on how it all is >> written, but should have similar performance characteristics as dgfx. 
>> If we want to change this and relax the coherency enforcement and
>> resort to CLFLUSHes again, could UMD communicate the need for this
>> in the same way it communicates the format of the gbm_bos?
>
> I think it is probably not needed. If minigbm will know mmap is always
> WC on MTL then it can know that all it needs to do to flush is flush
> the write-combine buffer. So the logic in minigbm xe backend "cpu
> access end" would be like:
>
> xe_bo_flush(...)
> {
> 	if (dgfx || (igfx && !llc))
> 		__builtin_ia32_sfence();
> }
>
> Does that sound right to you?
>
> So far on the i915 backend this explicit sfence was not even there in
> the mmap_wc case, which I guess possibly works because something is
> likely to trigger the flush implicitly inside the complexities of the
> call stack anyway.

IIRC WC buffers are flushed also on uncached register writes (does that
include updating the ring tail perhaps?).

It might also be that these gbm_bos are created WB and the igfx PAT is
set to 1-way coherency. What code creates those gbm_bos?

I figure for dgfx at least we don't want any intel platform-specific
code in the stack if possible....

> The only opens are 1) how to determine if LLC is present, ie. would xe
> provide a query or minigbm would implement device id tables like Mesa,
> and 2) how to implement the "cpu access begin" side of things, ie.
> making sure all GPU writes are CPU visible before starting to use the
> mmap-ed bo.

2) Yes, AR on me.

/Thomas

> Regards,
>
> Tvrtko

^ permalink raw reply	[flat|nested] 13+ messages in thread
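To make the create-time side of the question above concrete: a minimal sketch of how a minigbm xe backend might pick the CPU caching mode at bo create, following the policy discussed in the thread (scanout bos, and everything on discrete, get WC; other bos get WB). The constants mirror `drm_xe_gem_create.cpu_caching` values at the time of writing but real code should include the uapi `xe_drm.h` rather than redefine them, and the helper name is made up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirror DRM_XE_GEM_CPU_CACHING_WB/WC from xe_drm.h; illustrative only,
 * include the real uapi header in actual code. */
#define XE_CPU_CACHING_WB 1
#define XE_CPU_CACHING_WC 2

/* Hypothetical minigbm helper: scanout bos must be mappable WC (the
 * KMD enforces this for non-coherent GPU access), and discrete parts
 * do not allow WB at all. Everything else can use WB. */
uint16_t choose_cpu_caching(bool scanout, bool dgfx)
{
	if (scanout || dgfx)
		return XE_CPU_CACHING_WC;
	return XE_CPU_CACHING_WB;
}
```

A backend would feed the result into `drm_xe_gem_create.cpu_caching`, alongside `DRM_XE_GEM_CREATE_FLAG_SCANOUT` for scanout allocations.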
* Re: [Intel-xe] LLC configurating, mmap and bo cache management questions 2023-12-13 17:27 ` Thomas Hellström @ 2023-12-13 17:50 ` Tvrtko Ursulin 2023-12-13 20:11 ` Thomas Hellström 0 siblings, 1 reply; 13+ messages in thread From: Tvrtko Ursulin @ 2023-12-13 17:50 UTC (permalink / raw) To: Thomas Hellström, Hellstrom, Thomas, intel-xe@lists.freedesktop.org Cc: Auld, Matthew, Santa, Carlos, Vivi, Rodrigo On 13/12/2023 17:27, Thomas Hellström wrote: > > On 12/13/23 12:55, Tvrtko Ursulin wrote: >> >> On 07/12/2023 12:01, Thomas Hellström wrote: >>> >>> On 12/7/23 12:11, Tvrtko Ursulin wrote: >>>> >>>> On 06/12/2023 11:46, Hellstrom, Thomas wrote: >>>>> Hi, Tvrtko. >>>>> >>>>> On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote: >>>>>> >>>>>> Hi Thomas, >>>>>> >>>>>> On 06/12/2023 08:26, Hellstrom, Thomas wrote: >>>>>>> On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> We are working on adding xe support to ChromeOS minigbm and have >>>>>>>> a >>>>>>>> couple questions. >>>>>>>> >>>>>>>> If I follow things correctly with xe mmap caching mode is fixed >>>>>>>> to >>>>>>>> object caching modes set at bo create. For framebuffers it will >>>>>>>> be WC >>>>>>>> and for the rest userspace can choose WB or WC via >>>>>>>> drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot >>>>>>>> be >>>>>>>> used >>>>>>>> at all.) >>>>>>>> >>>>>>>> AFAICT minigbm basically cares about two transition points. Lets >>>>>>>> call >>>>>>>> them CPU access begin and end. >>>>>>>> >>>>>>>> 1) >>>>>>>> When a bo is mmapped it wants to invalidate the cache, which >>>>>>>> looks to >>>>>>>> be >>>>>>>> about making sure all GPU writes have landed to the backing >>>>>>>> store. In >>>>>>>> the i915 world that translates to the set_domain ioctl. >>>>>>>> >>>>>>>> What is the uapi for this with xe, or it is somehow guaranteed to >>>>>>>> not >>>>>>>> be >>>>>>>> needed? 
>>>>>>> >>>>>>> Signalling a user-fence or dma-fence obtained as an out-fence from >>>>>>> an >>>>>>> exec call will guarantee GPU caches are flushed. Currently I don't >>>>>>> think there is anything like gem wait in the uAPI, although Matt is >>>>>>> just about to add functionality to wait on all outstanding work on >>>>>>> an >>>>>>> exec_queue. >>>>>> >>>>>> Problem I see is there are no execs or therefore fences in the >>>>>> minigbm >>>>>> ABI. It's just buffers, created or imported, CPU access and some >>>>>> other >>>>>> stuff. >>>>>> >>>>>> And it is quite extensively used in the OS so I assume it has to work >>>>>> (I >>>>>> mean the invalidation/flushing was not put in there for no reason), >>>>>> in >>>>>> other words, where the i915 backend today does >>>>>> DRM_I915_GEM_SET_DOMAIN >>>>>> on "cpu access begin", which is buffer based, I am not clear how to >>>>>> implement that with xe. >>>>>> >>>>>> For the outstanding work you mention, since you say it is about >>>>>> exec_queue, I assume again it will not work purely with buffers? If >>>>>> so >>>>>> it probably won't be useful for minigbm. >>>>>> >>>>>> Also if I look at all the other minigbm backends, I see a mix of >>>>>> behaviours: >>>>>> >>>>>> * msm and vc4 appear to not concern themselves with any of this. >>>>>> >>>>>> * rockchip appears to be doing full bounce buffering via memcpy on >>>>>> CPU >>>>>> access begin/end. >>>>>> >>>>>> * i915 and amdgpu respectively use >>>>>> DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer based, >>>>>> not >>>>>> execution queue). Andgpu curiously does not do any flushing on CPU >>>>>> access end. >>>>>> >>>>>> Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU access >>>>>> begin and clflushing on end were added to fix various CTS test >>>>>> failures. >>>>>> So I guess we could also wait and see what happens there. If those or >>>>>> some will be failing with xe too then propose adding some new uapi. 
>>>>>> Or >>>>>> if manual testing will start reporting visual corruption in the UI >>>>>> elements or such. >>>>> >>>>> Indeed it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to >>>>> AMDGPU for this use-case. I'll bring that up for discussion. >>>> >>>> Thanks! >>>> >>>>>>>> 2) >>>>>>>> When a bo is unmapped, or CPU access finished, it wants to flush >>>>>>>> the >>>>>>>> CPU >>>>>>>> caches. That is /almost/ completely a CPU operation, where it >>>>>>>> just >>>>>>>> needs >>>>>>>> to either clflush or invalidate the WC buffer respectively, if >>>>>>>> not >>>>>>>> the >>>>>>>> fact that clflush can be skipped on platforms with LLC. >>>>>>>> >>>>>>>> I did not see an equivalent of an I915_PARAM_HAS_LLC in xe? Did I >>>>>>>> miss >>>>>>>> it or what it is the plan for querying this detail? >>>>>>> >>>>>>> XeKMD is generally coherent, except if UMD selects a GPU PAT index >>>>>>> with >>>>>>> limited coherency together with WB instead of WC memory. In that >>>>>>> case, >>>>>>> UMD is responsible for doing the needed CLFLUSH-ing, whereas KMD >>>>>>> only >>>>>>> ensures initial clearing of the pages is CLFLUSHED for security >>>>>>> reasons. >>>>>>> >>>>>>> I'm not 100% sure if UMD can actually select WB with limited >>>>>>> coherency >>>>>>> PAT index in the initial uAPI revision, but Matthew has received >>>>>>> requests for that so any additional input here on performance >>>>>>> implications is appreciated. >>>>>>> >>>>>>> The thinking here is otherwise that GPU PAT indices with limited >>>>>>> coherency should be used together with WC memory in the same >>>>>>> situations >>>>>>> as VRAM/LMEM is used on DGFX. >>>>>> >>>>>> Hm okay, this would be the VM BIND side of things which deals with >>>>>> GPU >>>>>> PATs. From the CPU side it is just CPU caching modes implicitly >>>>>> selected >>>>>> via DRM_XE_GEM_CPU_CACHING_WB/WC. 
>>>>>> >>>>>> The question about the analogue of I915_PARAM_HAS_LLC was AFAIU about >>>>>> a >>>>>> performance optimisation where UMD is deciding whether it is okay to >>>>>> skip issuing clflush for the mapped bo if DRM_XE_GEM_CPU_CACHING_WB >>>>>> was >>>>>> used. (If DRM_XE_GEM_CPU_CACHING_WC was used it obviously only needs >>>>>> to >>>>>> flush the write combine buffer.) >>>>>> >>>>>> Looking at what Mesa is doing it appears it is not using >>>>>> I915_PARAM_HAS_LLC but has its own device info tables. So I guess >>>>>> minigbm xe backend will have to do the same if xe will not provide >>>>>> the >>>>>> analogue query. >>>>> >>>>> IMO clflushing in this case should never be needed, (Again, similar to >>>>> AMD). Whatever renders from / to those buffers should make sure they >>>>> are clflushed before or while accessing them. How is a rendering API >>>>> made aware of these bos? Are they imported using drm prime? >>>> >>>> Yes they can be imported via drm prime. It's a quite a huge code >>>> base with a few abstractions and quite hard to figure out all >>>> possible paths. There are gbm_bo_map paths in the host and vm >>>> compositors (Exo and Sommelier), Skia library, camera pipeline, >>>> maybe more. >>>> >>>> You mean specifically on the "cpu access end" part of the story >>>> flushing would never be needed? Regardless of llc and !llc, wb or wc >>>> mappings, imported or native buffer? What mechanism would next in >>>> the pipeline (after CPU access) use to ensure the flush? >>>> >>> Currently we're not supporting any older !LLC hardware, and with >>> newer !LLC hardware, we're only allowing at least 1-way coherency >>> with dma-bufs, meaning the GPU will acquire the cache line when >>> accessing the bo. If the bo is not a dma-buf, and the GPU wants to >>> access it in non-coherent mode, we're currently enforcing WC >>> cpu-mappings. 
>>> >>> So: >>> igfx with shared LLC: Not a problem (always coherent) >>> dgfx: Not a problem (always coherent) >>> newer igfx WO shared LLC (KMD is enforcing coherent access). >>> >>> Now this may of course, as mentions, have performance implications >>> compared to previous igfx solutions, depending on how it all is >>> written, but should have similar performance characteristics as dgfx. >>> If we want to change this and relax the coherency enforcement and >>> resort to CLFLUSHes again, could UMD communicate the need for this >>> in the same way it communicates the format of the gbm_bos? >> >> I think it is probably not needed. If minigbm will know mmap is always >> WC on MTL then it can know that all it needs to do to flush is flush >> the write-combine buffer. So the logic in minigbm xe backend "cpu >> access end" would be like: >> >> xe_bo_flush(...) >> { >> if (dgfx || (igfx && !llc)) >> __builtin_ia32_sfence(); >> } >> >> Does that sound right to you? >> >> So far on the i915 backend this explicit sfence was not even there in >> the mmap_wc case, which I guess possibly works because something is >> likely to trigger the flush implicitly inside the complexities of the >> call stack anyway. > > IIRC WC buffers are flushed also on uncached register writes (does that > include updating the ring tail perhapsl?). If next in chain is the GPU I guess so. It would indeed be a typical use case but for correctness it probably needs a sfence since AFAIU write-combine buffer is per CPU core. > It might also be that these gbm_bos are created WB and the igfx PAT is > set to 1-way coherency. What code creates those gbm_bos? Minigbm on behalf of various callers. It is based on the allocation flags such as will it be for scanout, protected content, modifiers and such. When scanout is request we will set DRM_XE_GEM_CREATE_FLAG_SCANOUT. Which would mean scanount bos will be WC and the rest WB-1-way when mmaped. 
> I figure for dgfx at least we don't want any intel platform-specific
> code in the stack if possible....

True, would have to be platform specific.

>> The only opens are 1) how to determine if LLC is present, ie. would xe
>> provide a query or minigbm would implement device id tables like Mesa,
>> and 2) how to implement the "cpu access begin" side of things, ie.
>> making sure all GPU writes are CPU visible before starting to use the
>> mmap-ed bo.
>
> 2) Yes, AR on me.

Thanks!

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 13+ messages in thread
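If xe never grows an `I915_PARAM_HAS_LLC` analogue, the Mesa-style fallback mentioned above would look roughly like this. The PCI IDs in the table are placeholders for illustration, not real device IDs, and the default-coherent fallback is an assumption matching xe's general stance:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device-info table, in the spirit of Mesa's
 * intel_device_info tables. Entries are placeholders. */
struct xe_device_info {
	uint16_t pci_id;
	bool has_llc;
};

static const struct xe_device_info info_table[] = {
	{ 0x1111, true },	/* placeholder: shared-LLC igfx */
	{ 0x2222, false },	/* placeholder: non-LLC igfx */
};

bool xe_has_llc(uint16_t pci_id)
{
	for (size_t i = 0; i < sizeof(info_table) / sizeof(info_table[0]); i++)
		if (info_table[i].pci_id == pci_id)
			return info_table[i].has_llc;
	/* Unknown device: assume coherent, the safe default under xe. */
	return true;
}
```

The maintenance cost of such a table (a new entry per platform) is exactly why a kernel query would be preferable.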
* Re: [Intel-xe] LLC configurating, mmap and bo cache management questions 2023-12-13 17:50 ` Tvrtko Ursulin @ 2023-12-13 20:11 ` Thomas Hellström 2023-12-14 8:10 ` Tvrtko Ursulin 0 siblings, 1 reply; 13+ messages in thread From: Thomas Hellström @ 2023-12-13 20:11 UTC (permalink / raw) To: Tvrtko Ursulin, Hellstrom, Thomas, intel-xe@lists.freedesktop.org Cc: Auld, Matthew, Santa, Carlos, Vivi, Rodrigo On 12/13/23 18:50, Tvrtko Ursulin wrote: > > On 13/12/2023 17:27, Thomas Hellström wrote: >> >> On 12/13/23 12:55, Tvrtko Ursulin wrote: >>> >>> On 07/12/2023 12:01, Thomas Hellström wrote: >>>> >>>> On 12/7/23 12:11, Tvrtko Ursulin wrote: >>>>> >>>>> On 06/12/2023 11:46, Hellstrom, Thomas wrote: >>>>>> Hi, Tvrtko. >>>>>> >>>>>> On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote: >>>>>>> >>>>>>> Hi Thomas, >>>>>>> >>>>>>> On 06/12/2023 08:26, Hellstrom, Thomas wrote: >>>>>>>> On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote: >>>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> We are working on adding xe support to ChromeOS minigbm and have >>>>>>>>> a >>>>>>>>> couple questions. >>>>>>>>> >>>>>>>>> If I follow things correctly with xe mmap caching mode is fixed >>>>>>>>> to >>>>>>>>> object caching modes set at bo create. For framebuffers it will >>>>>>>>> be WC >>>>>>>>> and for the rest userspace can choose WB or WC via >>>>>>>>> drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot >>>>>>>>> be >>>>>>>>> used >>>>>>>>> at all.) >>>>>>>>> >>>>>>>>> AFAICT minigbm basically cares about two transition points. Lets >>>>>>>>> call >>>>>>>>> them CPU access begin and end. >>>>>>>>> >>>>>>>>> 1) >>>>>>>>> When a bo is mmapped it wants to invalidate the cache, which >>>>>>>>> looks to >>>>>>>>> be >>>>>>>>> about making sure all GPU writes have landed to the backing >>>>>>>>> store. In >>>>>>>>> the i915 world that translates to the set_domain ioctl. 
>>>>>>>>> >>>>>>>>> What is the uapi for this with xe, or it is somehow guaranteed to >>>>>>>>> not >>>>>>>>> be >>>>>>>>> needed? >>>>>>>> >>>>>>>> Signalling a user-fence or dma-fence obtained as an out-fence from >>>>>>>> an >>>>>>>> exec call will guarantee GPU caches are flushed. Currently I don't >>>>>>>> think there is anything like gem wait in the uAPI, although >>>>>>>> Matt is >>>>>>>> just about to add functionality to wait on all outstanding work on >>>>>>>> an >>>>>>>> exec_queue. >>>>>>> >>>>>>> Problem I see is there are no execs or therefore fences in the >>>>>>> minigbm >>>>>>> ABI. It's just buffers, created or imported, CPU access and some >>>>>>> other >>>>>>> stuff. >>>>>>> >>>>>>> And it is quite extensively used in the OS so I assume it has to >>>>>>> work >>>>>>> (I >>>>>>> mean the invalidation/flushing was not put in there for no reason), >>>>>>> in >>>>>>> other words, where the i915 backend today does >>>>>>> DRM_I915_GEM_SET_DOMAIN >>>>>>> on "cpu access begin", which is buffer based, I am not clear how to >>>>>>> implement that with xe. >>>>>>> >>>>>>> For the outstanding work you mention, since you say it is about >>>>>>> exec_queue, I assume again it will not work purely with buffers? If >>>>>>> so >>>>>>> it probably won't be useful for minigbm. >>>>>>> >>>>>>> Also if I look at all the other minigbm backends, I see a mix of >>>>>>> behaviours: >>>>>>> >>>>>>> * msm and vc4 appear to not concern themselves with any of this. >>>>>>> >>>>>>> * rockchip appears to be doing full bounce buffering via >>>>>>> memcpy on >>>>>>> CPU >>>>>>> access begin/end. >>>>>>> >>>>>>> * i915 and amdgpu respectively use >>>>>>> DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer >>>>>>> based, >>>>>>> not >>>>>>> execution queue). Andgpu curiously does not do any flushing on CPU >>>>>>> access end. 
>>>>>>> >>>>>>> Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU >>>>>>> access >>>>>>> begin and clflushing on end were added to fix various CTS test >>>>>>> failures. >>>>>>> So I guess we could also wait and see what happens there. If >>>>>>> those or >>>>>>> some will be failing with xe too then propose adding some new uapi. >>>>>>> Or >>>>>>> if manual testing will start reporting visual corruption in the UI >>>>>>> elements or such. >>>>>> >>>>>> Indeed it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to >>>>>> AMDGPU for this use-case. I'll bring that up for discussion. >>>>> >>>>> Thanks! >>>>> >>>>>>>>> 2) >>>>>>>>> When a bo is unmapped, or CPU access finished, it wants to flush >>>>>>>>> the >>>>>>>>> CPU >>>>>>>>> caches. That is /almost/ completely a CPU operation, where it >>>>>>>>> just >>>>>>>>> needs >>>>>>>>> to either clflush or invalidate the WC buffer respectively, if >>>>>>>>> not >>>>>>>>> the >>>>>>>>> fact that clflush can be skipped on platforms with LLC. >>>>>>>>> >>>>>>>>> I did not see an equivalent of an I915_PARAM_HAS_LLC in xe? Did I >>>>>>>>> miss >>>>>>>>> it or what it is the plan for querying this detail? >>>>>>>> >>>>>>>> XeKMD is generally coherent, except if UMD selects a GPU PAT index >>>>>>>> with >>>>>>>> limited coherency together with WB instead of WC memory. In that >>>>>>>> case, >>>>>>>> UMD is responsible for doing the needed CLFLUSH-ing, whereas KMD >>>>>>>> only >>>>>>>> ensures initial clearing of the pages is CLFLUSHED for security >>>>>>>> reasons. >>>>>>>> >>>>>>>> I'm not 100% sure if UMD can actually select WB with limited >>>>>>>> coherency >>>>>>>> PAT index in the initial uAPI revision, but Matthew has received >>>>>>>> requests for that so any additional input here on performance >>>>>>>> implications is appreciated. 
>>>>>>>> >>>>>>>> The thinking here is otherwise that GPU PAT indices with limited >>>>>>>> coherency should be used together with WC memory in the same >>>>>>>> situations >>>>>>>> as VRAM/LMEM is used on DGFX. >>>>>>> >>>>>>> Hm okay, this would be the VM BIND side of things which deals with >>>>>>> GPU >>>>>>> PATs. From the CPU side it is just CPU caching modes implicitly >>>>>>> selected >>>>>>> via DRM_XE_GEM_CPU_CACHING_WB/WC. >>>>>>> >>>>>>> The question about the analogue of I915_PARAM_HAS_LLC was AFAIU >>>>>>> about >>>>>>> a >>>>>>> performance optimisation where UMD is deciding whether it is >>>>>>> okay to >>>>>>> skip issuing clflush for the mapped bo if DRM_XE_GEM_CPU_CACHING_WB >>>>>>> was >>>>>>> used. (If DRM_XE_GEM_CPU_CACHING_WC was used it obviously only >>>>>>> needs >>>>>>> to >>>>>>> flush the write combine buffer.) >>>>>>> >>>>>>> Looking at what Mesa is doing it appears it is not using >>>>>>> I915_PARAM_HAS_LLC but has its own device info tables. So I guess >>>>>>> minigbm xe backend will have to do the same if xe will not provide >>>>>>> the >>>>>>> analogue query. >>>>>> >>>>>> IMO clflushing in this case should never be needed, (Again, >>>>>> similar to >>>>>> AMD). Whatever renders from / to those buffers should make sure they >>>>>> are clflushed before or while accessing them. How is a rendering API >>>>>> made aware of these bos? Are they imported using drm prime? >>>>> >>>>> Yes they can be imported via drm prime. It's a quite a huge code >>>>> base with a few abstractions and quite hard to figure out all >>>>> possible paths. There are gbm_bo_map paths in the host and vm >>>>> compositors (Exo and Sommelier), Skia library, camera pipeline, >>>>> maybe more. >>>>> >>>>> You mean specifically on the "cpu access end" part of the story >>>>> flushing would never be needed? Regardless of llc and !llc, wb or >>>>> wc mappings, imported or native buffer? 
What mechanism would next >>>>> in the pipeline (after CPU access) use to ensure the flush? >>>>> >>>> Currently we're not supporting any older !LLC hardware, and with >>>> newer !LLC hardware, we're only allowing at least 1-way coherency >>>> with dma-bufs, meaning the GPU will acquire the cache line when >>>> accessing the bo. If the bo is not a dma-buf, and the GPU wants to >>>> access it in non-coherent mode, we're currently enforcing WC >>>> cpu-mappings. >>>> >>>> So: >>>> igfx with shared LLC: Not a problem (always coherent) >>>> dgfx: Not a problem (always coherent) >>>> newer igfx WO shared LLC (KMD is enforcing coherent access). >>>> >>>> Now this may of course, as mentions, have performance implications >>>> compared to previous igfx solutions, depending on how it all is >>>> written, but should have similar performance characteristics as >>>> dgfx. If we want to change this and relax the coherency enforcement >>>> and resort to CLFLUSHes again, could UMD communicate the need for >>>> this in the same way it communicates the format of the gbm_bos? >>> >>> I think it is probably not needed. If minigbm will know mmap is >>> always WC on MTL then it can know that all it needs to do to flush >>> is flush the write-combine buffer. So the logic in minigbm xe >>> backend "cpu access end" would be like: >>> >>> xe_bo_flush(...) >>> { >>> if (dgfx || (igfx && !llc)) >>> __builtin_ia32_sfence(); >>> } >>> >>> Does that sound right to you? >>> >>> So far on the i915 backend this explicit sfence was not even there >>> in the mmap_wc case, which I guess possibly works because something >>> is likely to trigger the flush implicitly inside the complexities of >>> the call stack anyway. >> >> IIRC WC buffers are flushed also on uncached register writes (does >> that include updating the ring tail perhapsl?). > > If next in chain is the GPU I guess so. 
It would indeed be a typical > use case but for correctness it probably needs an sfence since AFAIU > write-combine buffer is per CPU core. > >> It might also be that these gbm_bos are created WB and the igfx PAT >> is set to 1-way coherency. What code creates those gbm_bos? > > Minigbm on behalf of various callers. It is based on the allocation > flags such as will it be for scanout, protected content, modifiers and > such. > > When scanout is requested we will set DRM_XE_GEM_CREATE_FLAG_SCANOUT. > Which would mean scanout bos will be WC and the rest WB-1-way when > mmapped. > >> I figure for dgfx at least we don't want any intel platform-specific >> code in the stack if possible.... > > True, would have to be platform specific. So if we unconditionally add an sfence, I wouldn't expect the overhead to be significant even if it's not strictly needed? Or have the gbm_bo_create() forward to gbm_bo_map() whether the bo was created WC? /Thomas ^ permalink raw reply [flat|nested] 13+ messages in thread
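[Editor's note: a minimal C sketch of the "CPU access end" helper discussed in the message above. xe_bo_flush() and its dgfx/igfx/llc inputs are hypothetical minigbm-side names, not existing xe uAPI; the decision predicate is split into its own function so it can be exercised without hardware.]

```c
/* Sketch of the flush-on-unmap logic from the thread. Hypothetical
 * names; only the dgfx || (igfx && !llc) predicate comes from the
 * discussion itself. */
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
/* Drain the per-core write-combine buffer. */
static void wc_buffer_flush(void) { _mm_sfence(); }
#else
static void wc_buffer_flush(void) { /* non-x86: out of scope for this sketch */ }
#endif

/* An sfence is needed whenever the mapping is WC: always on dgfx, and
 * on igfx without a shared LLC. With a shared LLC the mapping is WB
 * and coherent, so no CPU-side flush is required on access end. */
static bool xe_bo_needs_sfence(bool dgfx, bool igfx, bool llc)
{
	return dgfx || (igfx && !llc);
}

static void xe_bo_flush(bool dgfx, bool igfx, bool llc)
{
	if (xe_bo_needs_sfence(dgfx, igfx, llc))
		wc_buffer_flush();
}
```

[Doing the sfence unconditionally, as suggested later in the thread, just collapses the predicate to true; the fence is cheap relative to the surrounding map/unmap work.]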
* Re: [Intel-xe] LLC configurating, mmap and bo cache management questions 2023-12-13 20:11 ` Thomas Hellström @ 2023-12-14 8:10 ` Tvrtko Ursulin 2023-12-14 10:52 ` Thomas Hellström 0 siblings, 1 reply; 13+ messages in thread From: Tvrtko Ursulin @ 2023-12-14 8:10 UTC (permalink / raw) To: Thomas Hellström, Hellstrom, Thomas, intel-xe@lists.freedesktop.org Cc: Auld, Matthew, Santa, Carlos, Vivi, Rodrigo On 13/12/2023 20:11, Thomas Hellström wrote: > > On 12/13/23 18:50, Tvrtko Ursulin wrote: >> >> On 13/12/2023 17:27, Thomas Hellström wrote: >>> >>> On 12/13/23 12:55, Tvrtko Ursulin wrote: >>>> >>>> On 07/12/2023 12:01, Thomas Hellström wrote: >>>>> >>>>> On 12/7/23 12:11, Tvrtko Ursulin wrote: >>>>>> >>>>>> On 06/12/2023 11:46, Hellstrom, Thomas wrote: >>>>>>> Hi, Tvrtko. >>>>>>> >>>>>>> On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote: >>>>>>>> >>>>>>>> Hi Thomas, >>>>>>>> >>>>>>>> On 06/12/2023 08:26, Hellstrom, Thomas wrote: >>>>>>>>> On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote: >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> We are working on adding xe support to ChromeOS minigbm and have >>>>>>>>>> a >>>>>>>>>> couple questions. >>>>>>>>>> >>>>>>>>>> If I follow things correctly with xe mmap caching mode is fixed >>>>>>>>>> to >>>>>>>>>> object caching modes set at bo create. For framebuffers it will >>>>>>>>>> be WC >>>>>>>>>> and for the rest userspace can choose WB or WC via >>>>>>>>>> drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot >>>>>>>>>> be >>>>>>>>>> used >>>>>>>>>> at all.) >>>>>>>>>> >>>>>>>>>> AFAICT minigbm basically cares about two transition points. Lets >>>>>>>>>> call >>>>>>>>>> them CPU access begin and end. >>>>>>>>>> >>>>>>>>>> 1) >>>>>>>>>> When a bo is mmapped it wants to invalidate the cache, which >>>>>>>>>> looks to >>>>>>>>>> be >>>>>>>>>> about making sure all GPU writes have landed to the backing >>>>>>>>>> store. In >>>>>>>>>> the i915 world that translates to the set_domain ioctl. 
>>>>>>>>>> >>>>>>>>>> What is the uapi for this with xe, or it is somehow guaranteed to >>>>>>>>>> not >>>>>>>>>> be >>>>>>>>>> needed? >>>>>>>>> >>>>>>>>> Signalling a user-fence or dma-fence obtained as an out-fence from >>>>>>>>> an >>>>>>>>> exec call will guarantee GPU caches are flushed. Currently I don't >>>>>>>>> think there is anything like gem wait in the uAPI, although >>>>>>>>> Matt is >>>>>>>>> just about to add functionality to wait on all outstanding work on >>>>>>>>> an >>>>>>>>> exec_queue. >>>>>>>> >>>>>>>> Problem I see is there are no execs or therefore fences in the >>>>>>>> minigbm >>>>>>>> ABI. It's just buffers, created or imported, CPU access and some >>>>>>>> other >>>>>>>> stuff. >>>>>>>> >>>>>>>> And it is quite extensively used in the OS so I assume it has to >>>>>>>> work >>>>>>>> (I >>>>>>>> mean the invalidation/flushing was not put in there for no reason), >>>>>>>> in >>>>>>>> other words, where the i915 backend today does >>>>>>>> DRM_I915_GEM_SET_DOMAIN >>>>>>>> on "cpu access begin", which is buffer based, I am not clear how to >>>>>>>> implement that with xe. >>>>>>>> >>>>>>>> For the outstanding work you mention, since you say it is about >>>>>>>> exec_queue, I assume again it will not work purely with buffers? If >>>>>>>> so >>>>>>>> it probably won't be useful for minigbm. >>>>>>>> >>>>>>>> Also if I look at all the other minigbm backends, I see a mix of >>>>>>>> behaviours: >>>>>>>> >>>>>>>> * msm and vc4 appear to not concern themselves with any of this. >>>>>>>> >>>>>>>> * rockchip appears to be doing full bounce buffering via >>>>>>>> memcpy on >>>>>>>> CPU >>>>>>>> access begin/end. >>>>>>>> >>>>>>>> * i915 and amdgpu respectively use >>>>>>>> DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer >>>>>>>> based, >>>>>>>> not >>>>>>>> execution queue). Andgpu curiously does not do any flushing on CPU >>>>>>>> access end. 
>>>>>>>> >>>>>>>> Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU >>>>>>>> access >>>>>>>> begin and clflushing on end were added to fix various CTS test >>>>>>>> failures. >>>>>>>> So I guess we could also wait and see what happens there. If >>>>>>>> those or >>>>>>>> some will be failing with xe too then propose adding some new uapi. >>>>>>>> Or >>>>>>>> if manual testing will start reporting visual corruption in the UI >>>>>>>> elements or such. >>>>>>> >>>>>>> Indeed it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to >>>>>>> AMDGPU for this use-case. I'll bring that up for discussion. >>>>>> >>>>>> Thanks! >>>>>> >>>>>>>>>> 2) >>>>>>>>>> When a bo is unmapped, or CPU access finished, it wants to flush >>>>>>>>>> the >>>>>>>>>> CPU >>>>>>>>>> caches. That is /almost/ completely a CPU operation, where it >>>>>>>>>> just >>>>>>>>>> needs >>>>>>>>>> to either clflush or invalidate the WC buffer respectively, if >>>>>>>>>> not >>>>>>>>>> the >>>>>>>>>> fact that clflush can be skipped on platforms with LLC. >>>>>>>>>> >>>>>>>>>> I did not see an equivalent of an I915_PARAM_HAS_LLC in xe? Did I >>>>>>>>>> miss >>>>>>>>>> it or what it is the plan for querying this detail? >>>>>>>>> >>>>>>>>> XeKMD is generally coherent, except if UMD selects a GPU PAT index >>>>>>>>> with >>>>>>>>> limited coherency together with WB instead of WC memory. In that >>>>>>>>> case, >>>>>>>>> UMD is responsible for doing the needed CLFLUSH-ing, whereas KMD >>>>>>>>> only >>>>>>>>> ensures initial clearing of the pages is CLFLUSHED for security >>>>>>>>> reasons. >>>>>>>>> >>>>>>>>> I'm not 100% sure if UMD can actually select WB with limited >>>>>>>>> coherency >>>>>>>>> PAT index in the initial uAPI revision, but Matthew has received >>>>>>>>> requests for that so any additional input here on performance >>>>>>>>> implications is appreciated. 
>>>>>>>>> >>>>>>>>> The thinking here is otherwise that GPU PAT indices with limited >>>>>>>>> coherency should be used together with WC memory in the same >>>>>>>>> situations >>>>>>>>> as VRAM/LMEM is used on DGFX. >>>>>>>> >>>>>>>> Hm okay, this would be the VM BIND side of things which deals with >>>>>>>> GPU >>>>>>>> PATs. From the CPU side it is just CPU caching modes implicitly >>>>>>>> selected >>>>>>>> via DRM_XE_GEM_CPU_CACHING_WB/WC. >>>>>>>> >>>>>>>> The question about the analogue of I915_PARAM_HAS_LLC was AFAIU >>>>>>>> about >>>>>>>> a >>>>>>>> performance optimisation where UMD is deciding whether it is >>>>>>>> okay to >>>>>>>> skip issuing clflush for the mapped bo if DRM_XE_GEM_CPU_CACHING_WB >>>>>>>> was >>>>>>>> used. (If DRM_XE_GEM_CPU_CACHING_WC was used it obviously only >>>>>>>> needs >>>>>>>> to >>>>>>>> flush the write combine buffer.) >>>>>>>> >>>>>>>> Looking at what Mesa is doing it appears it is not using >>>>>>>> I915_PARAM_HAS_LLC but has its own device info tables. So I guess >>>>>>>> minigbm xe backend will have to do the same if xe will not provide >>>>>>>> the >>>>>>>> analogue query. >>>>>>> >>>>>>> IMO clflushing in this case should never be needed, (Again, >>>>>>> similar to >>>>>>> AMD). Whatever renders from / to those buffers should make sure they >>>>>>> are clflushed before or while accessing them. How is a rendering API >>>>>>> made aware of these bos? Are they imported using drm prime? >>>>>> >>>>>> Yes they can be imported via drm prime. It's a quite a huge code >>>>>> base with a few abstractions and quite hard to figure out all >>>>>> possible paths. There are gbm_bo_map paths in the host and vm >>>>>> compositors (Exo and Sommelier), Skia library, camera pipeline, >>>>>> maybe more. >>>>>> >>>>>> You mean specifically on the "cpu access end" part of the story >>>>>> flushing would never be needed? Regardless of llc and !llc, wb or >>>>>> wc mappings, imported or native buffer? 
What mechanism would next >>>>>> in the pipeline (after CPU access) use to ensure the flush? >>>>>> >>>>> Currently we're not supporting any older !LLC hardware, and with >>>>> newer !LLC hardware, we're only allowing at least 1-way coherency >>>>> with dma-bufs, meaning the GPU will acquire the cache line when >>>>> accessing the bo. If the bo is not a dma-buf, and the GPU wants to >>>>> access it in non-coherent mode, we're currently enforcing WC >>>>> cpu-mappings. >>>>> >>>>> So: >>>>> igfx with shared LLC: Not a problem (always coherent) >>>>> dgfx: Not a problem (always coherent) >>>>> newer igfx WO shared LLC (KMD is enforcing coherent access). >>>>> >>>>> Now this may of course, as mentions, have performance implications >>>>> compared to previous igfx solutions, depending on how it all is >>>>> written, but should have similar performance characteristics as >>>>> dgfx. If we want to change this and relax the coherency enforcement >>>>> and resort to CLFLUSHes again, could UMD communicate the need for >>>>> this in the same way it communicates the format of the gbm_bos? >>>> >>>> I think it is probably not needed. If minigbm will know mmap is >>>> always WC on MTL then it can know that all it needs to do to flush >>>> is flush the write-combine buffer. So the logic in minigbm xe >>>> backend "cpu access end" would be like: >>>> >>>> xe_bo_flush(...) >>>> { >>>> if (dgfx || (igfx && !llc)) >>>> __builtin_ia32_sfence(); >>>> } >>>> >>>> Does that sound right to you? >>>> >>>> So far on the i915 backend this explicit sfence was not even there >>>> in the mmap_wc case, which I guess possibly works because something >>>> is likely to trigger the flush implicitly inside the complexities of >>>> the call stack anyway. >>> >>> IIRC WC buffers are flushed also on uncached register writes (does >>> that include updating the ring tail perhapsl?). >> >> If next in chain is the GPU I guess so. 
It would indeed be a typical >> use case but for correctness it probably needs an sfence since AFAIU >> write-combine buffer is per CPU core. >> >>> It might also be that these gbm_bos are created WB and the igfx PAT >>> is set to 1-way coherency. What code creates those gbm_bos? >> >> Minigbm on behalf of various callers. It is based on the allocation >> flags such as will it be for scanout, protected content, modifiers and >> such. >> >> When scanout is requested we will set DRM_XE_GEM_CREATE_FLAG_SCANOUT. >> Which would mean scanout bos will be WC and the rest WB-1-way when >> mmapped. >> >>> I figure for dgfx at least we don't want any intel platform-specific >>> code in the stack if possible.... >> >> True, would have to be platform specific. > > So if we unconditionally add an sfence, I wouldn't expect the overhead > to be significant even if it's not strictly needed? Or have the > gbm_bo_create() forward to gbm_bo_map() whether the bo was created WC? Unconditional is most probably fine I agree. Otherwise the map does have the information of how the object was created so that part wouldn't be a problem (https://github.com/intel/minigbm/blob/10d9a651375efa3592ab95431783984c28a30ad4/i915.c#L529). It is just if we wanted the LLC presence it would be extra work. Regards, Tvrtko ^ permalink raw reply [flat|nested] 13+ messages in thread
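[Editor's note: for comparison with whatever xe ends up exposing, this is a sketch of what the i915 minigbm backend's "CPU access begin" boils down to: a DRM_IOCTL_I915_GEM_SET_DOMAIN call moving the bo to the CPU domain. The struct and constant below are mirrored from the i915 uAPI (drm/i915_drm.h) to keep the sketch self-contained; a real build should include that header. Only the argument-preparation helper is shown.]

```c
/* Sketch: preparing the set_domain argument for "CPU access begin".
 * Moving the bo to the CPU read (and optionally write) domain waits
 * for outstanding GPU writes and, on !LLC parts, invalidates stale
 * CPU cachelines before the CPU touches the mapping. */
#include <stdbool.h>
#include <stdint.h>

#define I915_GEM_DOMAIN_CPU 0x00000001u /* mirrors drm/i915_drm.h */

/* Mirrors struct drm_i915_gem_set_domain from the i915 uAPI. */
struct i915_set_domain_arg {
	uint32_t handle;
	uint32_t read_domains;
	uint32_t write_domain;
};

static struct i915_set_domain_arg
cpu_access_begin_arg(uint32_t handle, bool writing)
{
	struct i915_set_domain_arg arg = {
		.handle = handle,
		.read_domains = I915_GEM_DOMAIN_CPU,
		.write_domain = writing ? I915_GEM_DOMAIN_CPU : 0,
	};
	return arg;
}
```

[The actual call would then be ioctl(drm_fd, DRM_IOCTL_I915_GEM_SET_DOMAIN, &arg); there is no xe equivalent yet, which is the gap this thread is about.]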
* Re: [Intel-xe] LLC configurating, mmap and bo cache management questions 2023-12-14 8:10 ` Tvrtko Ursulin @ 2023-12-14 10:52 ` Thomas Hellström 2023-12-14 12:34 ` Tvrtko Ursulin 0 siblings, 1 reply; 13+ messages in thread From: Thomas Hellström @ 2023-12-14 10:52 UTC (permalink / raw) To: Tvrtko Ursulin, Hellstrom, Thomas, intel-xe@lists.freedesktop.org Cc: Auld, Matthew, Santa, Carlos, Vivi, Rodrigo On 12/14/23 09:10, Tvrtko Ursulin wrote: > > On 13/12/2023 20:11, Thomas Hellström wrote: >> >> On 12/13/23 18:50, Tvrtko Ursulin wrote: >>> >>> On 13/12/2023 17:27, Thomas Hellström wrote: >>>> >>>> On 12/13/23 12:55, Tvrtko Ursulin wrote: >>>>> >>>>> On 07/12/2023 12:01, Thomas Hellström wrote: >>>>>> >>>>>> On 12/7/23 12:11, Tvrtko Ursulin wrote: >>>>>>> >>>>>>> On 06/12/2023 11:46, Hellstrom, Thomas wrote: >>>>>>>> Hi, Tvrtko. >>>>>>>> >>>>>>>> On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote: >>>>>>>>> >>>>>>>>> Hi Thomas, >>>>>>>>> >>>>>>>>> On 06/12/2023 08:26, Hellstrom, Thomas wrote: >>>>>>>>>> On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote: >>>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> We are working on adding xe support to ChromeOS minigbm and >>>>>>>>>>> have >>>>>>>>>>> a >>>>>>>>>>> couple questions. >>>>>>>>>>> >>>>>>>>>>> If I follow things correctly with xe mmap caching mode is fixed >>>>>>>>>>> to >>>>>>>>>>> object caching modes set at bo create. For framebuffers it will >>>>>>>>>>> be WC >>>>>>>>>>> and for the rest userspace can choose WB or WC via >>>>>>>>>>> drm_xe_gem_create->cpu_caching. (Unless discrete, when WB >>>>>>>>>>> cannot >>>>>>>>>>> be >>>>>>>>>>> used >>>>>>>>>>> at all.) >>>>>>>>>>> >>>>>>>>>>> AFAICT minigbm basically cares about two transition points. >>>>>>>>>>> Lets >>>>>>>>>>> call >>>>>>>>>>> them CPU access begin and end. 
>>>>>>>>>>> >>>>>>>>>>> 1) >>>>>>>>>>> When a bo is mmapped it wants to invalidate the cache, which >>>>>>>>>>> looks to >>>>>>>>>>> be >>>>>>>>>>> about making sure all GPU writes have landed to the backing >>>>>>>>>>> store. In >>>>>>>>>>> the i915 world that translates to the set_domain ioctl. >>>>>>>>>>> >>>>>>>>>>> What is the uapi for this with xe, or it is somehow >>>>>>>>>>> guaranteed to >>>>>>>>>>> not >>>>>>>>>>> be >>>>>>>>>>> needed? >>>>>>>>>> >>>>>>>>>> Signalling a user-fence or dma-fence obtained as an out-fence >>>>>>>>>> from >>>>>>>>>> an >>>>>>>>>> exec call will guarantee GPU caches are flushed. Currently I >>>>>>>>>> don't >>>>>>>>>> think there is anything like gem wait in the uAPI, although >>>>>>>>>> Matt is >>>>>>>>>> just about to add functionality to wait on all outstanding >>>>>>>>>> work on >>>>>>>>>> an >>>>>>>>>> exec_queue. >>>>>>>>> >>>>>>>>> Problem I see is there are no execs or therefore fences in the >>>>>>>>> minigbm >>>>>>>>> ABI. It's just buffers, created or imported, CPU access and some >>>>>>>>> other >>>>>>>>> stuff. >>>>>>>>> >>>>>>>>> And it is quite extensively used in the OS so I assume it has >>>>>>>>> to work >>>>>>>>> (I >>>>>>>>> mean the invalidation/flushing was not put in there for no >>>>>>>>> reason), >>>>>>>>> in >>>>>>>>> other words, where the i915 backend today does >>>>>>>>> DRM_I915_GEM_SET_DOMAIN >>>>>>>>> on "cpu access begin", which is buffer based, I am not clear >>>>>>>>> how to >>>>>>>>> implement that with xe. >>>>>>>>> >>>>>>>>> For the outstanding work you mention, since you say it is about >>>>>>>>> exec_queue, I assume again it will not work purely with >>>>>>>>> buffers? If >>>>>>>>> so >>>>>>>>> it probably won't be useful for minigbm. >>>>>>>>> >>>>>>>>> Also if I look at all the other minigbm backends, I see a mix of >>>>>>>>> behaviours: >>>>>>>>> >>>>>>>>> * msm and vc4 appear to not concern themselves with any of >>>>>>>>> this. 
>>>>>>>>> >>>>>>>>> * rockchip appears to be doing full bounce buffering via >>>>>>>>> memcpy on >>>>>>>>> CPU >>>>>>>>> access begin/end. >>>>>>>>> >>>>>>>>> * i915 and amdgpu respectively use >>>>>>>>> DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer >>>>>>>>> based, >>>>>>>>> not >>>>>>>>> execution queue). Andgpu curiously does not do any flushing on >>>>>>>>> CPU >>>>>>>>> access end. >>>>>>>>> >>>>>>>>> Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU >>>>>>>>> access >>>>>>>>> begin and clflushing on end were added to fix various CTS test >>>>>>>>> failures. >>>>>>>>> So I guess we could also wait and see what happens there. If >>>>>>>>> those or >>>>>>>>> some will be failing with xe too then propose adding some new >>>>>>>>> uapi. >>>>>>>>> Or >>>>>>>>> if manual testing will start reporting visual corruption in >>>>>>>>> the UI >>>>>>>>> elements or such. >>>>>>>> >>>>>>>> Indeed it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to >>>>>>>> AMDGPU for this use-case. I'll bring that up for discussion. >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>>>>>> 2) >>>>>>>>>>> When a bo is unmapped, or CPU access finished, it wants to >>>>>>>>>>> flush >>>>>>>>>>> the >>>>>>>>>>> CPU >>>>>>>>>>> caches. That is /almost/ completely a CPU operation, where it >>>>>>>>>>> just >>>>>>>>>>> needs >>>>>>>>>>> to either clflush or invalidate the WC buffer respectively, if >>>>>>>>>>> not >>>>>>>>>>> the >>>>>>>>>>> fact that clflush can be skipped on platforms with LLC. >>>>>>>>>>> >>>>>>>>>>> I did not see an equivalent of an I915_PARAM_HAS_LLC in xe? >>>>>>>>>>> Did I >>>>>>>>>>> miss >>>>>>>>>>> it or what it is the plan for querying this detail? >>>>>>>>>> >>>>>>>>>> XeKMD is generally coherent, except if UMD selects a GPU PAT >>>>>>>>>> index >>>>>>>>>> with >>>>>>>>>> limited coherency together with WB instead of WC memory. 
In that >>>>>>>>>> case, >>>>>>>>>> UMD is responsible for doing the needed CLFLUSH-ing, whereas KMD >>>>>>>>>> only >>>>>>>>>> ensures initial clearing of the pages is CLFLUSHED for security >>>>>>>>>> reasons. >>>>>>>>>> >>>>>>>>>> I'm not 100% sure if UMD can actually select WB with limited >>>>>>>>>> coherency >>>>>>>>>> PAT index in the initial uAPI revision, but Matthew has received >>>>>>>>>> requests for that so any additional input here on performance >>>>>>>>>> implications is appreciated. >>>>>>>>>> >>>>>>>>>> The thinking here is otherwise that GPU PAT indices with limited >>>>>>>>>> coherency should be used together with WC memory in the same >>>>>>>>>> situations >>>>>>>>>> as VRAM/LMEM is used on DGFX. >>>>>>>>> >>>>>>>>> Hm okay, this would be the VM BIND side of things which deals >>>>>>>>> with >>>>>>>>> GPU >>>>>>>>> PATs. From the CPU side it is just CPU caching modes implicitly >>>>>>>>> selected >>>>>>>>> via DRM_XE_GEM_CPU_CACHING_WB/WC. >>>>>>>>> >>>>>>>>> The question about the analogue of I915_PARAM_HAS_LLC was >>>>>>>>> AFAIU about >>>>>>>>> a >>>>>>>>> performance optimisation where UMD is deciding whether it is >>>>>>>>> okay to >>>>>>>>> skip issuing clflush for the mapped bo if >>>>>>>>> DRM_XE_GEM_CPU_CACHING_WB >>>>>>>>> was >>>>>>>>> used. (If DRM_XE_GEM_CPU_CACHING_WC was used it obviously only >>>>>>>>> needs >>>>>>>>> to >>>>>>>>> flush the write combine buffer.) >>>>>>>>> >>>>>>>>> Looking at what Mesa is doing it appears it is not using >>>>>>>>> I915_PARAM_HAS_LLC but has its own device info tables. So I guess >>>>>>>>> minigbm xe backend will have to do the same if xe will not >>>>>>>>> provide >>>>>>>>> the >>>>>>>>> analogue query. >>>>>>>> >>>>>>>> IMO clflushing in this case should never be needed, (Again, >>>>>>>> similar to >>>>>>>> AMD). Whatever renders from / to those buffers should make sure >>>>>>>> they >>>>>>>> are clflushed before or while accessing them. 
How is a >>>>>>>> rendering API >>>>>>>> made aware of these bos? Are they imported using drm prime? >>>>>>> >>>>>>> Yes they can be imported via drm prime. It's a quite a huge code >>>>>>> base with a few abstractions and quite hard to figure out all >>>>>>> possible paths. There are gbm_bo_map paths in the host and vm >>>>>>> compositors (Exo and Sommelier), Skia library, camera pipeline, >>>>>>> maybe more. >>>>>>> >>>>>>> You mean specifically on the "cpu access end" part of the story >>>>>>> flushing would never be needed? Regardless of llc and !llc, wb >>>>>>> or wc mappings, imported or native buffer? What mechanism would >>>>>>> next in the pipeline (after CPU access) use to ensure the flush? >>>>>>> >>>>>> Currently we're not supporting any older !LLC hardware, and with >>>>>> newer !LLC hardware, we're only allowing at least 1-way coherency >>>>>> with dma-bufs, meaning the GPU will acquire the cache line when >>>>>> accessing the bo. If the bo is not a dma-buf, and the GPU wants >>>>>> to access it in non-coherent mode, we're currently enforcing WC >>>>>> cpu-mappings. >>>>>> >>>>>> So: >>>>>> igfx with shared LLC: Not a problem (always coherent) >>>>>> dgfx: Not a problem (always coherent) >>>>>> newer igfx WO shared LLC (KMD is enforcing coherent access). >>>>>> >>>>>> Now this may of course, as mentions, have performance >>>>>> implications compared to previous igfx solutions, depending on >>>>>> how it all is written, but should have similar performance >>>>>> characteristics as dgfx. If we want to change this and relax the >>>>>> coherency enforcement and resort to CLFLUSHes again, could UMD >>>>>> communicate the need for this in the same way it communicates the >>>>>> format of the gbm_bos? >>>>> >>>>> I think it is probably not needed. If minigbm will know mmap is >>>>> always WC on MTL then it can know that all it needs to do to flush >>>>> is flush the write-combine buffer. 
So the logic in minigbm xe >>>>> backend "cpu access end" would be like: >>>>> >>>>> xe_bo_flush(...) >>>>> { >>>>> if (dgfx || (igfx && !llc)) >>>>> __builtin_ia32_sfence(); >>>>> } >>>>> >>>>> Does that sound right to you? >>>>> >>>>> So far on the i915 backend this explicit sfence was not even there >>>>> in the mmap_wc case, which I guess possibly works because >>>>> something is likely to trigger the flush implicitly inside the >>>>> complexities of the call stack anyway. >>>> >>>> IIRC WC buffers are flushed also on uncached register writes (does >>>> that include updating the ring tail perhapsl?). >>> >>> If next in chain is the GPU I guess so. It would indeed be a typical >>> use case but for correctness it probably needs a sfence since AFAIU >>> write-combine buffer is per CPU core. >>> >>>> It might also be that these gbm_bos are created WB and the igfx PAT >>>> is set to 1-way coherency. What code creates those gbm_bos? >>> >>> Minigbm on behalf of various callers. It is based on the allocation >>> flags such as will it be for scanout, protected content, modifiers >>> and such. >>> >>> When scanout is request we will set DRM_XE_GEM_CREATE_FLAG_SCANOUT. >>> Which would mean scanount bos will be WC and the rest WB-1-way when >>> mmaped. >>> >>>> I figure for dgfx at least we don't want any intel >>>> platform-specific code in the stack if possible.... >>> >>> True, would have to be platform specific. >> >> So if we unconditionally add an sfence, I wouldn't expect the >> overhead to be significant even if it's not strictly needed? Or have >> the gbm_bo_create() forward to gbm_bo_map() whether the bo was >> created WC? > > Unconditional is most probably fine I agree. > > Otherwise the map does have the information of how the object was > created so that part wouldn't be a problem > (https://github.com/intel/minigbm/blob/10d9a651375efa3592ab95431783984c28a30ad4/i915.c#L529). > It is just if we wanted the LLC presence it would be extra work. 
Regarding 2) There is some internal resistance to exposing a legacy gem wait. Does GBM always alias the gbm_bo as a dma-buf? Would it be possible to use the DMA_BUF_IOCTL_SYNC ioctl for this? /Thomas > > Regards, > > Tvrtko > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Intel-xe] LLC configurating, mmap and bo cache management questions 2023-12-14 10:52 ` Thomas Hellström @ 2023-12-14 12:34 ` Tvrtko Ursulin 0 siblings, 0 replies; 13+ messages in thread From: Tvrtko Ursulin @ 2023-12-14 12:34 UTC (permalink / raw) To: Thomas Hellström, Hellstrom, Thomas, intel-xe@lists.freedesktop.org Cc: Auld, Matthew, Santa, Carlos, Vivi, Rodrigo On 14/12/2023 10:52, Thomas Hellström wrote: > > On 12/14/23 09:10, Tvrtko Ursulin wrote: >> >> On 13/12/2023 20:11, Thomas Hellström wrote: >>> >>> On 12/13/23 18:50, Tvrtko Ursulin wrote: >>>> >>>> On 13/12/2023 17:27, Thomas Hellström wrote: >>>>> >>>>> On 12/13/23 12:55, Tvrtko Ursulin wrote: >>>>>> >>>>>> On 07/12/2023 12:01, Thomas Hellström wrote: >>>>>>> >>>>>>> On 12/7/23 12:11, Tvrtko Ursulin wrote: >>>>>>>> >>>>>>>> On 06/12/2023 11:46, Hellstrom, Thomas wrote: >>>>>>>>> Hi, Tvrtko. >>>>>>>>> >>>>>>>>> On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote: >>>>>>>>>> >>>>>>>>>> Hi Thomas, >>>>>>>>>> >>>>>>>>>> On 06/12/2023 08:26, Hellstrom, Thomas wrote: >>>>>>>>>>> On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> We are working on adding xe support to ChromeOS minigbm and >>>>>>>>>>>> have >>>>>>>>>>>> a >>>>>>>>>>>> couple questions. >>>>>>>>>>>> >>>>>>>>>>>> If I follow things correctly with xe mmap caching mode is fixed >>>>>>>>>>>> to >>>>>>>>>>>> object caching modes set at bo create. For framebuffers it will >>>>>>>>>>>> be WC >>>>>>>>>>>> and for the rest userspace can choose WB or WC via >>>>>>>>>>>> drm_xe_gem_create->cpu_caching. (Unless discrete, when WB >>>>>>>>>>>> cannot >>>>>>>>>>>> be >>>>>>>>>>>> used >>>>>>>>>>>> at all.) >>>>>>>>>>>> >>>>>>>>>>>> AFAICT minigbm basically cares about two transition points. >>>>>>>>>>>> Lets >>>>>>>>>>>> call >>>>>>>>>>>> them CPU access begin and end. 
>>>>>>>>>>>> >>>>>>>>>>>> 1) >>>>>>>>>>>> When a bo is mmapped it wants to invalidate the cache, which >>>>>>>>>>>> looks to >>>>>>>>>>>> be >>>>>>>>>>>> about making sure all GPU writes have landed to the backing >>>>>>>>>>>> store. In >>>>>>>>>>>> the i915 world that translates to the set_domain ioctl. >>>>>>>>>>>> >>>>>>>>>>>> What is the uapi for this with xe, or it is somehow >>>>>>>>>>>> guaranteed to >>>>>>>>>>>> not >>>>>>>>>>>> be >>>>>>>>>>>> needed? >>>>>>>>>>> >>>>>>>>>>> Signalling a user-fence or dma-fence obtained as an out-fence >>>>>>>>>>> from >>>>>>>>>>> an >>>>>>>>>>> exec call will guarantee GPU caches are flushed. Currently I >>>>>>>>>>> don't >>>>>>>>>>> think there is anything like gem wait in the uAPI, although >>>>>>>>>>> Matt is >>>>>>>>>>> just about to add functionality to wait on all outstanding >>>>>>>>>>> work on >>>>>>>>>>> an >>>>>>>>>>> exec_queue. >>>>>>>>>> >>>>>>>>>> Problem I see is there are no execs or therefore fences in the >>>>>>>>>> minigbm >>>>>>>>>> ABI. It's just buffers, created or imported, CPU access and some >>>>>>>>>> other >>>>>>>>>> stuff. >>>>>>>>>> >>>>>>>>>> And it is quite extensively used in the OS so I assume it has >>>>>>>>>> to work >>>>>>>>>> (I >>>>>>>>>> mean the invalidation/flushing was not put in there for no >>>>>>>>>> reason), >>>>>>>>>> in >>>>>>>>>> other words, where the i915 backend today does >>>>>>>>>> DRM_I915_GEM_SET_DOMAIN >>>>>>>>>> on "cpu access begin", which is buffer based, I am not clear >>>>>>>>>> how to >>>>>>>>>> implement that with xe. >>>>>>>>>> >>>>>>>>>> For the outstanding work you mention, since you say it is about >>>>>>>>>> exec_queue, I assume again it will not work purely with >>>>>>>>>> buffers? If >>>>>>>>>> so >>>>>>>>>> it probably won't be useful for minigbm. >>>>>>>>>> >>>>>>>>>> Also if I look at all the other minigbm backends, I see a mix of >>>>>>>>>> behaviours: >>>>>>>>>> >>>>>>>>>> * msm and vc4 appear to not concern themselves with any of >>>>>>>>>> this. 
>>>>>>>>>> >>>>>>>>>> * rockchip appears to be doing full bounce buffering via >>>>>>>>>> memcpy on >>>>>>>>>> CPU >>>>>>>>>> access begin/end. >>>>>>>>>> >>>>>>>>>> * i915 and amdgpu respectively use >>>>>>>>>> DRM_I915_GEM_SET_DOMAIN/DRM_AMDGPU_GEM_WAIT_IDLE (also buffer >>>>>>>>>> based, >>>>>>>>>> not >>>>>>>>>> execution queue). Andgpu curiously does not do any flushing on >>>>>>>>>> CPU >>>>>>>>>> access end. >>>>>>>>>> >>>>>>>>>> Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU >>>>>>>>>> access >>>>>>>>>> begin and clflushing on end were added to fix various CTS test >>>>>>>>>> failures. >>>>>>>>>> So I guess we could also wait and see what happens there. If >>>>>>>>>> those or >>>>>>>>>> some will be failing with xe too then propose adding some new >>>>>>>>>> uapi. >>>>>>>>>> Or >>>>>>>>>> if manual testing will start reporting visual corruption in >>>>>>>>>> the UI >>>>>>>>>> elements or such. >>>>>>>>> >>>>>>>>> Indeed it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to >>>>>>>>> AMDGPU for this use-case. I'll bring that up for discussion. >>>>>>>> >>>>>>>> Thanks! >>>>>>>> >>>>>>>>>>>> 2) >>>>>>>>>>>> When a bo is unmapped, or CPU access finished, it wants to >>>>>>>>>>>> flush >>>>>>>>>>>> the >>>>>>>>>>>> CPU >>>>>>>>>>>> caches. That is /almost/ completely a CPU operation, where it >>>>>>>>>>>> just >>>>>>>>>>>> needs >>>>>>>>>>>> to either clflush or invalidate the WC buffer respectively, if >>>>>>>>>>>> not >>>>>>>>>>>> the >>>>>>>>>>>> fact that clflush can be skipped on platforms with LLC. >>>>>>>>>>>> >>>>>>>>>>>> I did not see an equivalent of an I915_PARAM_HAS_LLC in xe? >>>>>>>>>>>> Did I >>>>>>>>>>>> miss >>>>>>>>>>>> it or what it is the plan for querying this detail? >>>>>>>>>>> >>>>>>>>>>> XeKMD is generally coherent, except if UMD selects a GPU PAT >>>>>>>>>>> index >>>>>>>>>>> with >>>>>>>>>>> limited coherency together with WB instead of WC memory. 
In that >>>>>>>>>>> case, >>>>>>>>>>> UMD is responsible for doing the needed CLFLUSH-ing, whereas KMD >>>>>>>>>>> only >>>>>>>>>>> ensures initial clearing of the pages is CLFLUSHED for security >>>>>>>>>>> reasons. >>>>>>>>>>> >>>>>>>>>>> I'm not 100% sure if UMD can actually select WB with limited >>>>>>>>>>> coherency >>>>>>>>>>> PAT index in the initial uAPI revision, but Matthew has received >>>>>>>>>>> requests for that so any additional input here on performance >>>>>>>>>>> implications is appreciated. >>>>>>>>>>> >>>>>>>>>>> The thinking here is otherwise that GPU PAT indices with limited >>>>>>>>>>> coherency should be used together with WC memory in the same >>>>>>>>>>> situations >>>>>>>>>>> as VRAM/LMEM is used on DGFX. >>>>>>>>>> >>>>>>>>>> Hm okay, this would be the VM BIND side of things which deals >>>>>>>>>> with >>>>>>>>>> GPU >>>>>>>>>> PATs. From the CPU side it is just CPU caching modes implicitly >>>>>>>>>> selected >>>>>>>>>> via DRM_XE_GEM_CPU_CACHING_WB/WC. >>>>>>>>>> >>>>>>>>>> The question about the analogue of I915_PARAM_HAS_LLC was >>>>>>>>>> AFAIU about >>>>>>>>>> a >>>>>>>>>> performance optimisation where UMD is deciding whether it is >>>>>>>>>> okay to >>>>>>>>>> skip issuing clflush for the mapped bo if >>>>>>>>>> DRM_XE_GEM_CPU_CACHING_WB >>>>>>>>>> was >>>>>>>>>> used. (If DRM_XE_GEM_CPU_CACHING_WC was used it obviously only >>>>>>>>>> needs >>>>>>>>>> to >>>>>>>>>> flush the write combine buffer.) >>>>>>>>>> >>>>>>>>>> Looking at what Mesa is doing it appears it is not using >>>>>>>>>> I915_PARAM_HAS_LLC but has its own device info tables. So I guess >>>>>>>>>> minigbm xe backend will have to do the same if xe will not >>>>>>>>>> provide >>>>>>>>>> the >>>>>>>>>> analogue query. >>>>>>>>> >>>>>>>>> IMO clflushing in this case should never be needed, (Again, >>>>>>>>> similar to >>>>>>>>> AMD). Whatever renders from / to those buffers should make sure >>>>>>>>> they >>>>>>>>> are clflushed before or while accessing them. 
How is a >>>>>>>>> rendering API >>>>>>>>> made aware of these bos? Are they imported using drm prime? >>>>>>>> >>>>>>>> Yes they can be imported via drm prime. It's a quite a huge code >>>>>>>> base with a few abstractions and quite hard to figure out all >>>>>>>> possible paths. There are gbm_bo_map paths in the host and vm >>>>>>>> compositors (Exo and Sommelier), Skia library, camera pipeline, >>>>>>>> maybe more. >>>>>>>> >>>>>>>> You mean specifically on the "cpu access end" part of the story >>>>>>>> flushing would never be needed? Regardless of llc and !llc, wb >>>>>>>> or wc mappings, imported or native buffer? What mechanism would >>>>>>>> next in the pipeline (after CPU access) use to ensure the flush? >>>>>>>> >>>>>>> Currently we're not supporting any older !LLC hardware, and with >>>>>>> newer !LLC hardware, we're only allowing at least 1-way coherency >>>>>>> with dma-bufs, meaning the GPU will acquire the cache line when >>>>>>> accessing the bo. If the bo is not a dma-buf, and the GPU wants >>>>>>> to access it in non-coherent mode, we're currently enforcing WC >>>>>>> cpu-mappings. >>>>>>> >>>>>>> So: >>>>>>> igfx with shared LLC: Not a problem (always coherent) >>>>>>> dgfx: Not a problem (always coherent) >>>>>>> newer igfx WO shared LLC (KMD is enforcing coherent access). >>>>>>> >>>>>>> Now this may of course, as mentions, have performance >>>>>>> implications compared to previous igfx solutions, depending on >>>>>>> how it all is written, but should have similar performance >>>>>>> characteristics as dgfx. If we want to change this and relax the >>>>>>> coherency enforcement and resort to CLFLUSHes again, could UMD >>>>>>> communicate the need for this in the same way it communicates the >>>>>>> format of the gbm_bos? >>>>>> >>>>>> I think it is probably not needed. If minigbm will know mmap is >>>>>> always WC on MTL then it can know that all it needs to do to flush >>>>>> is flush the write-combine buffer. 
>>>>>> So the logic in the minigbm xe backend "cpu access end" would be
>>>>>> like:
>>>>>>
>>>>>> xe_bo_flush(...)
>>>>>> {
>>>>>>         if (dgfx || (igfx && !llc))
>>>>>>                 __builtin_ia32_sfence();
>>>>>> }
>>>>>>
>>>>>> Does that sound right to you?
>>>>>>
>>>>>> So far on the i915 backend this explicit sfence was not even
>>>>>> there in the mmap_wc case, which I guess possibly works because
>>>>>> something is likely to trigger the flush implicitly inside the
>>>>>> complexities of the call stack anyway.
>>>>>
>>>>> IIRC WC buffers are flushed also on uncached register writes (does
>>>>> that include updating the ring tail, perhaps?).
>>>>
>>>> If the next in the chain is the GPU, I guess so. It would indeed be
>>>> a typical use case, but for correctness it probably needs an sfence
>>>> since AFAIU the write-combine buffer is per CPU core.
>>>>
>>>>> It might also be that these gbm_bos are created WB and the igfx
>>>>> PAT is set to 1-way coherency. What code creates those gbm_bos?
>>>>
>>>> Minigbm, on behalf of various callers. It is based on the
>>>> allocation flags, such as whether it will be for scanout, protected
>>>> content, modifiers and such.
>>>>
>>>> When scanout is requested we will set
>>>> DRM_XE_GEM_CREATE_FLAG_SCANOUT, which would mean scanout bos will
>>>> be WC and the rest WB-1-way when mmapped.
>>>>
>>>>> I figure for dgfx at least we don't want any intel
>>>>> platform-specific code in the stack if possible...
>>>>
>>>> True, it would have to be platform specific.
>>>
>>> So if we unconditionally add an sfence, I wouldn't expect the
>>> overhead to be significant even if it's not strictly needed? Or have
>>> gbm_bo_create() forward to gbm_bo_map() whether the bo was created
>>> WC?
>>
>> Unconditional is most probably fine, I agree.
>>
>> Otherwise the map does have the information of how the object was
>> created, so that part wouldn't be a problem
>> (https://github.com/intel/minigbm/blob/10d9a651375efa3592ab95431783984c28a30ad4/i915.c#L529).
>> It is just that if we wanted the LLC presence query it would be
>> extra work.
>
> Regarding 2): There is some internal resistance to exposing a legacy
> gem wait. Does GBM always alias the gbm_bo as a dma-buf? Would it be
> possible to use the DMA_BUF_IOCTL_SYNC ioctl for this?

I don't think so. It can create native buffers or import from a
dma-buf fd. In theory I suppose one could create a fd for every
created buffer and sync on that, but huh.. :)

I don't know if you can access the below, but for reference, the
commits which added the "cpu access begin" paths for i915 and amdgpu:

https://chromium.googlesource.com/chromiumos/platform/minigbm/+/ff66c801c90ddeac97e6121584daff595b85d795
https://chromium.googlesource.com/chromiumos/platform/minigbm/+/2d1877f0417e5891ad6e2795db7cf395916b15b3

You will notice they only reference the CTS failures. As it is unknown
at this point whether any real use cases are affected, as discussed
before, it could be fine to simply punt this problem for later.

Regards,

Tvrtko
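[Editor's note: for buffers that are aliased as dma-buf fds, the DMA_BUF_IOCTL_SYNC bracketing Thomas suggests above is existing kernel uAPI (linux/dma-buf.h). A minimal sketch of how a gbm backend could use it for "cpu access begin" and "cpu access end" follows; the helper name is made up for illustration, and whether every gbm_bo has such a fd is exactly the open question in the thread.]

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/*
 * Illustrative sketch, not minigbm code: bracket CPU access to a
 * dma-buf-backed bo. DMA_BUF_SYNC_START corresponds to "cpu access
 * begin" and DMA_BUF_SYNC_END to "cpu access end"; the kernel (and the
 * exporting driver) then performs whatever coherency work is needed,
 * so userspace does not have to know about LLC presence at all.
 */
static int dmabuf_cpu_sync(int dmabuf_fd, int begin, int will_write)
{
	struct dma_buf_sync sync = {
		.flags = (begin ? DMA_BUF_SYNC_START : DMA_BUF_SYNC_END) |
			 (will_write ? DMA_BUF_SYNC_RW : DMA_BUF_SYNC_READ),
	};
	int ret;

	/* The ioctl is interruptible; retry on EINTR/EAGAIN. */
	do {
		ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
	} while (ret == -1 && (errno == EINTR || errno == EAGAIN));

	return ret;
}
```

The obvious downside, as noted above, is that native (non-imported) buffers would need a fd created just for syncing.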
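[Editor's note: earlier in the thread the conclusion was that explicit clflushing should not be needed with xe's enforced coherency, but should that ever be relaxed, the WB "cpu access end" path would look roughly like the x86-only sketch below. The helper name and the hard-coded 64-byte cache-line size are assumptions for illustration; real code would query the line size, e.g. via CPUID.]

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>	/* _mm_clflush, _mm_sfence */

/*
 * Hypothetical helper, x86-only: write back (and invalidate) every
 * cache line covering [addr, addr + len) so a non-coherent GPU engine
 * observes the CPU's writes. A 64-byte line is assumed here.
 */
static void umd_clflush_range(const void *addr, size_t len)
{
	const size_t line = 64;
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(line - 1);
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += line)
		_mm_clflush((const void *)p);
	_mm_sfence();	/* order the flushes against a later GPU kick */
}
```

For WC mappings the loop is unnecessary and only the trailing sfence (the `__builtin_ia32_sfence()` from the `xe_bo_flush` sketch above) remains, since there is no cache line to write back, only the per-core write-combine buffer to drain.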
end of thread, other threads:[~2023-12-14 12:34 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2023-12-05 14:19 [Intel-xe] LLC configurating, mmap and bo cache management questions Tvrtko Ursulin
2023-12-06  8:26 ` Hellstrom, Thomas
2023-12-06 10:58   ` Tvrtko Ursulin
2023-12-06 11:46     ` Hellstrom, Thomas
2023-12-07 11:11       ` Tvrtko Ursulin
2023-12-07 12:01         ` Thomas Hellström
2023-12-13 11:55           ` Tvrtko Ursulin
2023-12-13 17:27             ` Thomas Hellström
2023-12-13 17:50               ` Tvrtko Ursulin
2023-12-13 20:11                 ` Thomas Hellström
2023-12-14  8:10                   ` Tvrtko Ursulin
2023-12-14 10:52                     ` Thomas Hellström
2023-12-14 12:34                       ` Tvrtko Ursulin