Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

From: "Christian König" <christian.koenig@amd.com>
To: Daniel Vetter <daniel@ffwll.ch>
Cc: "moderated list:DMA BUFFER SHARING FRAMEWORK"
	<linaro-mm-sig@lists.linaro.org>,
	"Thomas Hellström (VMware)" <thomas_os@shipmail.org>,
	intel-gfx <intel-gfx@lists.freedesktop.org>,
	dri-devel <dri-devel@lists.freedesktop.org>,
	"open list:DMA BUFFER SHARING FRAMEWORK"
	<linux-media@vger.kernel.org>
Subject: Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2
Date: Thu, 27 Feb 2020 10:20:19 +0100	[thread overview]
Message-ID: <d524ecc5-5a18-e20f-8d9e-7060d49cb12e@amd.com> (raw)
In-Reply-To: <CAKMK7uFrcRjjaDAwK73e3UYoONCz36k5SaUStGbjMz7q5FqTMQ@mail.gmail.com>

Am 26.02.20 um 17:32 schrieb Daniel Vetter:
> On Tue, Feb 25, 2020 at 6:16 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>> On Mon, Feb 24, 2020 at 07:46:59PM +0100, Christian König wrote:
>>> Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware):
>>>> On 2/23/20 4:45 PM, Christian König wrote:
>>>>> Am 21.02.20 um 18:12 schrieb Daniel Vetter:
>>>>>> [SNIP]
>>>>>> Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly
>>>>>> degenerating
>>>>>> into essentially a global lock. But only when there's actual contention
>>>>>> and thrashing.
>>>>> Yes exactly. A really big problem in TTM is currently that we drop
>>>>> the lock after evicting BOs because they tend to move in again
>>>>> directly after that.
>>>>>
>>>>>  From practice I can also confirm that there is exactly zero benefit
>>>>> from dropping locks early and reacquire them for example for the VM
>>>>> page tables. That's just makes it more likely that somebody needs to
>>>>> roll back and this is what we need to avoid in the first place.
>>>> If you have a benchmarking setup available it would be very interesting
>>>> for future reference to see how changing from WD to WW mutexes affects
>>>> the roll back frequency. WW is known to cause rollbacks much less
>>>> frequently but there is more work associated with each rollback.
>>> Not of hand. To be honest I still have a hard time to get a grip on the
>>> difference between WD and WW from the algorithm point of view. So I can't
>>> judge that difference at all.
>>>
>>>>> Contention on BO locks during command submission is perfectly fine
>>>>> as long as this is as lightweight as possible while we don't have
>>>>> trashing. When we have trashing multi submission performance is best
>>>>> archived to just favor a single process to finish its business and
>>>>> block everybody else.
>>>> Hmm. Sounds like we need a per-manager ww_rwsem protecting manager
>>>> allocation, taken in write-mode then there's thrashing. In read-mode
>>>> otherwise. That would limit the amount of "unnecessary" locks we'd have
>>>> to keep and reduce unwanted side-effects, (see below):
>>> Well per-manager (you mean per domain here don't you?) doesn't sound like
>>> that useful because we rarely use only one domain, but I'm actually
>>> questioning for quite a while if the per BO lock scheme was the right
>>> approach.
>>>
>>> See from the performance aspect the closest to ideal solution I can think of
>>> would be a ww_rwsem per user of a resource.
>>>
>>> In other words we don't lock BOs, but instead a list of all their users and
>>> when you want to evict a BO you need to walk that list and inform all users
>>> that the BO will be moving.
>>>
>>> During command submission you then have the fast path which rather just
>>> grabs the read side of the user lock and check if all BOs are still in the
>>> expected place.
>>>
>>> If some BOs were evicted you back off and start the slow path, e.g. maybe
>>> even copy additional data from userspace then grab the write side of the
>>> lock etc.. etc...
>>>
>>> That approach is similar to what we use in amdgpu with the per-VM BOs, but
>>> goes a step further. Problem is that we are so used to per BO locks in the
>>> kernel that this is probably not doable any more.
>> Yeah I think it'd be nice to have the same approach for shared bo too. I
>> guess what we could do is something like this (spinning your ww_rwmutex
>> idea a bit further):
>>
>> dma_buf_read_lock(buf, vm)
>> {
>>          if (enabled(CONFIG_DEBUG_WW_MUTEX_SLOWPATH))
>>          {
>>                  check that vm is indeed listed in buf and splat if not
>>          }
>>
>>          /* for a buf that's not shared in multiple vm we'd have buf->resv
>>           * == vm->resv here */
>>          return ww_mutex_lock(vm->resv);
>> }
>>
>> dma_buf_write_lock(buf)
>> {
>>          for_each_vm_in_buf(buf, vm) {
>>                  ww_mutex_lock(vm->resv);
>>          }
>> }
>>
>> Ideally we'd track all these vms with something slightly less shoddy than
>> a linked list :-) Resizeable array is probably pretty good, I think we
>> only ever need to go from buf -> vm list, not the other way round. At
>> least in dma_resv/dma_buf code, driver code ofc needs to keep a list of
>> all bo bound to a vm somewhere. But that's probably a much bigger
>> datastructure for tracking vma offsets and mappings and other things on
>> top.
>>
>> Ofc to even just get there we'd need something like the sublock list to
>> keep track of all the additional locks we'd need for the writer lock. And
>> we'd need the release callback for backoff, so that we could also go
>> through the slowpath on a vm object that we're not holding a full
>> reference on. That also means vm need to be refcounted.
>>
>> And the list of vms on a buffer need to be protected with some lock and
>> the usual kref_get_unless_zero trickery.
>>
>> But with all this I think we can make the dma_buf_write_lock lock 100%
>> like the old per-buffer lock for everyone. And execbuf could switch over
>> to dma_buf_read_lock for shared buffers. Bonus points when the gpu context
>> just keeps track of a list of shared vm used by buffers in that context
>> ...
>>
>> That way we could make vm fastpath locking a la amdgpu opt-in, while
>> keeping everyone else on the per-object locking juices.
>>
>> Thoughts?

At least to me that sounds like a plan.

> One thing I just realized, which is nasty: The full (write) lock needs
> ww_acquire_ctx with this, because it needs to take a bunch of locks.
> Rolling that out everywhere is going to be nasty.

Why? Take a single write lock shouldn't be different to taking a single 
ww_mutex, or am I missing something?

> I guess though we could do a fallback and have a locally created
> ww_acquire_ctx if there's none passed in, with backoff entirely
> implemented within dma_resv_lock.

How should that work? As far as I understand it the ww_acquire_ctx must 
be kept existing until after the last of the locks it was used with is 
unlocked. Or do I see this incorrectly?

> -Daniel
>
>> Cheers, Daniel
>>
>> PS: Of course the write lock for these buffers is going to be terrible, so
>> every time you need to update fences for implicit sync on shared buffer
>> (well write fence at least) it's going to suck. We probably also want a
>> read_to_write_upgrade function, which also can be done easily with
>> ww_mutex magic.

I'm thinking that we probably sole want a read_to_write upgrade function.

Regards,
Christian.

>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx