Re: [RFC] drm/i915: Add sync framework support to execbuff IOCTL

From: Daniel Vetter <daniel@ffwll.ch>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Intel-GFX@Lists.FreeDesktop.Org
Subject: Re: [RFC] drm/i915: Add sync framework support to execbuff IOCTL
Date: Mon, 6 Jul 2015 17:37:57 +0200	[thread overview]
Message-ID: <20150706153757.GA7568@phenom.ffwll.local> (raw)
In-Reply-To: <559A9CF8.9010601@linux.intel.com>

On Mon, Jul 06, 2015 at 04:21:28PM +0100, Tvrtko Ursulin wrote:
> 
> On 07/06/2015 04:12 PM, Daniel Vetter wrote:
> >On Mon, Jul 06, 2015 at 03:46:49PM +0100, Tvrtko Ursulin wrote:
> >>On 07/06/2015 03:26 PM, John Harrison wrote:
> >>>On 06/07/2015 14:59, Daniel Vetter wrote:
> >>>>On Mon, Jul 06, 2015 at 01:58:25PM +0100, John Harrison wrote:
> >>>>>On 06/07/2015 10:29, Daniel Vetter wrote:
> >>>>>>On Fri, Jul 03, 2015 at 12:17:33PM +0100, Tvrtko Ursulin wrote:
> >>>>>>>On 07/02/2015 04:55 PM, Chris Wilson wrote:
> >>>>>>>>It would be nice if we could reuse one seqno both for
> >>>>>>>>internal/external
> >>>>>>>>fences. If you need to expose a fence ordering within a timeline
> >>>>>>>>that is
> >>>>>>>>based on the creation stamp rather than execution stamp, it seems
> >>>>>>>>like
> >>>>>>>>we could just add such a stamp when creating the sync_pt and not
> >>>>>>>>worry
> >>>>>>>>about its relationship to the execution seqno.
> >>>>>>>>
> >>>>>>>>Doing so does expose that requests are reordered to userspace
> >>>>>>>>since the
> >>>>>>>>signalling timeline is not the same as userspace's ordered
> >>>>>>>>timeline. Not
> >>>>>>>>sure if that is a problem or not.
> >>>>>>>>
> >>>>>>>>Afaict the sync uapi is based on waiting for all of a set of
> >>>>>>>>fences to
> >>>>>>>>retire. It doesn't seem to rely on fence ordering (that is knowing
> >>>>>>>>that
> >>>>>>>>fence A will signal before fence B so it need only wait on fence B).
> >>>>>>>>
> >>>>>>>>Here's hoping that we can have both simplicity and efficiency...
> >>>>>>>Jumping in with not even perfect understanding of everything here -
> >>>>>>>but
> >>>>>>>timeline business has always been confusing me. There is nothing in
> >>>>>>>the
> >>>>>>>uapi which needs it afaics and iirc there was some discussion at
> >>>>>>>the time
> >>>>>>>Jesse floated his patches that it can be removed. Based on that when I
> >>>>>>>squashed his patches and ported them on top of John's request to fence
> >>>>>>>conversion it ended up something like the below (manually edited a
> >>>>>>>bit to
> >>>>>>>be less noisy and some prep patches omitted):
> >>>>>>>
> >>>>>>>This implements the ioctl based uapi and indeed seqnos are not
> >>>>>>>actually
> >>>>>>>used in waits. So is this insufficient for some reason? (Other that it
> >>>>>>>does not implement the input fence side of things.)
> >>>>>>Yeah android syncpt on top of struct fence embedded int i915 request is
> >>>>>>what I'd have expected.
> >>>>>The thing I'm not happy with in that plan is that it leaves the kernel
> >>>>>driver at the mercy of user land applications. If we return a fence
> >>>>>object
> >>>>>to user land via a file descriptor (or indeed any other mechanism)
> >>>>>then that
> >>>>>fence object must be locked until user land closes the file. If the
> >>>>>fence
> >>>>>object is the one embedded within our request structure then that
> >>>>>means user
> >>>>>land is effectively locking our request structure. Given that more
> >>>>>and more
> >>>>>stuff is being attached to the request, that could be a fair bit of
> >>>>>memory
> >>>>>tied up that we can do nothing about. E.g. if a rogue/buggy application
> >>>>>requests a fence be returned for every batch buffer submitted but never
> >>>>>closes them. Whereas, if we go the route of a separate fence object
> >>>>>specifically for user land then they can leak them like a sieve and
> >>>>>we won't
> >>>>>really care so much.
> >>>>Userspace can exhaust kernel allocations, that's nothing new. And if we
> >>>>keep it userspace simply needs to leak a few more fence fds than if
> >>>>there's a bit more data attached to it.
> >>>>
> >>>>The solution to this problem is to have a mem cgroup limit set. No
> >>>>need to
> >>>>complicate our kernel code.
> >>>
> >>>There is still the extra complication that request unreferencing cannot
> >>>require any kind of mutex lock if we are allowing it to happen from
> >>>outside of the driver. That means the unreference callback must move the
> >>>request to a 'please clean me later' list, schedule a worker thread to
> >>>run, and thus do the clean up asynchronously.
> >>
> >>For this particular issue my solution was to extend the sync_fence
> >>constructor to take a mutex and store it inside the object. Then at
> >>destruction time, which happens at sync_fd->f_ops->release() time, it is
> >>just a matter of calling kref_put_mutex instead of kref_put.
> >>
> >>Seemed to work under some quick testing but that is as much as I did back
> >>then.
> >
> >The problem is that it doesn't scale since everyone wants some other kind
> >of mutex to serialize the final kref_put. If something is supposed to be
> >cross-subsystem/driver (which is the case for fences) then we really can't
> >do that kind of leaky locking design. Imo we should have a kref_put_mutex
> >considered harmful sign somewhere ...
> 
> I get the argument about everything wanting to add their own not scaling,
> but don't tie with the leaky comment? Also mutex is a pretty standard thing,
> especially since kref_put_mutex. :D If you look at it from that angle it
> kind of just exposes to the super class what the base class can do.

leaky not as in leaking memory, but leaky abstraction - we impose locking
internals for i915 onto users of the fence interface (somewhat at least).
> 
> >If you have weak references somewhere and need to prevent the object from
> >disappearing untimely while chasing that weak reference then imo the
> >better design pattern is to use kref_get_unless_zero. If you need the
> >serialization the mutex provides for some other reason (someone is only
> >hodling the mutex instead of grabbing a proper refernce when they really
> >should grab one) then your refcounting scheme probably needs another kind
> >of fixup patch.
> 
> I don't see how weak references can work since if the request goes
> information is lost, unless stored somewhere else.

So weak refernce is a pointer which doesn't hold a reference to the object
controlled with kref, but protected by some lock. Latest when the object
is destroyed we need to clear that pointer in the free callback of
kref_put (and so also grab the lock that protects that pointer). The
problem is that anyone else chasing that weak reference might race with
the final kref_put and increase the refcount from 0 to 1, which isn't
good.

There's two ways to do that:
- kref_put_mutex on the release side.
- in the acquire side do a kref_get_unless_zero _while_ holding that lock.

Two upsides of the later approach:
- You can have an unrestricted amount of weak references, each protected
  with their own lock. kref_put_mutex only works up to one.
- It doesn't serialize the final kref_put with the lock - doing that
  allows folks to rely on the mutex instead of a proper refcount to make
  the obj stick around, which is imo a design antipattern that I suffered
  through trying to cleaning it up agian a lot.

But that's really a tangent to the discussion here, I have no idea whether
this applies here since I didn't read your patch in detail.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx