From mboxrd@z Thu Jan  1 00:00:00 1970
From: Daniel Vetter <daniel@ffwll.ch>
Subject: Re: [PATCH] [RFC] drm/i915: read-read semaphore
	optimization
Date: Tue, 13 Dec 2011 17:01:33 +0100
Message-ID: <20111213160133.GA4125@phenom.ffwll.local>
References: <1323748328-10153-1-git-send-email-ben@bwidawsk.net>
	<aefc95$2hvr0i@orsmga001.jf.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org>
Received: from mail-fx0-f49.google.com (mail-fx0-f49.google.com
	[209.85.161.49])
	by gabe.freedesktop.org (Postfix) with ESMTP id F03279E810
	for <intel-gfx@lists.freedesktop.org>;
	Tue, 13 Dec 2011 08:00:01 -0800 (PST)
Received: by fabs1 with SMTP id s1so370315fab.36
	for <intel-gfx@lists.freedesktop.org>;
	Tue, 13 Dec 2011 08:00:00 -0800 (PST)
Content-Disposition: inline
In-Reply-To: <aefc95$2hvr0i@orsmga001.jf.intel.com>
List-Unsubscribe: <http://lists.freedesktop.org/mailman/options/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <http://lists.freedesktop.org/archives/intel-gfx>
List-Post: <mailto:intel-gfx@lists.freedesktop.org>
List-Help: <mailto:intel-gfx-request@lists.freedesktop.org?subject=help>
List-Subscribe: <http://lists.freedesktop.org/mailman/listinfo/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=subscribe>
Sender: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org
Errors-To: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org
To: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>, Ben Widawsky <ben@bwidawsk.net>, intel-gfx@lists.freedesktop.org
List-Id: intel-gfx@lists.freedesktop.org

On Tue, Dec 13, 2011 at 09:49:34AM +0000, Chris Wilson wrote:
> On Mon, 12 Dec 2011 19:52:08 -0800, Ben Widawsky <ben@bwidawsk.net> wrote:
> > Since we don't differentiate on the different GPU read domains, it
> > should be safe to allow back to back reads to occur without issuing a
> > wait (or flush in the non-semaphore case).
> > 
> > This has the unfortunate side effect that we need to keep track of all
> > the outstanding buffer reads so that we can synchronize on a write, to
> > another ring (since we don't know which read finishes first). In other
> > words, the code is quite simple for two rings, but gets more tricky for
> > > 2 rings.
> > 
> > Here is a picture of the solution to the above problem
> > 
> > Ring 0            Ring 1             Ring 2
> > batch 0           batch 1            batch 2
> >   read buffer A     read buffer A      wait batch 0
> >                                        wait batch 1
> >                                        write buffer A
> > 
> > This code is really untested. I'm hoping for some feedback if this is
> > worth cleaning up, and testing more thoroughly.
> 
> Yes, that race is quite valid and the reason why I thought I hadn't made
> that optimisation. Darn. :(
> 
> To go a step further, we can split the obj->ring_list into
> (obj->ring_read_list[NUM_RINGS], obj->num_readers, obj->last_read_seqno) and
> (obj->ring_write_list, obj->last_write_seqno). At which point Daniel
> complains about bloating every i915_gem_object, and we probably should
> kmem_cache_alloc a i915_gem_object_seqno on demand. This allows us to track
> objects in multiple rings and implement read-write locking, albeit at
> significantly more complexity in managing the active lists.

I think the i915_gem_object bloat can be fought by stealing a few bits
from the various seqnos and storing the ring id in there. The thing that
makes me more uneasy is that I don't trust our gpu domain tracking
(especially since it's not per-ring). So either
- extend it to be per-ring
- or remove it all and invalidate/flush unconditionally.
In the light of all the complexity and the fact that due to our various
w/as I prefer the latter.

Afaik the only use-case for parallel reads is video decode with
post-processing on the render ring. The decode ring needs read-only access
to reference frames to decode the next frame and the render ring read-only
access to past frames for post-processing (e.g. deinterlacing). But given
the general state of perf optimizations in libva I think we have lower
hanging fruit to chase if we actually miss a performance target for this
use-case.

Cheers, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48