From mboxrd@z Thu Jan 1 00:00:00 1970 From: Daniel Vetter Subject: Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization Date: Tue, 13 Dec 2011 17:01:33 +0100 Message-ID: <20111213160133.GA4125@phenom.ffwll.local> References: <1323748328-10153-1-git-send-email-ben@bwidawsk.net> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Received: from mail-fx0-f49.google.com (mail-fx0-f49.google.com [209.85.161.49]) by gabe.freedesktop.org (Postfix) with ESMTP id F03279E810 for ; Tue, 13 Dec 2011 08:00:01 -0800 (PST) Received: by fabs1 with SMTP id s1so370315fab.36 for ; Tue, 13 Dec 2011 08:00:00 -0800 (PST) Content-Disposition: inline In-Reply-To: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org Errors-To: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org To: Chris Wilson Cc: Daniel Vetter , Ben Widawsky , intel-gfx@lists.freedesktop.org List-Id: intel-gfx@lists.freedesktop.org On Tue, Dec 13, 2011 at 09:49:34AM +0000, Chris Wilson wrote: > On Mon, 12 Dec 2011 19:52:08 -0800, Ben Widawsky wrote: > > Since we don't differentiate on the different GPU read domains, it > > should be safe to allow back to back reads to occur without issuing a > > wait (or flush in the non-semaphore case). > > > > This has the unfortunate side effect that we need to keep track of all > > the outstanding buffer reads so that we can synchronize on a write, to > > another ring (since we don't know which read finishes first). In other > > words, the code is quite simple for two rings, but gets more tricky for > > > 2 rings. > > > > Here is a picture of the solution to the above problem > > > > Ring 0 Ring 1 Ring 2 > > batch 0 batch 1 batch 2 > > read buffer A read buffer A wait batch 0 > > wait batch 1 > > write buffer A > > > > This code is really untested. I'm hoping for some feedback if this is > > worth cleaning up, and testing more thoroughly. > > Yes, that race is quite valid and the reason why I thought I hadn't made > that optimisation. Darn. :( > > To go a step further, we can split the obj->ring_list into > (obj->ring_read_list[NUM_RINGS], obj->num_readers, obj->last_read_seqno) and > (obj->ring_write_list, obj->last_write_seqno). At which point Daniel > complains about bloating every i915_gem_object, and we probably should > kmem_cache_alloc a i915_gem_object_seqno on demand. This allows us to track > objects in multiple rings and implement read-write locking, albeit at > significantly more complexity in managing the active lists. I think the i915_gem_object bloat can be fought by stealing a few bits from the various seqnos and storing the ring id in there. The thing that makes me more uneasy is that I don't trust our gpu domain tracking (especially since it's not per-ring). So either - extend it to be per-ring - or remove it all and invalidate/flush unconditionally. In the light of all the complexity and the fact that due to our various w/as I prefer the latter. Afaik the only use-case for parallel reads is video decode with post-processing on the render ring. The decode ring needs read-only access to reference frames to decode the next frame and the render ring read-only access to past frames for post-processing (e.g. deinterlacing). But given the general state of perf optimizations in libva I think we have lower hanging fruit to chase if we actually miss a performance target for this use-case. Cheers, Daniel -- Daniel Vetter Mail: daniel@ffwll.ch Mobile: +41 (0)79 365 57 48