From mboxrd@z Thu Jan  1 00:00:00 1970
From: Daniel Vetter <daniel@ffwll.ch>
Subject: Re: [PATCH] drm/i915: Wait for completion of pending
 flips when starved of fences
Date: Mon, 20 Jan 2014 11:37:42 +0100
Message-ID: <20140120103742.GD15089@phenom.ffwll.local>
References: <1390166413-9410-1-git-send-email-chris@chris-wilson.co.uk>
	<CAKMK7uGnC8HRkjC--zaGp0=QSGtn5ex41RuRBWMqUhtnscUHRA@mail.gmail.com>
	<20140120094924.GB27650@nuc-i3427.alporthouse.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <intel-gfx-bounces@lists.freedesktop.org>
Received: from mail-ee0-f50.google.com (mail-ee0-f50.google.com [74.125.83.50])
	by gabe.freedesktop.org (Postfix) with ESMTP id 7E8ADFAE0D
	for <intel-gfx@lists.freedesktop.org>;
	Mon, 20 Jan 2014 02:37:48 -0800 (PST)
Received: by mail-ee0-f50.google.com with SMTP id d17so3310489eek.37
	for <intel-gfx@lists.freedesktop.org>;
	Mon, 20 Jan 2014 02:37:46 -0800 (PST)
Content-Disposition: inline
In-Reply-To: <20140120094924.GB27650@nuc-i3427.alporthouse.com>
List-Unsubscribe: <http://lists.freedesktop.org/mailman/options/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <http://lists.freedesktop.org/archives/intel-gfx>
List-Post: <mailto:intel-gfx@lists.freedesktop.org>
List-Help: <mailto:intel-gfx-request@lists.freedesktop.org?subject=help>
List-Subscribe: <http://lists.freedesktop.org/mailman/listinfo/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=subscribe>
Sender: intel-gfx-bounces@lists.freedesktop.org
Errors-To: intel-gfx-bounces@lists.freedesktop.org
To: Chris Wilson <chris@chris-wilson.co.uk>, Daniel Vetter <daniel@ffwll.ch>, intel-gfx <intel-gfx@lists.freedesktop.org>
List-Id: intel-gfx@lists.freedesktop.org

On Mon, Jan 20, 2014 at 09:49:24AM +0000, Chris Wilson wrote:
> On Sun, Jan 19, 2014 at 10:55:26PM +0100, Daniel Vetter wrote:
> > On Sun, Jan 19, 2014 at 10:20 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > On older generations (gen2, gen3) the GPU requires fences for many
> > > operations, such as blits. The display hardware also requires fences for
> > > scanouts and this leads to a situation where an arbitrary number of
> > > fences may be pinned by old scanouts following a pageflip but before we
> > > have executed the unpin workqueue. This is unpredictable by userspace
> > > and leads to random EDEADLK when submitting an otherwise benign
> > > execbuffer. However, we can detect when we have an outstanding flip and
> > > so cause userspace to wait upon their completion before finally
> > > declaring that the system is starved of fences. This is really no worse
> > > than forcing the GPU to stall waiting for older execbuffer to retire and
> > > release their fences before we can reallocate them for the next
> > > execbuffer.
> > >
> > > Reported-and-tested-by: dimon@gmx.net
> > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=73696
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > 
> > New subtest for kms_flip which submits such a blt buffer while a
> > pageflip is still pending?
> 
> Correct.
> 
> > Also there's a certain chance we'll starve
> > the unpin work, similar to the issues around flushing the unpin work
> > in our pageflip implementation.
> 
> If you mean that we will never run the unpin workqueue, that's what the
> implementation will fix, eventually, after a busy-spin in userspace since
> set_need_resched() was removed. I can teach userspace to yield() after
> an EAGAIN which seems a reasonable compromise (userspace gets a bonus
> for being cooperative rather than penalized for using up its timeslice.)

yield won't help; we need to block until the workqueue has drained, like
we do in the pageflip code with flush_workqueue. At least we've had bug
reports in the past where someone found it intriguing to run his entire
userspace with rt prio, which ended up starving the sched_normal workqueue
and so livelocked the entire system.

Instead of busy-looping through userspace with -EAGAIN I think we should
keep all the unpin works on a spinlock-protected list and synchronously
unpin the buffers in the get_fence and evict_something paths (after the
flip has completed, the unpin entry has been removed from the list and the
spinlock has been dropped, of course).
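
To make the idea concrete, here is a rough user-space sketch of the list
handling, not actual i915 code. Every name in it (unpin_work, queue_unpin,
drain_unpin_list, gpu_hung) is made up for illustration, a C11 atomic_flag
stands in for the kernel spinlock, and a plain pin counter stands in for the
fence. It only shows the ordering: detach a completed entry under the lock,
drop the lock, then unpin synchronously, with a hang check up front.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one deferred unpin; not an i915 structure. */
struct unpin_work {
	struct unpin_work *next;
	bool flip_done;		/* set once the flip has completed */
	int *fence_pins;	/* pin count held by the old scanout */
};

static atomic_flag unpin_lock = ATOMIC_FLAG_INIT; /* spinlock stand-in */
static struct unpin_work *unpin_list;
bool gpu_hung;			/* assumed to be set by hang detection */

static void lock(void)
{
	while (atomic_flag_test_and_set_explicit(&unpin_lock,
						 memory_order_acquire))
		;		/* spin */
}

static void unlock(void)
{
	atomic_flag_clear_explicit(&unpin_lock, memory_order_release);
}

/* Called when a flip is queued: remember the unpin we owe. */
void queue_unpin(struct unpin_work *w)
{
	lock();
	w->next = unpin_list;
	unpin_list = w;
	unlock();
}

/* Detach and return the first entry whose flip completed, or NULL. */
static struct unpin_work *take_completed(void)
{
	struct unpin_work **p, *w = NULL;

	lock();
	for (p = &unpin_list; *p; p = &(*p)->next) {
		if ((*p)->flip_done) {
			w = *p;
			*p = w->next;
			break;
		}
	}
	unlock();
	return w;
}

/*
 * Called from the fence allocation path when no fence is free.
 * Returns the number of entries unpinned, or -EIO if the gpu is hung.
 * A real implementation would wait for still-pending flips to complete
 * instead of returning 0 here.
 */
int drain_unpin_list(void)
{
	struct unpin_work *w;
	int released = 0;

	if (gpu_hung)
		return -EIO;

	while ((w = take_completed()) != NULL) {
		/* Lock is dropped here, so the unpin itself may be slow. */
		assert(*w->fence_pins > 0);
		(*w->fence_pins)--;
		w->next = NULL;
		released++;
	}
	return released;
}
```

The point of detaching the entry before unpinning is that the spinlock only
ever covers list manipulation, never the unpin itself.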

The only downside is that we have a notch more complexity since we need to
manually check for gpu hangs and bail out correctly if there is one. Which
means another kms_flip subtest, but that shouldn't be too much fuss with
the combinatorial testflags we already have.

Since we don't have a test where rt threads starve our workers for the
normal pageflip code, I think we can eschew that part here, too. I'll add it
to the i-g-t wishlist though for a rainy afternoon ;-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch