stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: "Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org,
	"Ville Syrjälä" <ville.syrjala@linux.intel.com>,
	"Chris Wilson" <chris@chris-wilson.co.uk>,
	"Daniel Vetter" <daniel.vetter@ffwll.ch>
Subject: [ 25/71] drm/i915: fix wait_for_pending_flips vs gpu hang deadlock
Date: Sun, 29 Sep 2013 12:27:37 -0700	[thread overview]
Message-ID: <20130929192645.238568857@linuxfoundation.org> (raw)
In-Reply-To: <20130929192643.539596256@linuxfoundation.org>

3.11-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Daniel Vetter <daniel.vetter@ffwll.ch>

commit 17e1df07df0fbc77696a1e1b6ccf9f2e5af70e40 upstream.

My g33 here seems to be shockingly good at hitting them all. This time
around kms_flip/flip-vs-panning-vs-hang blows up:

intel_crtc_wait_for_pending_flips correctly checks for gpu hangs and
if a gpu hang is pending aborts the wait for outstanding flips so that
the setcrtc call will succeed and release the crtc mutex. And the gpu
hang handler needs that lock in intel_display_handle_reset to be able
to complete outstanding flips.

The problem is that we can race in two ways:
- Waiters on the dev_priv->pending_flip_queue aren't woken up after
  we've the reset as pending, but before we actually start the reset
  work. This means that the waiter doesn't notice the pending reset
  and hence will keep on hogging the locks.

  Like with dev->struct_mutex and the ring->irq_queue wait queues we
  there need to wake up everyone that potentially holds a lock which
  the reset handler needs.

- intel_display_handle_reset was called _after_ we've already
  signalled the completion of the reset work. Which means a waiter
  could sneak in, grab the lock and never release it (since the
  pageflips won't ever get released).

  Similar to resetting the gem state all the reset work must complete
  before we update the reset counter. Contrary to the gem reset we
  don't need to have a second explicit wake up call since that will
  have happened already when completing the pageflips. We also don't
  have any issues that the completion happens while the reset state is
  still pending - wait_for_pending_flips is only there to ensure we
  display the right frame. After a gpu hang&reset events such
  guarantees are out the window anyway. This is in contrast to the gem
  code where too-early wake-up would result in unnecessary restarting
  of ioctls.

Also, since we've gotten these various deadlocks and ordering
constraints wrong so often throw copious amounts of comments at the
code.

This deadlock regression has been introduced in the commit which added
the pageflip reset logic to the gpu hang work:

commit 96a02917a0131e52efefde49c2784c0421d6c439
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Mon Feb 18 19:08:49 2013 +0200

    drm/i915: Finish page flips and update primary planes after a GPU reset

v2:
- Add comments to explain how the wake_up serves as memory barriers
  for the atomic_t reset counter.
- Improve the comments a bit as suggested by Chris Wilson.
- Extract the wake_up calls before/after the reset into a little
  i915_error_wake_up and unconditionally wake up the
  pending_flip_queue waiters, again as suggested by Chris Wilson.

v3: Throw copious amounts of comments at i915_error_wake_up as
suggested by Chris Wilson.

Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/gpu/drm/i915/i915_irq.c |   68 +++++++++++++++++++++++++++++++---------
 1 file changed, 54 insertions(+), 14 deletions(-)

--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1407,6 +1407,34 @@ done:
 	return ret;
 }
 
+static void i915_error_wake_up(struct drm_i915_private *dev_priv,
+			       bool reset_completed)
+{
+	struct intel_ring_buffer *ring;
+	int i;
+
+	/*
+	 * Notify all waiters for GPU completion events that reset state has
+	 * been changed, and that they need to restart their wait after
+	 * checking for potential errors (and bail out to drop locks if there is
+	 * a gpu reset pending so that i915_error_work_func can acquire them).
+	 */
+
+	/* Wake up __wait_seqno, potentially holding dev->struct_mutex. */
+	for_each_ring(ring, dev_priv, i)
+		wake_up_all(&ring->irq_queue);
+
+	/* Wake up intel_crtc_wait_for_pending_flips, holding crtc->mutex. */
+	wake_up_all(&dev_priv->pending_flip_queue);
+
+	/*
+	 * Signal tasks blocked in i915_gem_wait_for_error that the pending
+	 * reset state is cleared.
+	 */
+	if (reset_completed)
+		wake_up_all(&dev_priv->gpu_error.reset_queue);
+}
+
 /**
  * i915_error_work_func - do process context error handling work
  * @work: work struct
@@ -1421,11 +1449,10 @@ static void i915_error_work_func(struct
 	drm_i915_private_t *dev_priv = container_of(error, drm_i915_private_t,
 						    gpu_error);
 	struct drm_device *dev = dev_priv->dev;
-	struct intel_ring_buffer *ring;
 	char *error_event[] = { "ERROR=1", NULL };
 	char *reset_event[] = { "RESET=1", NULL };
 	char *reset_done_event[] = { "ERROR=0", NULL };
-	int i, ret;
+	int ret;
 
 	kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE, error_event);
 
@@ -1444,8 +1471,16 @@ static void i915_error_work_func(struct
 		kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE,
 				   reset_event);
 
+		/*
+		 * All state reset _must_ be completed before we update the
+		 * reset counter, for otherwise waiters might miss the reset
+		 * pending state and not properly drop locks, resulting in
+		 * deadlocks with the reset work.
+		 */
 		ret = i915_reset(dev);
 
+		intel_display_handle_reset(dev);
+
 		if (ret == 0) {
 			/*
 			 * After all the gem state is reset, increment the reset
@@ -1466,12 +1501,11 @@ static void i915_error_work_func(struct
 			atomic_set(&error->reset_counter, I915_WEDGED);
 		}
 
-		for_each_ring(ring, dev_priv, i)
-			wake_up_all(&ring->irq_queue);
-
-		intel_display_handle_reset(dev);
-
-		wake_up_all(&dev_priv->gpu_error.reset_queue);
+		/*
+		 * Note: The wake_up also serves as a memory barrier so that
+		 * waiters see the update value of the reset counter atomic_t.
+		 */
+		i915_error_wake_up(dev_priv, true);
 	}
 }
 
@@ -2109,8 +2143,6 @@ static void i915_report_and_clear_eir(st
 void i915_handle_error(struct drm_device *dev, bool wedged)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
-	struct intel_ring_buffer *ring;
-	int i;
 
 	i915_capture_error_state(dev);
 	i915_report_and_clear_eir(dev);
@@ -2120,11 +2152,19 @@ void i915_handle_error(struct drm_device
 				&dev_priv->gpu_error.reset_counter);
 
 		/*
-		 * Wakeup waiting processes so that the reset work item
-		 * doesn't deadlock trying to grab various locks.
+		 * Wakeup waiting processes so that the reset work function
+		 * i915_error_work_func doesn't deadlock trying to grab various
+		 * locks. By bumping the reset counter first, the woken
+		 * processes will see a reset in progress and back off,
+		 * releasing their locks and then wait for the reset completion.
+		 * We must do this for _all_ gpu waiters that might hold locks
+		 * that the reset work needs to acquire.
+		 *
+		 * Note: The wake_up serves as the required memory barrier to
+		 * ensure that the waiters see the updated value of the reset
+		 * counter atomic_t.
 		 */
-		for_each_ring(ring, dev_priv, i)
-			wake_up_all(&ring->irq_queue);
+		i915_error_wake_up(dev_priv, false);
 	}
 
 	/*



  parent reply	other threads:[~2013-09-29 19:27 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-09-29 19:27 [ 00/71] 3.11.3-stable review Greg Kroah-Hartman
2013-09-29 19:27 ` [ 01/71] PCI / ACPI / PM: Clear pme_poll for devices in D3cold on wakeup Greg Kroah-Hartman
2013-09-29 19:27 ` [ 02/71] ARM: OMAP4: Fix clock_get error for GPMC during boot Greg Kroah-Hartman
2013-09-29 19:27 ` [ 03/71] net: usb: cdc_ether: Use wwan interface for Telit modules Greg Kroah-Hartman
2013-09-29 19:27 ` [ 04/71] cifs: fix filp leak in cifs_atomic_open() Greg Kroah-Hartman
2013-09-29 19:27 ` [ 05/71] bgmac: fix internal switch initialization Greg Kroah-Hartman
2013-09-29 19:27 ` [ 06/71] rt2800: change initialization sequence to fix system freeze Greg Kroah-Hartman
2013-09-29 19:27 ` [ 07/71] rt2800: fix wrong TX power compensation Greg Kroah-Hartman
2013-09-29 19:27 ` [ 08/71] timekeeping: Fix HRTICK related deadlock from ntp lock changes Greg Kroah-Hartman
2013-09-29 19:27 ` [ 09/71] sched/cputime: Do not scale when utime == 0 Greg Kroah-Hartman
2013-09-29 19:27 ` [ 10/71] sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones Greg Kroah-Hartman
2013-09-29 19:27 ` [ 11/71] HID: provide a helper for validating hid reports Greg Kroah-Hartman
2013-09-29 19:27 ` [ 12/71] HID: validate feature and input report details Greg Kroah-Hartman
2013-09-29 19:27 ` [ 13/71] HID: multitouch: validate indexes details Greg Kroah-Hartman
2013-09-29 19:27 ` [ 14/71] HID: LG: validate HID output report details Greg Kroah-Hartman
2013-09-29 19:27 ` [ 15/71] HID: zeroplus: validate " Greg Kroah-Hartman
2013-09-29 19:27 ` [ 16/71] HID: lenovo-tpkbd: fix leak if tpkbd_probe_tp fails Greg Kroah-Hartman
2013-09-29 19:27 ` [ 17/71] HID: steelseries: validate output report details Greg Kroah-Hartman
2013-09-29 19:27 ` [ 18/71] HID: sony: validate HID " Greg Kroah-Hartman
2013-09-29 19:27 ` [ 19/71] HID: lenovo-tpkbd: validate " Greg Kroah-Hartman
2013-09-29 19:27 ` [ 20/71] HID: logitech-dj: " Greg Kroah-Hartman
2013-09-29 19:27 ` [ 21/71] usb: gadget: fix a bug and a WARN_ON in dummy-hcd Greg Kroah-Hartman
2013-09-29 19:27 ` [ 22/71] drm/i915: try not to lose backlight CBLV precision Greg Kroah-Hartman
2013-09-29 19:27 ` [ 23/71] drm/i915: fix hpd work vs. flush_work in the pageflip code deadlock Greg Kroah-Hartman
2013-09-29 19:27 ` [ 24/71] drm/i915: fix gpu hang vs. flip stall deadlocks Greg Kroah-Hartman
2013-09-29 19:27 ` Greg Kroah-Hartman [this message]
2013-09-29 19:27 ` [ 26/71] drm/i915: do not update cursor in crtc mode set Greg Kroah-Hartman
2013-09-29 19:27 ` [ 27/71] drm/i915: Dont enable the cursor on a disable pipe Greg Kroah-Hartman
2013-09-29 19:27 ` [ 28/71] drm: fix DRM_IOCTL_MODE_GETFB handle-leak Greg Kroah-Hartman
2013-09-29 19:27 ` [ 29/71] drm/ast: fix the ast open key function Greg Kroah-Hartman
2013-09-29 19:27 ` [ 30/71] drm/ttm: fix the tt_populated check in ttm_tt_destroy() Greg Kroah-Hartman
2013-09-29 19:27 ` [ 31/71] radeon kms: fix uninitialised hotplug work usage in r100_irq_process() Greg Kroah-Hartman
2013-09-29 19:27 ` [ 32/71] drm/nv50/disp: prevent false output detection on the original nv50 Greg Kroah-Hartman
2013-09-29 19:27 ` [ 33/71] drm/radeon: fix LCD record parsing Greg Kroah-Hartman
2013-09-29 19:27 ` [ 34/71] drm/radeon/dpm: add reclocking quirk for ASUS K70AF Greg Kroah-Hartman
2013-09-29 19:27 ` [ 35/71] drm/radeon: fix endian bugs in hw i2c atom routines Greg Kroah-Hartman
2013-09-29 19:27 ` [ 36/71] drm/radeon: enable UVD interrupts on CIK Greg Kroah-Hartman
2013-09-29 19:27 ` [ 37/71] drm/radeon: fill in gpu_init for berlin GPU cores Greg Kroah-Hartman
2013-09-29 19:27 ` [ 38/71] drm/radeon: update line buffer allocation for dce8 Greg Kroah-Hartman
2013-09-29 19:27 ` [ 39/71] drm/radeon: fix init ordering for r600+ Greg Kroah-Hartman
2013-09-29 19:27 ` [ 40/71] drm/radeon/cik: update gpu_init for an additional berlin gpu Greg Kroah-Hartman
2013-09-29 19:27 ` [ 41/71] drm/radeon: add berlin pci ids Greg Kroah-Hartman
2013-09-29 19:27 ` [ 42/71] drm/radeon/si: Add support for CP DMA to CS checker for compute v2 Greg Kroah-Hartman
2013-09-29 19:27 ` [ 43/71] drm/radeon: update line buffer allocation for dce4.1/5 Greg Kroah-Hartman
2013-09-29 19:27 ` [ 44/71] drm/radeon: update line buffer allocation for dce6 Greg Kroah-Hartman
2013-09-29 19:27 ` [ 45/71] drm/radeon: fix resume on some rs4xx boards (v2) Greg Kroah-Hartman
2013-09-29 19:27 ` [ 46/71] drm/radeon: fix handling of variable sized arrays for router objects Greg Kroah-Hartman
2013-09-29 19:27 ` [ 47/71] drm/radeon/dpm: make sure dc performance level limits are valid (BTC-SI) (v2) Greg Kroah-Hartman
2013-09-29 19:28 ` [ 48/71] tg3: Dont turn off led on 5719 serdes port 0 Greg Kroah-Hartman
2013-09-29 19:28 ` [ 49/71] tg3: Expand led off fix to include 5720 Greg Kroah-Hartman
2013-09-29 19:28 ` [ 50/71] drm/radeon: add some additional berlin pci ids Greg Kroah-Hartman
2013-09-29 19:28 ` [ 51/71] drm/radeon/r6xx: add a stubbed out set_uvd_clocks callback Greg Kroah-Hartman
2013-09-29 19:28 ` [ 52/71] drm/radeon/atom: workaround vbios bug in transmitter table on rs880 (v2) Greg Kroah-Hartman
2013-09-29 19:28 ` [ 53/71] drm/radeon/dpm: handle bapm on trinity Greg Kroah-Hartman
2013-09-29 19:28 ` [ 54/71] drm/radeon/dpm: fix fallback for empty UVD clocks Greg Kroah-Hartman
2013-09-29 19:28 ` [ 55/71] drm/radeon/dpm/rs780: dont enable sclk scaling if not required Greg Kroah-Hartman
2013-09-29 19:28 ` [ 56/71] drm/radeon: fix panel scaling with eDP and LVDS bridges Greg Kroah-Hartman
2013-09-29 19:28 ` [ 57/71] drm/radeon: avoid UVD corruptions on AGP cards Greg Kroah-Hartman
2013-09-29 19:28 ` [ 58/71] skge: fix broken driver Greg Kroah-Hartman
2013-09-29 19:28 ` [ 59/71] udf: Standardize return values in mount sequence Greg Kroah-Hartman
2013-09-29 19:28 ` [ 60/71] udf: Refuse RW mount of the filesystem instead of making it RO Greg Kroah-Hartman
2013-09-29 19:28 ` [ 61/71] audit: fix endless wait in audit_log_start() Greg Kroah-Hartman
2013-09-29 19:28 ` [ 62/71] mm: fix aio performance regression for database caused by THP Greg Kroah-Hartman
2013-09-29 19:28 ` [ 63/71] bio-integrity: Fix use of bs->bio_integrity_pool after free Greg Kroah-Hartman
2013-09-29 19:28 ` [ 64/71] cfq: explicitly use 64bit divide operation for 64bit arguments Greg Kroah-Hartman
2013-09-29 19:28 ` [ 65/71] rpc: clean up decoding of gssproxy linux creds Greg Kroah-Hartman
2013-09-29 19:28 ` [ 66/71] rpc: comment on linux_cred encoding, treat all as unsigned Greg Kroah-Hartman
2013-09-29 19:28 ` [ 67/71] rpc: fix huge kmallocs in gss-proxy Greg Kroah-Hartman
2013-09-29 19:28 ` [ 68/71] rpc: let xdr layer allocate gssproxy receieve pages Greg Kroah-Hartman
2013-09-29 19:28 ` [ 69/71] cw1200: Prevent a lock-related hang in the cw1200_spi driver Greg Kroah-Hartman
2013-09-29 19:28 ` [ 70/71] cw1200: Dont perform SPI transfers in interrupt context Greg Kroah-Hartman
2013-10-02  2:23   ` Solomon Peachy
2013-10-02 21:26     ` Greg Kroah-Hartman
2013-10-03 13:22       ` Solomon Peachy
2013-09-29 19:28 ` [ 71/71] netfilter: ipset: Fix serious failure in CIDR tracking Greg Kroah-Hartman
2013-09-30  1:28 ` [ 00/71] 3.11.3-stable review Guenter Roeck
2013-09-30  1:51   ` Greg Kroah-Hartman
2013-09-30  2:22     ` Guenter Roeck
2013-10-01 19:23 ` Shuah Khan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130929192645.238568857@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=chris@chris-wilson.co.uk \
    --cc=daniel.vetter@ffwll.ch \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=ville.syrjala@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).