* [RFCv2 00/12] TDR/watchdog timeout support for gen8
@ 2015-07-21 13:58 Tomas Elf
2015-07-21 13:58 ` [RFCv2 01/12] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
` (12 more replies)
0 siblings, 13 replies; 15+ messages in thread
From: Tomas Elf @ 2015-07-21 13:58 UTC (permalink / raw)
To: Intel-GFX
This patch series introduces the following features:
* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.
TDR is an umbrella term for anything that goes into detecting and recovering from GPU hangs; it is a term more widely used outside of the upstream driver.
This feature introduces an extensible framework that currently supports gen8 but that can be easily extended to support gen7 as well (which is already the case in GMIN but unfortunately in a not quite upstreamable form). The code contained in this submission represents the essentials of what is currently in GMIN merged with what is currently in upstream (as of the time when this work commenced a few months back).
This feature adds a new hang recovery path alongside the legacy GPU reset path, which takes care of engine recovery only. Aside from adding support for per-engine recovery this feature also introduces rules for how to promote a potential per-engine reset to a legacy, full GPU reset.
The hang checker now integrates with the error handler in a slightly different way in that it allows hang recovery on multiple engines at the same time by passing an engine flag mask to the error handler where flags representing all of the hung engines are set. This allows us to schedule hang recovery once for all currently hung engines instead of one hang recovery per detected engine hang. Previously, when only full GPU reset was supported this was all the same since it wouldn't matter if one or four engines were hung at any given point since it would all amount to the same thing - the GPU getting reset. As it stands now the behaviour is different depending on which engine is hung since each engine is reset separately from all the other engines, therefore we have to think about this in terms of scheduling cost and recovery latency. (see open question below)
OPEN QUESTIONS:
1. Do we want to investigate the possibility of per-engine hang
detection? In the current upstream driver there is only one work queue
that handles the hang checker and everything from initial hang
detection to final hang recovery runs in this thread. This makes sense
if you're only supporting one form of hang recovery - using full GPU
reset and nothing tied to any particular engine. However, as part
of this patch series we're changing that by introducing per-engine
hang recovery. It could make sense to introduce multiple work
queues - one per engine - to run multiple hang checking threads in
parallel.
This would potentially allow savings in terms of recovery latency since
we don't have to scan all engines every time the hang checker is
scheduled and the error handler does not have to scan all engines every
time it is scheduled. Instead, we could implement one work queue per
engine that would invoke the hang checker that only checks _that_
particular engine and then the error handler is invoked for _that_
particular engine. If one engine has hung the latency for getting to
the hang recovery path for that particular engine would be (Time For
Hang Checking One Engine) + (Time For Error Handling One Engine) rather
than the time it takes to do hang checking for all engines + the time
it takes to do error handling for all engines that have been detected
as hung (which in the worst case would be all engines). There would
potentially be as many hang checker and error handling threads going on
concurrently as there are engines in the hardware but they would all be
running in parallel without any significant locking. The first time
where any thread needs exclusive access to the driver is at the point
of the actual hang recovery but the time it takes to get there would
theoretically be lower and the time it actually takes to do per-engine
hang recovery is quite a lot lower than the time it takes to actually
detect a hang reliably.
How much we would save by such a change still needs to be analysed and
compared against the current single-thread model but it makes sense
from a theoretical design point of view.
* Feature 2: Watchdog Timeout (a.k.a "media engine reset") for gen8.
This feature allows userland applications to control whether or not individual batch buffers should have a first-level, fine-grained, hardware-based hang detection mechanism on top of the ordinary, software-based periodic hang checker that is already in the driver. The advantage over relying solely on the current software-based hang checker is that the watchdog timeout mechanism is about 1000x quicker and more precise. Since it's not a full driver-level hang detection mechanism but only targeting one individual batch buffer at a time it can afford to be that quick without risking an increase in false positive hang detection.
This feature includes the following changes:
a) Watchdog timeout interrupt service routine for handling watchdog interrupts and connecting these to per-engine hang recovery.
b) Injection of watchdog timer enablement/cancellation instructions before/after the batch buffer start instruction in the ring buffer so that watchdog timeout is connected to the submission of an individual batch buffer.
c) Extension of the DRM batch buffer interface, exposing the watchdog timeout feature to userland. We've got two open source groups in VPG currently in the process of integrating support for this feature, which should make it possible in principle to upstream this extension.
There is currently full watchdog timeout support for gen7 in GMIN and it is quite similar to the gen8 implementation so there is nothing obvious that prevents us from upstreaming that code along with the gen8 code. However, watchdog timeout is fully dependent on the per-engine hang recovery path and that is not part of this code submission for gen7. Therefore watchdog timeout support for gen7 has been excluded until per-engine hang recovery support for
gen7 has landed upstream.
As part of this submission we've had to reinstate the work queue that was previously in place between the error handler and the hang recovery path. The reason for this is that the per-engine recovery path is called directly from the interrupt handler in the case of watchdog timeout. In that situation there's no way of grabbing the struct_mutex, which is a requirement for the hang recovery path. Therefore, by reinstating the work queue we provide a unified execution context for the hang recovery code that allows the hang recovery code to grab whatever locks it needs without sacrificing interrupt latency too much or sleeping indefinitely in hard interrupt context.
* Feature 3. Context Submission Status Consistency checking
Something that becomes apparent when you run long-duration operations tests with concurrent rendering processes with intermittently injected hangs is that it seems like the GPU forgets to send context completion interrupts to the driver under some circumstances. What this means is that the driver sometimes gets stuck on a context that never seems to finish, all the while the hardware has completed and is waiting for more work.
The problem with this is that the per-engine hang recovery path relies on context resubmission to kick off the hardware again following an engine reset.
This can only be done safely if the hardware and driver share the same opinion about the current state. Therefore we've extended the periodic hang checker to check for context submission state inconsistencies aside from the hang checking it already does.
If such a state is detected it is assumed (based on experience) that a context completion interrupt has been lost somehow. If this state persists for some time an attempt to correct it is made by faking the presumably lost context completion interrupt by manually calling the execlist interrupt handler, which is normally called from the main interrupt handler cued by a received context event interrupt. Just because an interrupt goes missing does not mean that the context status buffer (CSB) does not get appropriately updated by the hardware, which means that we can expect to find all the recent changes to the context states for each engine captured there.
If there are outstanding context status changes in store there then the faked context event interrupt will allow the interrupt handler to act on them. In the case of lost context completion interrupts this will prompt the driver to remove the already completed context from the execlist queue and move on to the next pending piece of work and thereby eliminating the inconsistency.
* Feature 4. Debugfs extensions for per-engine hang recovery and TDR/watchdog trace points.
GITHUB REPOSITORY:
https://github.com/telf/TDR_watchdog_RFC_1.git
RFC VERSION 1 BRANCH:
20150608_TDR_upstream_adaptation_single-thread_hangchecking_RFC_delivery_sendmail_1
RFC VERSION 2 BRANCH:
20150604_TDR_upstream_adaptation_single-thread_hangchecking_RFCv2_delivery_sendmail_1
CHANGES IN VERSION 2
--------------------
Version 2 of this RFC series addresses design concerns that Chris Wilson and Daniel Vetter et al. had with the first version of this RFC series. Below is a summary of all the changes made between versions:
* [1/12] drm/i915: Early exit from semaphore_waits_for for execlist mode:
Turned the execlist mode check into a ringbuffer NULL check to make it more submission mode agnostic and less of a layering violation.
* [2/12] drm/i915: Make i915_gem_reset_ring_status() public
Replaces the old patch in RFCv1: "drm/i915: Add reset stats entry point for per-engine reset"
* [3/12] drm/i915: Adding TDR / per-engine reset support for gen8:
1. Simply use the previously private function
i915_gem_reset_ring_status() from the engine hang recovery path to set
active/pending context status. This replicates the same behaviour as in
full GPU reset but for a single, targeted engine.
2. Remove all additional uevents for both full GPU reset and per-engine
reset. Adapted uevent behaviour to the new per-engine hang recovery
mode in that it will only send one uevent regardless of which form of
recovery is employed. If a per-engine reset is attempted first then
one uevent will be dispatched. If that recovery mode fails and the
hang is promoted to a full GPU reset no further uevents will be
dispatched at that point.
3. Removed the 2*HUNG hang threshold from i915_hangcheck_elapsed in
order to not make the hang detection algorithm too complicated. This
threshold was introduced to compensate for the possibility that hang
recovery might be delayed due to inconsistent context submission status
that would prevent per-engine hang recovery from happening. In a later
patch we introduce faked context event interrupts and inconsistency
rectification at the onset of per-engine hang recovery instead of
relying on the hang checker to do this for us. Therefore, since we do
not delay and defer to future hang detections, we never allow hangs to
go unaddressed beyond the HUNG threshold and therefore there is no need
for any further thresholds.
4. Tidied up the TDR context resubmission path in intel_lrc.c. Reduced
the amount of duplication by relying entirely on the normal unqueue
function. Added a new parameter to the unqueue function that takes
into consideration if the unqueue call is for a first-time context
submission or a resubmission and adapts the handling of elsp_submitted
accordingly. The reason for this is that for context resubmission we
don't expect any further interrupts for the submission or the following
context completion. A more elegant way of handling this would be to
phase out elsp_submitted altogether, however that's part of a
LRC/execlist cleanup effort that is happening independently of this
RFC. For now we make this change as simple as possible with as few
non-TDR-related side-effects as possible.
* [4/12] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress:
Removed unwarranted changes made to i915_gem_ring_throttle()
* [7/12] drm/i915: Fake lost context interrupts through forced CSB check
Remove context submission status consistency pre-check from
i915_hangcheck_elapsed() and turn it into a pre-check for per-engine reset.
The following describes the change in philosophy in how context submission state
inconsistencies are detected:
Previously we would let the periodic hang checker ensure that there were no
context submission status inconsistencies on any engine, at any point. If an
inconsistency was detected in the per-engine hang recovery path we would back
off and defer to the next hang check since per-engine hang recovery is not
effective during inconsistent context submission states.
What we do in this new version is to move the consistency pre-check from the
hang checker to the earliest point in the per-engine hang recovery path. If we
detect an inconsistency at that point we fake a potentially lost context event
interrupt by forcing a CSB check. If there are outstanding events in the CSB
these will be acted upon and hopefully that will bring the driver up to speed
with the hardware. If the CSB check did not amount to anything it is concluded
that the inconsistency is unresolvable and the per-engine hang recovery fails
and promotes to full GPU reset instead.
In the hang checker-based consistency checking we would check the inconsistency
for a number of times to make sure the detected state was stable before
attempting to rectify the situation. This is possible since hang checking
is a recurring event. Having moved the consistency checking to the recovery
path instead (i.e. a one-time, fire & forget-style event) it is assumed
that the hang detection that brought on the hang recovery has detected a
stable hang and therefore, if an inconsistency is detected at that point,
the inconsistency must be stable and not the result of a momentary context
state transition. Therefore, unlike in the hang checker case, at the very
first indication of an inconsistent context submission status the interrupt
is faked speculatively. If outstanding CSB events are found it is
determined that the hang was in fact just a context submission status
inconsistency and no hang recovery is done. If the inconsistency cannot be
resolved, the per-engine hang recovery fails and is promoted to a full GPU
reset.
* [8/12] drm/i915: Debugfs interface for per-engine hang recovery
1. After review comments by Chris Wilson we're dropping the dual-mode parameter
value interpretation in i915_wedged_set(). In this version we only accept
engine id flag masks that contain the engine id flags of all currently hung
engines. Full GPU reset is most easily requested by passing an all zero
engine id flag mask.
2. Moved TDR-specific engine metrics like number of detected engine hangs and
number of per-engine resets into i915_hangcheck_info() from
i915_hangcheck_read().
* [9/12] drm/i915: TDR/watchdog trace points
As a consequence of the faking context event interrupt commit being moved from
the hang checker to the per-engine recovery path we no longer check context
submission status from the hang checker. Therefore there is no need to provide
submission status of the currently running context to the
trace_i915_tdr_hang_check() event.
* [10/12] drm/i915: Fix __i915_wait_request() behaviour during hang detection
* [11/12] drm/i915: Port of Added scheduler support to __wait_request() calls
NEW: Added to address the way that __i915_wait_request()
behaves in the face of hang detections and hang recovery.
* [12/12] drm/i915: Extended error state with TDR count, watchdog count and engine reset count
NEW: Adds per-engine TDR statistics to captured error state.
Tomas Elf (12):
drm/i915: Early exit from semaphore_waits_for for execlist mode.
drm/i915: Make i915_gem_reset_ring_status() public
drm/i915: Adding TDR / per-engine reset support for gen8.
drm/i915: Extending i915_gem_check_wedge to check engine reset in
progress
drm/i915: Reinstate hang recovery work queue.
drm/i915: Watchdog timeout support for gen8.
drm/i915: Fake lost context interrupts through forced CSB check.
drm/i915: Debugfs interface for per-engine hang recovery.
drm/i915: TDR/watchdog trace points.
drm/i915: Port of Added scheduler support to __wait_request() calls
drm/i915: Fix __i915_wait_request() behaviour during hang detection.
drm/i915: Extended error state with TDR count, watchdog count and
engine reset count
drivers/gpu/drm/i915/i915_debugfs.c | 76 +++-
drivers/gpu/drm/i915/i915_dma.c | 78 ++++
drivers/gpu/drm/i915/i915_drv.c | 257 +++++++++++
drivers/gpu/drm/i915/i915_drv.h | 79 +++-
drivers/gpu/drm/i915/i915_gem.c | 128 ++++--
drivers/gpu/drm/i915/i915_gpu_error.c | 8 +-
drivers/gpu/drm/i915/i915_irq.c | 292 +++++++++++--
drivers/gpu/drm/i915/i915_params.c | 10 +
drivers/gpu/drm/i915/i915_reg.h | 13 +
drivers/gpu/drm/i915/i915_trace.h | 308 +++++++++++++-
drivers/gpu/drm/i915/intel_display.c | 2 +-
drivers/gpu/drm/i915/intel_lrc.c | 729 ++++++++++++++++++++++++++++++--
drivers/gpu/drm/i915/intel_lrc.h | 16 +-
drivers/gpu/drm/i915/intel_lrc_tdr.h | 39 ++
drivers/gpu/drm/i915/intel_ringbuffer.c | 87 +++-
drivers/gpu/drm/i915/intel_ringbuffer.h | 95 +++++
drivers/gpu/drm/i915/intel_uncore.c | 203 +++++++++
include/uapi/drm/i915_drm.h | 5 +-
18 files changed, 2313 insertions(+), 112 deletions(-)
create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h
--
1.9.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
* [RFCv2 01/12] drm/i915: Early exit from semaphore_waits_for for execlist mode.
2015-07-21 13:58 [RFCv2 00/12] TDR/watchdog timeout support for gen8 Tomas Elf
@ 2015-07-21 13:58 ` Tomas Elf
2015-07-21 13:58 ` [RFCv2 02/12] drm/i915: Make i915_gem_reset_ring_status() public Tomas Elf
` (11 subsequent siblings)
12 siblings, 0 replies; 15+ messages in thread
From: Tomas Elf @ 2015-07-21 13:58 UTC (permalink / raw)
To: Intel-GFX
When submitting semaphores in execlist mode the hang checker crashes in this
function because it is only runnable in ring submission mode. The reason this
is of particular interest to the TDR patch series is because we use semaphores
as a means to induce hangs during testing (which is the recommended way to
induce hangs for gen8+). It's not clear how this is supposed to work in
execlist mode since:
1. This function requires a ring buffer.
2. Retrieving a ring buffer in execlist mode requires us to retrieve the
corresponding context, which we get from a request.
3. Retrieving a request from the hang checker is not straightforward since that
requires us to grab the struct_mutex in order to synchronize against the
request retirement thread.
4. Grabbing the struct_mutex from the hang checker is not something we will do,
since that puts us at risk of deadlock: a hung thread might already be holding
the struct_mutex.
Therefore it's not obvious how we're supposed to deal with this. For now, we're
doing an early exit from this function, which avoids any kernel panic situation
when running our own internal TDR ULT.
* v2: (Chris Wilson)
Turned the execlist mode check into a ringbuffer NULL check to make it more
submission mode agnostic and less of a layering violation.
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
drivers/gpu/drm/i915/i915_irq.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 46bcbff..4e8e722 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2698,6 +2698,26 @@ semaphore_waits_for(struct intel_engine_cs *ring, u32 *seqno)
u64 offset = 0;
int i, backwards;
+ /*
+ * This function does not support execlist mode - any attempt to
+ * proceed further into this function will result in a kernel panic
+ * when dereferencing ring->buffer, which is not set up in execlist
+ * mode.
+ *
+ * The correct way of doing it would be to derive the currently
+ * executing ring buffer from the current context, which is derived
+ * from the currently running request. Unfortunately, to get the
+ * current request we would have to grab the struct_mutex before doing
+ * anything else, which would be ill-advised since some other thread
+ * might have grabbed it already and managed to hang itself, causing
+ * the hang checker to deadlock.
+ *
+ * Therefore, this function does not support execlist mode in its
+ * current form. Just return NULL and move on.
+ */
+ if (ring->buffer == NULL)
+ return NULL;
+
ipehr = I915_READ(RING_IPEHR(ring->mmio_base));
if (!ipehr_is_semaphore_wait(ring->dev, ipehr))
return NULL;
--
1.9.1
* [RFCv2 02/12] drm/i915: Make i915_gem_reset_ring_status() public
2015-07-21 13:58 [RFCv2 00/12] TDR/watchdog timeout support for gen8 Tomas Elf
2015-07-21 13:58 ` [RFCv2 01/12] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
@ 2015-07-21 13:58 ` Tomas Elf
2015-07-21 13:58 ` [RFCv2 03/12] drm/i915: Adding TDR / per-engine reset support for gen8 Tomas Elf
` (10 subsequent siblings)
12 siblings, 0 replies; 15+ messages in thread
From: Tomas Elf @ 2015-07-21 13:58 UTC (permalink / raw)
To: Intel-GFX
Makes i915_gem_reset_ring_status() public for use from engine reset path in
order to replicate the same behavior as in full GPU reset but for a single
engine.
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
drivers/gpu/drm/i915/i915_drv.h | 2 ++
drivers/gpu/drm/i915/i915_gem.c | 4 ++--
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 47be4a5..c32c502 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2781,6 +2781,8 @@ static inline bool i915_stop_ring_allow_warn(struct drm_i915_private *dev_priv)
}
void i915_gem_reset(struct drm_device *dev);
+void i915_gem_reset_ring_status(struct drm_i915_private *dev_priv,
+ struct intel_engine_cs *ring);
bool i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
int __must_check i915_gem_object_finish_gpu(struct drm_i915_gem_object *obj);
int __must_check i915_gem_init(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 8ce363a..4f4fc3e0 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2561,8 +2561,8 @@ i915_gem_find_active_request(struct intel_engine_cs *ring)
return NULL;
}
-static void i915_gem_reset_ring_status(struct drm_i915_private *dev_priv,
- struct intel_engine_cs *ring)
+void i915_gem_reset_ring_status(struct drm_i915_private *dev_priv,
+ struct intel_engine_cs *ring)
{
struct drm_i915_gem_request *request;
bool ring_hung;
--
1.9.1
* [RFCv2 03/12] drm/i915: Adding TDR / per-engine reset support for gen8.
2015-07-21 13:58 [RFCv2 00/12] TDR/watchdog timeout support for gen8 Tomas Elf
2015-07-21 13:58 ` [RFCv2 01/12] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
2015-07-21 13:58 ` [RFCv2 02/12] drm/i915: Make i915_gem_reset_ring_status() public Tomas Elf
@ 2015-07-21 13:58 ` Tomas Elf
2015-07-21 13:58 ` [RFCv2 04/12] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress Tomas Elf
` (9 subsequent siblings)
12 siblings, 0 replies; 15+ messages in thread
From: Tomas Elf @ 2015-07-21 13:58 UTC (permalink / raw)
To: Intel-GFX; +Cc: Ian Lister
This change introduces support for TDR-style per-engine reset as an initial,
less intrusive hang recovery option to be attempted before falling back to the
legacy full GPU reset recovery mode if necessary. Initially we're only
supporting gen8 but adding support for gen7 is straightforward since we've
already established an extensible framework where gen7 support can be plugged
in (add corresponding versions of intel_ring_enable, intel_ring_disable,
intel_ring_save, intel_ring_restore, etc.).
1. Per-engine recovery vs. Full GPU recovery
To capture the state of a single engine being detected as hung there is now a
new flag for every engine that can be set once the decision has been made to
schedule hang recovery for that particular engine.
The following algorithm is used to determine when to use which recovery mode:
a. Once the hang check score reaches level HUNG hang recovery is
scheduled as usual. The hang checker aggregates all engines currently
detected as hung into a single engine flag mask and passes that to the
error handler, which allows us to schedule hang recovery for all
currently hung engines in a single call.
b. The error handler checks all engines that have been marked as hung
by the hang checker and - more specifically - checks how long ago it
was since it last attempted to do per-engine hang recovery for each
respective, currently hung engine. If the measured time period is
within a certain time window, i.e. the last per-engine hang recovery
was done too recently, it is determined that per-engine hang recovery
is ineffective and the step is taken to promote a full GPU reset.
c. If the error handler determines that no currently hung engine has
recently had hang recovery a per-engine hang recovery is scheduled.
d. Additionally, if the hang checker detects that the hang check score
has grown too high (currently defined as twice the HUNG level) it
determines that previous hang recovery attempts have failed for
whatever reason and it will bypass the error handler's full GPU reset
promotion logic. One case where this is important is if the hang
checker and error handler think that per-engine hang recovery is a
suitable option and several such attempts are made - infrequently
enough - but no effective reset is done, perhaps due to inconsistent
context submission status, which is described further down below.
NOTE: Gen 7 and earlier will always promote to full GPU reset since there is
currently no per-engine reset support for these gens.
2. Context Submission Status Consistency.
Per-engine hang recovery on gen8 relies on the basic concept of context
submission status consistency. What this means is that we make sure that the
status of the hardware and the driver when it comes to the submission status of
the current context on any engine is consistent. For example, when submitting a
context to the corresponding ELSP port of an engine we expect the owning
request of that context to be at the head of the corresponding execution list
queue. Likewise, as long as the context is executing on the GPU we expect the
EXECLIST_STATUS register and the context status buffer to reflect this. Thus,
if the context submission status is consistent the ID of the currently
executing context should be in EXECLIST_STATUS and it should be consistent
with the context of the head request element in the execution list queue
corresponding to that engine.
The reason why this is important for per-engine hang recovery on gen8 is
because this recovery mode relies on context resubmission to resume execution
following the recovery. If a context has been determined to be hung and the
per-engine hang recovery mode is engaged leading to the resubmission of that
context it's important that the hardware is in fact not busy doing something
else or idle, since a resubmission during this state would cause unforeseen
side-effects such as unexpected preemptions.
There are rare, although consistently reproducible, situations that have shown
up in practice where the driver and hardware are no longer consistent with each
other, e.g. due to lost context completion interrupts after which the hardware
would be idle but the driver would still think that a context was still
active.
3. There is a new reset path for engine reset alongside the legacy full GPU
reset path. This path does the following:
1) Check for context submission consistency to make sure that the
context that the hardware is currently stuck on is actually what the
driver is working on. If not then clearly we're not in a consistently
hung state and we bail out early.
2) Disable/idle the engine. This is done through reset handshaking on
gen8+ unlike earlier gens where this was done by clearing the ring
valid bits in MI_MODE and ring control registers, which are no longer
supported on gen8+. Reset handshaking translates to setting the reset
request bit in the reset control register.
3) Save the current engine state.
What this translates to on gen8 is simply to read the current value of
the head register and nudge it so that it points to the next valid
instruction in the ring buffer. Since we assume that the execution is
currently stuck in a batch buffer the effect of this is that the
batchbuffer start instruction of the hung batch buffer is skipped so
that when execution resumes, following the hang recovery completion, it
resumes immediately following the batch buffer.
This effectively means that we're forcefully terminating the currently
active, hung batch buffer. Obviously, the outcome of this intervention
is potentially undefined but there are not many good options in this
scenario. It's better than resetting the entire GPU in the vast
majority of cases.
Save the nudged head value to be applied later.
4) Reset the engine.
5) Apply the nudged head value to the head register.
6) Reenable the engine. For gen8 this means resubmitting the fixed-up
context, allowing execution to resume. In order to resubmit a context
without relying on the currently hung execution list queues we use a
privileged API that is dedicated for TDR use only. This submission API
bypasses any currently queued work and gets exclusive access to the
ELSP ports.
7) If the engine hang recovery procedure fails at any point in between
disablement and reenablement of the engine there is a back-off
procedure: For gen8 it's possible to back out of the reset handshake by
clearing the reset request bit in the reset control register.
NOTE:
It's possible that some of Ben Widawsky's original per-engine reset patches
from 3 years ago are in this commit but since this work has gone through the
hands of at least 3 people already any kind of ownership tracking has been lost
a long time ago. If you think that you should be on the sob list just let me
know.
* v2: (Chris Wilson / Daniel Vetter)
- Simply use the previously private function i915_gem_reset_ring_status() from
the engine hang recovery path to set active/pending context status. This
replicates the same behaviour as in full GPU reset but for a single,
targeted engine.
- Remove all additional uevents for both full GPU reset and per-engine reset.
Adapted uevent behaviour to the new per-engine hang recovery mode in that it
will only send one uevent regardless of which form of recovery is employed.
If a per-engine reset is attempted first then one uevent will be dispatched.
If that recovery mode fails and the hang is promoted to a full GPU reset no
further uevents will be dispatched at that point.
- Removed the 2*HUNG hang threshold from i915_hangcheck_elapsed in order to not
make the hang detection algorithm too complicated. This threshold was
introduced to compensate for the possibility that hang recovery might be
delayed due to inconsistent context submission status that would prevent
per-engine hang recovery from happening. In a later patch we introduce faked
context event interrupts and inconsistency rectification at the onset of
per-engine hang recovery instead of relying on the hang checker to do this
for us. Therefore, since we do not delay and defer to future hang detections,
we never allow hangs to go unaddressed beyond the HUNG threshold, and so
there is no need for any further thresholds.
- Tidied up the TDR context resubmission path in intel_lrc.c. Reduced the
amount of duplication by relying entirely on the normal unqueue function.
Added a new parameter to the unqueue function that takes into consideration
if the unqueue call is for a first-time context submission or a resubmission
and adapts the handling of elsp_submitted accordingly. The reason for this is
that for context resubmission we don't expect any further interrupts for the
submission or the following context completion. A more elegant way of
handling this would be to phase out elsp_submitted altogether, however that's
part of a LRC/execlist cleanup effort that is happening independently of this
RFC. For now we make this change as simple as possible with as few
non-TDR-related side-effects as possible.
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
drivers/gpu/drm/i915/i915_debugfs.c | 2 +-
drivers/gpu/drm/i915/i915_dma.c | 18 ++
drivers/gpu/drm/i915/i915_drv.c | 198 ++++++++++++
drivers/gpu/drm/i915/i915_drv.h | 63 +++-
drivers/gpu/drm/i915/i915_irq.c | 199 ++++++++++--
drivers/gpu/drm/i915/i915_params.c | 10 +
drivers/gpu/drm/i915/i915_reg.h | 6 +
drivers/gpu/drm/i915/intel_lrc.c | 556 ++++++++++++++++++++++++++++++--
drivers/gpu/drm/i915/intel_lrc.h | 14 +
drivers/gpu/drm/i915/intel_lrc_tdr.h | 36 +++
drivers/gpu/drm/i915/intel_ringbuffer.c | 84 ++++-
drivers/gpu/drm/i915/intel_ringbuffer.h | 64 ++++
drivers/gpu/drm/i915/intel_uncore.c | 199 ++++++++++++
13 files changed, 1400 insertions(+), 49 deletions(-)
create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h
diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 8446ef4..e33e105 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -4183,7 +4183,7 @@ i915_wedged_set(void *data, u64 val)
intel_runtime_pm_get(dev_priv);
- i915_handle_error(dev, val,
+ i915_handle_error(dev, 0x0, val,
"Manually setting wedged to %llu", val);
intel_runtime_pm_put(dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index e44116f..cf01e84 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -776,6 +776,22 @@ static void intel_device_info_runtime_init(struct drm_device *dev)
info->has_eu_pg ? "y" : "n");
}
+static void
+i915_hangcheck_init(struct drm_device *dev)
+{
+ int i;
+ struct drm_i915_private *dev_priv = dev->dev_private;
+
+ for (i = 0; i < I915_NUM_RINGS; i++) {
+ struct intel_engine_cs *engine = &dev_priv->ring[i];
+ struct intel_ring_hangcheck *hc = &engine->hangcheck;
+
+ i915_hangcheck_reinit(engine);
+ hc->reset_count = 0;
+ hc->tdr_count = 0;
+ }
+}
+
/**
* i915_driver_load - setup chip and create an initial config
* @dev: DRM device
@@ -956,6 +972,8 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
i915_gem_load(dev);
+ i915_hangcheck_init(dev);
+
/* On the 945G/GM, the chipset reports the MSI capability on the
* integrated graphics even though the support isn't actually there
* according to the published specs. It doesn't appear to function
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index c3fdbb0..c7ba64e 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -34,6 +34,7 @@
#include "i915_drv.h"
#include "i915_trace.h"
#include "intel_drv.h"
+#include "intel_lrc_tdr.h"
#include <linux/console.h>
#include <linux/module.h>
@@ -581,6 +582,7 @@ static int i915_drm_suspend(struct drm_device *dev)
struct drm_crtc *crtc;
pci_power_t opregion_target_state;
int error;
+ int i;
/* ignore lid events during suspend */
mutex_lock(&dev_priv->modeset_restore_lock);
@@ -602,6 +604,16 @@ static int i915_drm_suspend(struct drm_device *dev)
return error;
}
+ /*
+ * Clear any pending reset requests. They should be picked up
+ * after resume when new work is submitted
+ */
+ for (i = 0; i < I915_NUM_RINGS; i++)
+ atomic_set(&dev_priv->ring[i].hangcheck.flags, 0);
+
+ atomic_clear_mask(I915_RESET_IN_PROGRESS_FLAG,
+ &dev_priv->gpu_error.reset_counter);
+
intel_suspend_gt_powersave(dev);
/*
@@ -905,6 +917,192 @@ int i915_reset(struct drm_device *dev)
return 0;
}
+/**
+ * i915_reset_engine - reset GPU engine after a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected. Returns zero on successful
+ * reset or otherwise an error code.
+ *
+ * Procedure is fairly simple:
+ *
+ * - Force engine to idle.
+ *
+ * - Save current head register value and nudge it past the point of the hang in the
+ * ring buffer, which is typically the BB_START instruction of the hung batch buffer,
+ * on to the following instruction.
+ *
+ * - Reset engine.
+ *
+ * - Restore the previously saved, nudged head register value.
+ *
+ * - Re-enable engine to resume running. On gen8 this requires the previously hung
+ * context to be resubmitted to ELSP via the dedicated TDR-execlists interface.
+ *
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+ struct drm_device *dev = engine->dev;
+ struct drm_i915_private *dev_priv = dev->dev_private;
+ struct drm_i915_gem_request *current_request = NULL;
+ uint32_t head;
+ bool force_advance = false;
+ int ret = 0;
+ int err_ret = 0;
+
+ WARN_ON(!mutex_is_locked(&dev->struct_mutex));
+
+ /* Take wake lock to prevent power saving mode */
+ intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+ i915_gem_reset_ring_status(dev_priv, engine);
+
+ if (i915.enable_execlists) {
+ enum context_submission_status status =
+ intel_execlists_TDR_get_current_request(engine, NULL);
+
+ /*
+ * If the hardware and driver states do not coincide
+ * or if there for some reason is no current context
+ * in the process of being submitted then bail out and
+ * try again. Do not proceed unless we have reliable
+ * current context state information.
+ */
+ if (status != CONTEXT_SUBMISSION_STATUS_OK) {
+ ret = -EAGAIN;
+ goto reset_engine_error;
+ }
+ }
+
+ ret = intel_ring_disable(engine);
+ if (ret != 0) {
+ DRM_ERROR("Failed to disable %s\n", engine->name);
+ goto reset_engine_error;
+ }
+
+ if (i915.enable_execlists) {
+ enum context_submission_status status;
+ bool inconsistent;
+
+ status = intel_execlists_TDR_get_current_request(engine,
+ &current_request);
+
+ inconsistent = (status != CONTEXT_SUBMISSION_STATUS_OK);
+ if (inconsistent) {
+ /*
+ * If we somehow have reached this point with
+ * an inconsistent context submission status then
+ * back out of the previously requested reset and
+ * retry later.
+ */
+ WARN(inconsistent,
+ "Inconsistent context status on %s: %u\n",
+ engine->name, status);
+
+ ret = -EAGAIN;
+ goto reenable_reset_engine_error;
+ }
+ }
+
+ /* Sample the current ring head position */
+ head = I915_READ_HEAD(engine) & HEAD_ADDR;
+
+ if (head == engine->hangcheck.last_head) {
+ /*
+ * The engine has not advanced since the last
+ * time it hung so force it to advance to the
+ * next QWORD. In most cases the engine head
+ * pointer will automatically advance to the
+ * next instruction as soon as it has read the
+ * current instruction, without waiting for it
+ * to complete. This seems to be the default
+ * behaviour, however an MBOX wait inserted
+ * directly to the VCS/BCS engines does not behave
+ * in the same way, instead the head pointer
+ * will still be pointing at the MBOX instruction
+ * until it completes.
+ */
+ force_advance = true;
+ }
+
+ engine->hangcheck.last_head = head;
+
+ ret = intel_ring_save(engine, current_request, force_advance);
+ if (ret) {
+ DRM_ERROR("Failed to save %s engine state\n", engine->name);
+ goto reenable_reset_engine_error;
+ }
+
+ ret = intel_gpu_engine_reset(engine);
+ if (ret) {
+ DRM_ERROR("Failed to reset %s\n", engine->name);
+ goto reenable_reset_engine_error;
+ }
+
+ ret = intel_ring_restore(engine, current_request);
+ if (ret) {
+ DRM_ERROR("Failed to restore %s engine state\n", engine->name);
+ goto reenable_reset_engine_error;
+ }
+
+ /* Correct driver state */
+ intel_gpu_engine_reset_resample(engine, current_request);
+
+ /*
+ * Reenable engine
+ *
+ * In execlist mode on gen8+ this is implicit by simply resubmitting
+ * the previously hung context. In ring buffer submission mode on gen7
+ * and earlier we need to actively turn on the engine first.
+ */
+ if (i915.enable_execlists)
+ intel_execlists_TDR_context_resubmission(engine);
+ else
+ ret = intel_ring_enable(engine);
+
+ if (ret) {
+ DRM_ERROR("Failed to enable %s again after reset\n",
+ engine->name);
+
+ goto reset_engine_error;
+ }
+
+ /* Clear reset flags to allow future hangchecks */
+ atomic_set(&engine->hangcheck.flags, 0);
+
+ /* Wake up anything waiting on this engine's queue */
+ wake_up_all(&engine->irq_queue);
+
+ if (i915.enable_execlists && current_request)
+ i915_gem_request_unreference(current_request);
+
+ intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+
+ return ret;
+
+reenable_reset_engine_error:
+
+ err_ret = intel_ring_enable(engine);
+ if (err_ret)
+ DRM_ERROR("Failed to reenable %s following error during reset (%d)\n",
+ engine->name, err_ret);
+
+reset_engine_error:
+
+ /* Clear reset flags to allow future hangchecks */
+ atomic_set(&engine->hangcheck.flags, 0);
+
+ /* Wake up anything waiting on this engine's queue */
+ wake_up_all(&engine->irq_queue);
+
+ if (i915.enable_execlists && current_request)
+ i915_gem_request_unreference(current_request);
+
+ intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+
+ return ret;
+}
+
static int i915_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
struct intel_device_info *intel_info =
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index c32c502..be4c95c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2280,6 +2280,48 @@ struct drm_i915_cmd_table {
int count;
};
+/*
+ * Context submission status
+ *
+ * CONTEXT_SUBMISSION_STATUS_OK:
+ * Context submitted to ELSP and state of execlist queue is the same as
+ * the state of EXECLIST_STATUS register. Software and hardware states
+ * are consistent and can be trusted.
+ *
+ * CONTEXT_SUBMISSION_STATUS_INCONSISTENT:
+ * Context has been submitted to the execlist queue but the state of the
+ * EXECLIST_STATUS register is different from the execlist queue state.
+ * This could mean any of the following:
+ *
+ * 1. The context is in the head position of the execlist queue
+ * but has not yet been submitted to ELSP.
+ *
+ * 2. The hardware just recently completed the context but the
+ * context is pending removal from the execlist queue.
+ *
+ * 3. The driver has lost a context state transition interrupt.
+ * Typically what this means is that hardware has completed and
+ * is now idle but the driver thinks the hardware is still
+ * busy.
+ *
+ * Overall what this means is that the context submission status is
+ * currently in transition and cannot be trusted until it settles down.
+ *
+ * CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED:
+ * No context submitted to the execlist queue and the EXECLIST_STATUS
+ * register shows no context being processed.
+ *
+ * CONTEXT_SUBMISSION_STATUS_UNDEFINED:
+ * Initial state before submission status has been determined.
+ *
+ */
+enum context_submission_status {
+ CONTEXT_SUBMISSION_STATUS_OK = 0,
+ CONTEXT_SUBMISSION_STATUS_INCONSISTENT,
+ CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED,
+ CONTEXT_SUBMISSION_STATUS_UNDEFINED
+};
+
/* Note that the (struct drm_i915_private *) cast is just to shut up gcc. */
#define __I915__(p) ({ \
struct drm_i915_private *__p; \
@@ -2478,6 +2520,7 @@ struct i915_params {
int enable_ips;
int invert_brightness;
int enable_cmd_parser;
+ unsigned int gpu_reset_promotion_time;
/* leave bools at the end to not create holes */
bool enable_hangcheck;
bool fastboot;
@@ -2508,18 +2551,34 @@ extern long i915_compat_ioctl(struct file *filp, unsigned int cmd,
unsigned long arg);
#endif
extern int intel_gpu_reset(struct drm_device *dev);
+extern int intel_gpu_engine_reset(struct intel_engine_cs *engine);
+extern int intel_request_gpu_engine_reset(struct intel_engine_cs *engine);
+extern int intel_unrequest_gpu_engine_reset(struct intel_engine_cs *engine);
extern int i915_reset(struct drm_device *dev);
+extern int i915_reset_engine(struct intel_engine_cs *engine);
extern unsigned long i915_chipset_val(struct drm_i915_private *dev_priv);
extern unsigned long i915_mch_val(struct drm_i915_private *dev_priv);
extern unsigned long i915_gfx_val(struct drm_i915_private *dev_priv);
extern void i915_update_gfx_val(struct drm_i915_private *dev_priv);
int vlv_force_gfx_clock(struct drm_i915_private *dev_priv, bool on);
void intel_hpd_cancel_work(struct drm_i915_private *dev_priv);
+static inline void i915_hangcheck_reinit(struct intel_engine_cs *engine)
+{
+ struct intel_ring_hangcheck *hc = &engine->hangcheck;
+
+ hc->acthd = 0;
+ hc->max_acthd = 0;
+ hc->seqno = 0;
+ hc->score = 0;
+ hc->action = HANGCHECK_IDLE;
+ hc->deadlock = 0;
+}
+
/* i915_irq.c */
void i915_queue_hangcheck(struct drm_device *dev);
-__printf(3, 4)
-void i915_handle_error(struct drm_device *dev, bool wedged,
+__printf(4, 5)
+void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
const char *fmt, ...);
extern void intel_irq_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 4e8e722..e869823 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2312,10 +2312,70 @@ static void i915_reset_and_wakeup(struct drm_device *dev)
char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
- int ret;
+ bool reset_complete = false;
+ struct intel_engine_cs *ring;
+ int ret = 0;
+ int i;
+
+ mutex_lock(&dev->struct_mutex);
kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, error_event);
+ for_each_ring(ring, dev_priv, i) {
+
+ /*
+ * Skip further individual engine reset requests if full GPU
+ * reset requested.
+ */
+ if (i915_reset_in_progress(error))
+ break;
+
+ if (atomic_read(&ring->hangcheck.flags) &
+ I915_ENGINE_RESET_IN_PROGRESS) {
+
+ if (!reset_complete)
+ kobject_uevent_env(&dev->primary->kdev->kobj,
+ KOBJ_CHANGE,
+ reset_event);
+
+ reset_complete = true;
+
+ ret = i915_reset_engine(ring);
+
+ /*
+ * Execlist mode only:
+ *
+ * -EAGAIN means that between detecting a hang (and
+ * also determining that the currently submitted
+ * context is stable and valid) and trying to recover
+ * from the hang the current context changed state.
+ * This means that we are probably not completely hung
+ * after all. Just fail and retry by exiting all the
+ * way back and wait for the next hang detection. If we
+ * have a true hang on our hands then we will detect it
+ * again, otherwise we will continue like nothing
+ * happened.
+ */
+ if (ret == -EAGAIN) {
+ DRM_ERROR("Reset of %s aborted due to " \
+ "change in context submission " \
+ "state - retrying!", ring->name);
+ ret = 0;
+ }
+
+ if (ret) {
+ DRM_ERROR("Reset of %s failed! (%d)", ring->name, ret);
+
+ atomic_set_mask(I915_RESET_IN_PROGRESS_FLAG,
+ &dev_priv->gpu_error.reset_counter);
+ break;
+ }
+ }
+ }
+
+ /* The full GPU reset will grab the struct_mutex when it needs it */
+ mutex_unlock(&dev->struct_mutex);
+
/*
* Note that there's only one work item which does gpu resets, so we
* need not worry about concurrent gpu resets potentially incrementing
@@ -2328,8 +2388,13 @@ static void i915_reset_and_wakeup(struct drm_device *dev)
*/
if (i915_reset_in_progress(error) && !i915_terminally_wedged(error)) {
DRM_DEBUG_DRIVER("resetting chip\n");
- kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE,
- reset_event);
+
+ if (!reset_complete)
+ kobject_uevent_env(&dev->primary->kdev->kobj,
+ KOBJ_CHANGE,
+ reset_event);
+
+ reset_complete = true;
/*
* In most cases it's guaranteed that we get here with an RPM
@@ -2362,23 +2427,36 @@ static void i915_reset_and_wakeup(struct drm_device *dev)
*
* Since unlock operations are a one-sided barrier only,
* we need to insert a barrier here to order any seqno
- * updates before
- * the counter increment.
+ * updates before the counter increment.
+ *
+ * The increment clears I915_RESET_IN_PROGRESS_FLAG.
*/
smp_mb__before_atomic();
atomic_inc(&dev_priv->gpu_error.reset_counter);
- kobject_uevent_env(&dev->primary->kdev->kobj,
- KOBJ_CHANGE, reset_done_event);
+ /*
+ * If any per-engine resets were promoted to full GPU
+ * reset don't forget to clear those reset flags.
+ */
+ for_each_ring(ring, dev_priv, i)
+ atomic_set(&ring->hangcheck.flags, 0);
} else {
+ /* Terminal wedge condition */
+ WARN(1, "i915_reset failed, declaring GPU as wedged!\n");
atomic_set_mask(I915_WEDGED, &error->reset_counter);
}
+ }
- /*
- * Note: The wake_up also serves as a memory barrier so that
- * waiters see the update value of the reset counter atomic_t.
- */
+ /*
+ * Note: The wake_up also serves as a memory barrier so that
+ * waiters see the update value of the reset counter atomic_t.
+ */
+ if (reset_complete) {
i915_error_wake_up(dev_priv, true);
+
+ if (ret == 0)
+ kobject_uevent_env(&dev->primary->kdev->kobj,
+ KOBJ_CHANGE, reset_done_event);
}
}
@@ -2476,21 +2554,42 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
/**
* i915_handle_error - handle a gpu error
- * @dev: drm device
*
- * Do some basic checking of regsiter state at error time and
+ * @dev: drm device
+ *
+ * @engine_mask: Bit mask containing the engine flags of all engines
+ * associated with one or more detected errors.
+ * May be 0x0.
+ *
+ * If wedged is set to true this implies that one or more
+ * engine hangs were detected. In this case we will
+ * attempt to reset all engines that have been detected
+ * as hung.
+ *
+ * If a previous engine reset was attempted too recently
+ * or if one of the current engine resets fails we fall
+ * back to legacy full GPU reset.
+ *
+ * @wedged: true = Hang detected, invoke hang recovery.
+ * @fmt, ...: Error message describing reason for error.
+ *
+ * Do some basic checking of register state at error time and
* dump it to the syslog. Also call i915_capture_error_state() to make
* sure we get a record and make it available in debugfs. Fire a uevent
* so userspace knows something bad happened (should trigger collection
- * of a ring dump etc.).
+ * of a ring dump etc.). If a hang was detected (wedged = true) try to
+ * reset the associated engine. Failing that, try to fall back to legacy
+ * full GPU reset recovery mode.
*/
-void i915_handle_error(struct drm_device *dev, bool wedged,
+void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
const char *fmt, ...)
{
struct drm_i915_private *dev_priv = dev->dev_private;
va_list args;
char error_msg[80];
+ struct intel_engine_cs *engine;
+
va_start(args, fmt);
vscnprintf(error_msg, sizeof(error_msg), fmt, args);
va_end(args);
@@ -2499,8 +2598,59 @@ void i915_handle_error(struct drm_device *dev, bool wedged,
i915_report_and_clear_eir(dev);
if (wedged) {
- atomic_set_mask(I915_RESET_IN_PROGRESS_FLAG,
- &dev_priv->gpu_error.reset_counter);
+ /*
+ * Defer to full GPU reset if any of the following is true:
+ * 1. The caller did not ask for per-engine reset.
+ * 2. The hardware does not support it (pre-gen7).
+ * 3. We already tried per-engine reset recently.
+ */
+ bool full_reset = true;
+
+ /*
+ * TBD: We currently only support per-engine reset for gen8+.
+ * Implement support for gen7.
+ */
+ if (engine_mask && (INTEL_INFO(dev)->gen >= 8)) {
+ u32 i;
+
+ for_each_ring(engine, dev_priv, i) {
+ u32 now, last_engine_reset_timediff;
+
+ if (!(intel_ring_flag(engine) & engine_mask))
+ continue;
+
+ /* Measure the time since this engine was last reset */
+ now = get_seconds();
+ last_engine_reset_timediff =
+ now - engine->hangcheck.last_engine_reset_time;
+
+ full_reset = last_engine_reset_timediff <
+ i915.gpu_reset_promotion_time;
+
+ engine->hangcheck.last_engine_reset_time = now;
+
+ /*
+ * This engine was not reset too recently - go ahead
+ * with engine reset instead of falling back to full
+ * GPU reset.
+ *
+ * Flag that we want to try and reset this engine.
+ * This can still be overridden by a global
+ * reset e.g. if per-engine reset fails.
+ */
+ if (!full_reset)
+ atomic_set_mask(I915_ENGINE_RESET_IN_PROGRESS,
+ &engine->hangcheck.flags);
+ else
+ break;
+
+ } /* for_each_ring */
+ }
+
+ if (full_reset) {
+ atomic_set_mask(I915_RESET_IN_PROGRESS_FLAG,
+ &dev_priv->gpu_error.reset_counter);
+ }
/*
* Wakeup waiting processes so that the reset function
@@ -2823,7 +2973,7 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd)
*/
tmp = I915_READ_CTL(ring);
if (tmp & RING_WAIT) {
- i915_handle_error(dev, false,
+ i915_handle_error(dev, intel_ring_flag(ring), false,
"Kicking stuck wait on %s",
ring->name);
I915_WRITE_CTL(ring, tmp);
@@ -2835,7 +2985,7 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd)
default:
return HANGCHECK_HUNG;
case 1:
- i915_handle_error(dev, false,
+ i915_handle_error(dev, intel_ring_flag(ring), false,
"Kicking stuck semaphore on %s",
ring->name);
I915_WRITE_CTL(ring, tmp);
@@ -2864,7 +3014,8 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
struct drm_device *dev = dev_priv->dev;
struct intel_engine_cs *ring;
int i;
- int busy_count = 0, rings_hung = 0;
+ u32 engine_mask = 0;
+ int busy_count = 0;
bool stuck[I915_NUM_RINGS] = { 0 };
#define BUSY 1
#define KICK 5
@@ -2960,12 +3111,14 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
DRM_INFO("%s on %s\n",
stuck[i] ? "stuck" : "no progress",
ring->name);
- rings_hung++;
+
+ engine_mask |= intel_ring_flag(ring);
+ ring->hangcheck.tdr_count++;
}
}
- if (rings_hung)
- return i915_handle_error(dev, true, "Ring hung");
+ if (engine_mask)
+ i915_handle_error(dev, engine_mask, true, "Ring hung (0x%02x)", engine_mask);
if (busy_count)
/* Reset timer case chip hangs without another request
diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index bb64415..9cea004 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -50,6 +50,7 @@ struct i915_params i915 __read_mostly = {
.enable_cmd_parser = 1,
.disable_vtd_wa = 0,
.use_mmio_flip = 0,
+ .gpu_reset_promotion_time = 0,
.mmio_debug = 0,
.verbose_state_checks = 1,
.nuclear_pageflip = 0,
@@ -172,6 +173,15 @@ module_param_named(use_mmio_flip, i915.use_mmio_flip, int, 0600);
MODULE_PARM_DESC(use_mmio_flip,
"use MMIO flips (-1=never, 0=driver discretion [default], 1=always)");
+module_param_named(gpu_reset_promotion_time,
+ i915.gpu_reset_promotion_time, int, 0644);
+MODULE_PARM_DESC(gpu_reset_promotion_time,
+ "Catch excessive engine resets. Each engine maintains a "
+ "timestamp of the last time it was reset. If it hangs again "
+ "within this period then fall back to full GPU reset to try and"
+ " recover from the hang. "
+ "default=0 seconds (disabled)");
+
module_param_named(mmio_debug, i915.mmio_debug, int, 0600);
MODULE_PARM_DESC(mmio_debug,
"Enable the MMIO debug code for the first N failures (default: off). "
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 9c97842..af9f0ad 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -100,6 +100,10 @@
#define GRDOM_RESET_STATUS (1<<1)
#define GRDOM_RESET_ENABLE (1<<0)
+#define RING_RESET_CTL(ring) ((ring)->mmio_base+0xd0)
+#define READY_FOR_RESET 0x2
+#define REQUEST_RESET 0x1
+
#define ILK_GDSR 0x2ca4 /* MCHBAR offset */
#define ILK_GRDOM_FULL (0<<1)
#define ILK_GRDOM_RENDER (1<<1)
@@ -130,6 +134,8 @@
#define GEN6_GRDOM_RENDER (1 << 1)
#define GEN6_GRDOM_MEDIA (1 << 2)
#define GEN6_GRDOM_BLT (1 << 3)
+#define GEN6_GRDOM_VECS (1 << 4)
+#define GEN8_GRDOM_MEDIA2 (1 << 7)
#define RING_PP_DIR_BASE(ring) ((ring)->mmio_base+0x228)
#define RING_PP_DIR_BASE_READ(ring) ((ring)->mmio_base+0x518)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 0fc35dd..e02abec 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -135,6 +135,7 @@
#include <drm/drmP.h>
#include <drm/i915_drm.h>
#include "i915_drv.h"
+#include "intel_lrc_tdr.h"
#define GEN9_LR_CONTEXT_RENDER_SIZE (22 * PAGE_SIZE)
#define GEN8_LR_CONTEXT_RENDER_SIZE (20 * PAGE_SIZE)
@@ -330,6 +331,164 @@ static void execlists_elsp_write(struct intel_engine_cs *ring,
spin_unlock(&dev_priv->uncore.lock);
}
+/**
+ * execlist_get_context_reg_page() - Get memory page for context object
+ * @engine: engine
+ * @ctx: context running on engine
+ * @page: returned page
+ *
+ * Return: 0 if successful, otherwise propagates error codes.
+ */
+static inline int execlist_get_context_reg_page(struct intel_engine_cs *engine,
+ struct intel_context *ctx,
+ struct page **page)
+{
+ struct drm_i915_gem_object *ctx_obj;
+
+ if (!page)
+ return -EINVAL;
+
+ if (!ctx)
+ ctx = engine->default_context;
+
+ ctx_obj = ctx->engine[engine->id].state;
+
+ if (WARN(!ctx_obj, "Context object not set up!\n"))
+ return -EINVAL;
+
+ WARN(!i915_gem_obj_is_pinned(ctx_obj),
+ "Context object is not pinned!\n");
+
+ *page = i915_gem_object_get_page(ctx_obj, 1);
+
+ if (WARN(!*page, "Context object page could not be resolved!\n"))
+ return -EINVAL;
+
+ return 0;
+}
+
+/**
+ * execlist_write_context_reg() - Write value to context register
+ * @engine: engine
+ * @ctx: context running on engine
+ * @ctx_reg: Index into context image pointing to register location
+ * @mmio_reg_addr: MMIO register address
+ * @val: Value to be written
+ *
+ * Return: 0 if successful, otherwise propagates error codes.
+ */
+static inline int execlists_write_context_reg(struct intel_engine_cs *engine,
+ struct intel_context *ctx, u32 ctx_reg, u32 mmio_reg_addr,
+ u32 val)
+{
+ struct page *page = NULL;
+ uint32_t *reg_state;
+
+ int ret = execlist_get_context_reg_page(engine, ctx, &page);
+ if (WARN(ret, "Failed to write %u to register %u for %s!\n",
+ (unsigned int) val, (unsigned int) ctx_reg, engine->name))
+ return ret;
+
+ reg_state = kmap_atomic(page);
+
+ WARN(reg_state[ctx_reg] != mmio_reg_addr,
+ "Context register address (%x) != MMIO register address (%x)!\n",
+ (unsigned int) reg_state[ctx_reg], (unsigned int) mmio_reg_addr);
+
+ reg_state[ctx_reg+1] = val;
+ kunmap_atomic(reg_state);
+
+ return ret;
+}
+
+/**
+ * execlist_read_context_reg() - Read value from context register
+ * @engine: engine
+ * @ctx: context running on engine
+ * @ctx_reg: Index into context image pointing to register location
+ * @mmio_reg_addr: MMIO register address
+ * @val: Output parameter returning register value
+ *
+ * Return: 0 if successful, otherwise propagates error codes.
+ */
+static inline int execlists_read_context_reg(struct intel_engine_cs *engine,
+ struct intel_context *ctx, u32 ctx_reg, u32 mmio_reg_addr,
+ u32 *val)
+{
+ struct page *page = NULL;
+ uint32_t *reg_state;
+ int ret = 0;
+
+ if (!val)
+ return -EINVAL;
+
+ ret = execlist_get_context_reg_page(engine, ctx, &page);
+ if (WARN(ret, "Failed to read from register %u for %s!\n",
+ (unsigned int) ctx_reg, engine->name))
+ return ret;
+
+ reg_state = kmap_atomic(page);
+
+ WARN(reg_state[ctx_reg] != mmio_reg_addr,
+ "Context register address (%x) != MMIO register address (%x)!\n",
+ (unsigned int) reg_state[ctx_reg], (unsigned int) mmio_reg_addr);
+
+ *val = reg_state[ctx_reg+1];
+ kunmap_atomic(reg_state);
+
+ return ret;
+ }
+
+/*
+ * Generic macros for generating function implementation for context register
+ * read/write functions.
+ *
+ * Macro parameters
+ * ----------------
+ * reg_name: Designated name of context register (e.g. tail, head, buffer_ctl)
+ *
+ * reg_def: Context register macro definition (e.g. CTX_RING_TAIL)
+ *
+ * mmio_reg_def: Name of macro function used to determine the address
+ * of the corresponding MMIO register (e.g. RING_TAIL, RING_HEAD).
+ * This macro function is assumed to be defined on the form of:
+ *
+ * #define mmio_reg_def(base) (base+register_offset)
+ *
+ * Where "base" is the MMIO base address of the respective ring
+ * and "register_offset" is the offset relative to "base".
+ *
+ * Function parameters
+ * -------------------
+ * engine: The engine that the context is running on
+ * ctx: The context of the register that is to be accessed
+ * reg_name: Value to be written/read to/from the register.
+ */
+#define INTEL_EXECLISTS_WRITE_REG(reg_name, reg_def, mmio_reg_def) \
+ int intel_execlists_write_##reg_name(struct intel_engine_cs *engine, \
+ struct intel_context *ctx, \
+ u32 reg_name) \
+{ \
+ return execlists_write_context_reg(engine, ctx, (reg_def), \
+ mmio_reg_def(engine->mmio_base), (reg_name)); \
+}
+
+#define INTEL_EXECLISTS_READ_REG(reg_name, reg_def, mmio_reg_def) \
+ int intel_execlists_read_##reg_name(struct intel_engine_cs *engine, \
+ struct intel_context *ctx, \
+ u32 *reg_name) \
+{ \
+ return execlists_read_context_reg(engine, ctx, (reg_def), \
+ mmio_reg_def(engine->mmio_base), (reg_name)); \
+}
+
+INTEL_EXECLISTS_READ_REG(tail, CTX_RING_TAIL, RING_TAIL)
+INTEL_EXECLISTS_WRITE_REG(head, CTX_RING_HEAD, RING_HEAD)
+INTEL_EXECLISTS_READ_REG(head, CTX_RING_HEAD, RING_HEAD)
+
+#undef INTEL_EXECLISTS_READ_REG
+#undef INTEL_EXECLISTS_WRITE_REG
+
static int execlists_update_context(struct drm_i915_gem_object *ctx_obj,
struct drm_i915_gem_object *ring_obj,
struct i915_hw_ppgtt *ppgtt,
@@ -387,44 +546,93 @@ static void execlists_submit_contexts(struct intel_engine_cs *ring,
execlists_elsp_write(ring, ctx_obj0, ctx_obj1);
}
-static void execlists_context_unqueue(struct intel_engine_cs *ring)
+static void execlists_fetch_requests(struct intel_engine_cs *ring,
+ struct drm_i915_gem_request **req0,
+ struct drm_i915_gem_request **req1)
{
- struct drm_i915_gem_request *req0 = NULL, *req1 = NULL;
struct drm_i915_gem_request *cursor = NULL, *tmp = NULL;
- assert_spin_locked(&ring->execlist_lock);
-
- if (list_empty(&ring->execlist_queue))
+ if (!req0)
return;
+ *req0 = NULL;
+
+ if (req1)
+ *req1 = NULL;
+
/* Try to read in pairs */
list_for_each_entry_safe(cursor, tmp, &ring->execlist_queue,
execlist_link) {
- if (!req0) {
- req0 = cursor;
- } else if (req0->ctx == cursor->ctx) {
- /* Same ctx: ignore first request, as second request
- * will update tail past first request's workload */
- cursor->elsp_submitted = req0->elsp_submitted;
- list_del(&req0->execlist_link);
- list_add_tail(&req0->execlist_link,
+ if (!(*req0))
+ *req0 = cursor;
+ else if ((*req0)->ctx == cursor->ctx) {
+ /*
+ * Same ctx: ignore first request, as second request
+ * will update tail past first request's workload
+ */
+ cursor->elsp_submitted = (*req0)->elsp_submitted;
+ list_del(&(*req0)->execlist_link);
+ list_add_tail(&(*req0)->execlist_link,
&ring->execlist_retired_req_list);
- req0 = cursor;
+ *req0 = cursor;
} else {
- req1 = cursor;
+ if (req1)
+ *req1 = cursor;
break;
}
}
+}
- WARN_ON(req1 && req1->elsp_submitted);
+static void execlists_context_unqueue(struct intel_engine_cs *ring, bool tdr_resubmission)
+{
+ struct drm_i915_gem_request *req0 = NULL, *req1 = NULL;
+
+ assert_spin_locked(&ring->execlist_lock);
+ if (list_empty(&ring->execlist_queue))
+ return;
+
+ execlists_fetch_requests(ring, &req0, &req1);
+
+ if (tdr_resubmission && req1 && !req1->elsp_submitted)
+ req1 = NULL;
+
+ WARN_ON(req1 && req1->elsp_submitted && !tdr_resubmission);
execlists_submit_contexts(ring, req0->ctx, req0->tail,
req1 ? req1->ctx : NULL,
req1 ? req1->tail : 0);
- req0->elsp_submitted++;
- if (req1)
- req1->elsp_submitted++;
+ if (!tdr_resubmission) {
+ req0->elsp_submitted++;
+ if (req1)
+ req1->elsp_submitted++;
+ }
+}
+
+/**
+ * intel_execlists_TDR_context_resubmission() - ELSP context resubmission
+ * bypassing queue.
+ *
+ * Context submission mechanism exclusively used by TDR that bypasses the
+ * execlist queue. This is necessary since at the point of TDR hang recovery
+ * the hardware will be hung and resubmitting a fixed context (the context that
+ * the TDR has identified as hung and fixed up in order to move past the
+ * blocking batch buffer) to a hung execlist queue will lock up the TDR.
+ * Instead, opt for direct ELSP submission without depending on the rest of the
+ * driver.
+ *
+ * @ring: engine to do resubmission for.
+ *
+ */
+void intel_execlists_TDR_context_resubmission(struct intel_engine_cs *ring)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&ring->execlist_lock, flags);
+ WARN_ON(list_empty(&ring->execlist_queue));
+
+ execlists_context_unqueue(ring, true);
+ spin_unlock_irqrestore(&ring->execlist_lock, flags);
}
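The pairing rule that execlists_fetch_requests() implements can be exercised in isolation. Below is a standalone sketch (fetch_pair is a hypothetical helper, with context IDs standing in for requests): requests sharing a context are coalesced into ELSP slot 0, where the later request supersedes the earlier one, and the first request with a different context takes slot 1.

```c
#include <stddef.h>

/*
 * Mirror of the pairing rules in execlists_fetch_requests():
 * walk the queue in submission order, coalescing same-context
 * entries into slot 0 and stopping at the first distinct
 * context, which fills slot 1. Slots are -1 when unfilled.
 */
static void fetch_pair(const int *ctx_ids, int n, int *slot0, int *slot1)
{
	int i;

	*slot0 = *slot1 = -1;
	for (i = 0; i < n; i++) {
		if (*slot0 < 0 || ctx_ids[i] == ctx_ids[*slot0]) {
			*slot0 = i;	/* coalesce: later request wins */
		} else {
			*slot1 = i;	/* second distinct context */
			break;
		}
	}
}
```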
static bool execlists_check_remove_request(struct intel_engine_cs *ring,
@@ -506,7 +714,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
}
if (submit_contexts != 0)
- execlists_context_unqueue(ring);
+ execlists_context_unqueue(ring, false);
spin_unlock(&ring->execlist_lock);
@@ -570,7 +778,7 @@ static int execlists_context_queue(struct intel_engine_cs *ring,
list_add_tail(&request->execlist_link, &ring->execlist_queue);
if (num_elements == 0)
- execlists_context_unqueue(ring);
+ execlists_context_unqueue(ring, false);
spin_unlock_irq(&ring->execlist_lock);
@@ -1066,7 +1274,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *ring)
ring->next_context_status_buffer = 0;
DRM_DEBUG_DRIVER("Execlists enabled for %s\n", ring->name);
- memset(&ring->hangcheck, 0, sizeof(ring->hangcheck));
+ i915_hangcheck_reinit(ring);
return 0;
}
@@ -1314,6 +1522,173 @@ out:
return ret;
}
+static int
+gen8_ring_disable(struct intel_engine_cs *ring)
+{
+ intel_request_gpu_engine_reset(ring);
+ return 0;
+}
+
+static int
+gen8_ring_enable(struct intel_engine_cs *ring)
+{
+ intel_unrequest_gpu_engine_reset(ring);
+ return 0;
+}
+
+/*
+ * gen8_ring_save()
+ *
+ * Saves the head MMIO register to scratch memory while engine is reset and
+ * reinitialized. Before saving the head register we nudge the head position to
+ * be correctly aligned with a QWORD boundary, which brings it up to the next
+ * presumably valid instruction. Typically, at the point of hang recovery the
+ * head register will be pointing to the last DWORD of the BB_START
+ * instruction, which is followed by a padding MI_NOOP inserted by the
+ * driver.
+ *
+ * ring: engine to be reset
+ * req: request containing the context currently running on engine
+ * force_advance: whether or not to nudge the head forward
+ */
+static int
+gen8_ring_save(struct intel_engine_cs *ring, struct drm_i915_gem_request *req,
+ bool force_advance)
+{
+ struct drm_i915_private *dev_priv = ring->dev->dev_private;
+ struct intel_ringbuffer *ringbuf = NULL;
+ struct intel_context *ctx;
+ int ret = 0;
+ int clamp_to_tail = 0;
+ uint32_t head;
+ uint32_t tail;
+ uint32_t head_addr;
+ uint32_t tail_addr;
+
+ if (WARN_ON(!req))
+ return -EINVAL;
+
+ ctx = req->ctx;
+ ringbuf = ctx->engine[ring->id].ringbuf;
+
+ /*
+ * Read head from MMIO register since it contains the
+ * most up to date value of head at this point.
+ */
+ head = I915_READ_HEAD(ring);
+
+ /*
+ * Read tail from the context because the execlist queue
+ * updates the tail value there first during submission.
+ * The MMIO tail register is not updated until the actual
+ * ring submission completes.
+ */
+ ret = I915_READ_TAIL_CTX(ring, ctx, tail);
+ if (ret)
+ return ret;
+
+ /*
+ * head_addr and tail_addr are the head and tail values
+ * excluding ring wrapping information and aligned to DWORD
+ * boundary
+ */
+ head_addr = head & HEAD_ADDR;
+ tail_addr = tail & TAIL_ADDR;
+
+ /*
+ * The head must always chase the tail.
+ * If the tail is beyond the head then do not allow
+ * the head to overtake it. If the tail is less than
+ * the head then the tail has already wrapped and
+ * there is no problem in advancing the head or even
+ * wrapping the head back to 0 as worst case it will
+ * become equal to tail
+ */
+ if (head_addr <= tail_addr)
+ clamp_to_tail = 1;
+
+ if (force_advance) {
+
+ /* Force head pointer to next QWORD boundary */
+ head_addr &= ~0x7;
+ head_addr += 8;
+
+ } else if (head & 0x7) {
+
+ /* Ensure head pointer is pointing to a QWORD boundary */
+ head += 0x7;
+ head &= ~0x7;
+ head_addr = head;
+ }
+
+ if (clamp_to_tail && (head_addr > tail_addr)) {
+ head_addr = tail_addr;
+ } else if (head_addr >= ringbuf->size) {
+ /* Wrap head back to start if it exceeds ring size */
+ head_addr = 0;
+ }
+
+ head &= ~HEAD_ADDR;
+ head |= (head_addr & HEAD_ADDR);
+ ring->saved_head = head;
+
+ return 0;
+}
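The head-nudging arithmetic above is easy to test on its own. Here is a minimal sketch (align_head is a hypothetical standalone helper; the constants mirror the function body, not any driver API): force_advance pushes the head to the next QWORD boundary unconditionally, otherwise the head is only rounded up when it is not already QWORD-aligned, and the result is clamped to tail when the head has not wrapped, or wrapped to 0 if it runs past the ring.

```c
#include <stdint.h>

/* Sketch of gen8_ring_save()'s head adjustment, operating on the
 * address bits only (wrap information stripped by the caller). */
static uint32_t align_head(uint32_t head, uint32_t tail, uint32_t size,
			   int force_advance)
{
	/* Head must chase tail; only clamp if tail has not wrapped. */
	int clamp_to_tail = (head <= tail);

	if (force_advance) {
		/* Force head to the next QWORD boundary */
		head &= ~0x7u;
		head += 8;
	} else if (head & 0x7) {
		/* Round up to a QWORD boundary */
		head += 0x7;
		head &= ~0x7u;
	}

	if (clamp_to_tail && head > tail)
		head = tail;
	else if (head >= size)
		head = 0;	/* wrap back to start */

	return head;
}
```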
+
+static int
+gen8_ring_restore(struct intel_engine_cs *ring, struct drm_i915_gem_request *req)
+{
+ struct drm_i915_private *dev_priv = ring->dev->dev_private;
+ struct intel_context *ctx;
+
+ if (WARN_ON(!req))
+ return -EINVAL;
+
+ ctx = req->ctx;
+
+ /* Re-initialize ring */
+ if (ring->init_hw) {
+ int ret = ring->init_hw(ring);
+ if (ret != 0) {
+ DRM_ERROR("Failed to re-initialize %s\n",
+ ring->name);
+ return ret;
+ }
+ } else {
+ DRM_ERROR("ring init function pointer not set up\n");
+ return -EINVAL;
+ }
+
+ if (ring->id == RCS) {
+ /*
+ * These register reinitializations are only located here
+ * temporarily until they are moved out of the
+ * init_clock_gating function to some function we can
+ * call from here.
+ */
+
+ /* WaVSRefCountFullforceMissDisable:chv */
+ /* WaDSRefCountFullforceMissDisable:chv */
+ I915_WRITE(GEN7_FF_THREAD_MODE,
+ I915_READ(GEN7_FF_THREAD_MODE) &
+ ~(GEN8_FF_DS_REF_CNT_FFME | GEN7_FF_VS_REF_CNT_FFME));
+
+ I915_WRITE(_3D_CHICKEN3,
+ _3D_CHICKEN_SDE_LIMIT_FIFO_POLY_DEPTH(2));
+
+ /* WaSwitchSolVfFArbitrationPriority:bdw */
+ I915_WRITE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) | HSW_ECOCHK_ARB_PRIO_SOL);
+ }
+
+ /* Restore head */
+
+ I915_WRITE_HEAD(ring, ring->saved_head);
+ I915_WRITE_HEAD_CTX(ring, ctx, ring->saved_head);
+
+ return 0;
+}
+
static int gen8_init_rcs_context(struct intel_engine_cs *ring,
struct intel_context *ctx)
{
@@ -1412,6 +1787,10 @@ static int logical_render_ring_init(struct drm_device *dev)
ring->irq_get = gen8_logical_ring_get_irq;
ring->irq_put = gen8_logical_ring_put_irq;
ring->emit_bb_start = gen8_emit_bb_start;
+ ring->enable = gen8_ring_enable;
+ ring->disable = gen8_ring_disable;
+ ring->save = gen8_ring_save;
+ ring->restore = gen8_ring_restore;
ring->dev = dev;
ret = logical_ring_init(dev, ring);
@@ -1442,6 +1821,10 @@ static int logical_bsd_ring_init(struct drm_device *dev)
ring->irq_get = gen8_logical_ring_get_irq;
ring->irq_put = gen8_logical_ring_put_irq;
ring->emit_bb_start = gen8_emit_bb_start;
+ ring->enable = gen8_ring_enable;
+ ring->disable = gen8_ring_disable;
+ ring->save = gen8_ring_save;
+ ring->restore = gen8_ring_restore;
return logical_ring_init(dev, ring);
}
@@ -1467,6 +1850,10 @@ static int logical_bsd2_ring_init(struct drm_device *dev)
ring->irq_get = gen8_logical_ring_get_irq;
ring->irq_put = gen8_logical_ring_put_irq;
ring->emit_bb_start = gen8_emit_bb_start;
+ ring->enable = gen8_ring_enable;
+ ring->disable = gen8_ring_disable;
+ ring->save = gen8_ring_save;
+ ring->restore = gen8_ring_restore;
return logical_ring_init(dev, ring);
}
@@ -1492,6 +1879,10 @@ static int logical_blt_ring_init(struct drm_device *dev)
ring->irq_get = gen8_logical_ring_get_irq;
ring->irq_put = gen8_logical_ring_put_irq;
ring->emit_bb_start = gen8_emit_bb_start;
+ ring->enable = gen8_ring_enable;
+ ring->disable = gen8_ring_disable;
+ ring->save = gen8_ring_save;
+ ring->restore = gen8_ring_restore;
return logical_ring_init(dev, ring);
}
@@ -1517,6 +1908,10 @@ static int logical_vebox_ring_init(struct drm_device *dev)
ring->irq_get = gen8_logical_ring_get_irq;
ring->irq_put = gen8_logical_ring_put_irq;
ring->emit_bb_start = gen8_emit_bb_start;
+ ring->enable = gen8_ring_enable;
+ ring->disable = gen8_ring_disable;
+ ring->save = gen8_ring_save;
+ ring->restore = gen8_ring_restore;
return logical_ring_init(dev, ring);
}
@@ -1974,3 +2369,120 @@ void intel_lr_context_reset(struct drm_device *dev,
ringbuf->tail = 0;
}
}
+
+/**
+ * intel_execlists_TDR_get_current_request() - return request currently
+ * processed by engine
+ *
+ * @ring: Engine currently running context to be returned.
+ *
+ * @req: Output parameter containing the current request (the request at the
+ * head of execlist queue corresponding to the given ring). May be NULL
+ * if no request has been submitted to the execlist queue of this
+ * engine. If the req parameter passed in to the function is not NULL
+ * and a request is found and returned the request is referenced before
+ * it is returned. It is the responsibility of the caller to release
+ * that reference at the end of the request's life cycle.
+ *
+ * Return:
+ * CONTEXT_SUBMISSION_STATUS_OK if request is found to be submitted and its
+ * context is currently running on engine.
+ *
+ * CONTEXT_SUBMISSION_STATUS_INCONSISTENT if request is found to be submitted
+ * but its context is not in a state that is consistent with current
+ * hardware state for the given engine. This has been observed in three cases:
+ *
+ * 1. Before the engine has switched to this context after it has
+ * been submitted to the execlist queue.
+ *
+ * 2. After the engine has switched away from this context but
+ * before the context has been removed from the execlist queue.
+ *
+ * 3. The driver has lost an interrupt. Typically the hardware has
+ * gone to idle but the driver still thinks the context belonging to
+ * the request at the head of the queue is still executing.
+ *
+ * CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED if no context has been found
+ * to be submitted to the execlist queue and if the hardware is idle.
+ */
+enum context_submission_status
+intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
+ struct drm_i915_gem_request **req)
+{
+ struct drm_i915_private *dev_priv;
+ unsigned long flags;
+ struct drm_i915_gem_request *tmpreq = NULL;
+ struct intel_context *tmpctx = NULL;
+ unsigned hw_context = 0;
+ bool hw_active = false;
+ enum context_submission_status status =
+ CONTEXT_SUBMISSION_STATUS_UNDEFINED;
+
+ if (WARN_ON(!ring))
+ return status;
+
+ dev_priv = ring->dev->dev_private;
+
+ intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+ spin_lock_irqsave(&ring->execlist_lock, flags);
+ hw_context = I915_READ(RING_EXECLIST_STATUS_CTX_ID(ring));
+
+ hw_active = (I915_READ(RING_EXECLIST_STATUS(ring)) &
+ EXECLIST_STATUS_CURRENT_ACTIVE_ELEMENT_STATUS) ? true : false;
+
+ tmpreq = list_first_entry_or_null(&ring->execlist_queue,
+ struct drm_i915_gem_request, execlist_link);
+
+ if (tmpreq) {
+ /*
+ * If the caller passed a NULL req parameter it is not
+ * interested in getting a request reference back, so don't
+ * grab one: holding the execlist lock is enough to ensure
+ * that the execlist code holds its own reference throughout
+ * this function, making another reference unnecessary.
+ * This matters because certain callers, such as the TDR hang
+ * checker, cannot grab struct_mutex before calling and
+ * therefore must not take request references (DRM might
+ * assert if we do). In that case just rely on the execlist
+ * code to provide indirect protection.
+ */
+ if (req)
+ i915_gem_request_reference(tmpreq);
+
+ if (tmpreq->ctx)
+ tmpctx = tmpreq->ctx;
+
+ WARN(!tmpctx, "No context in request %p\n", tmpreq);
+ }
+
+ if (tmpctx) {
+ unsigned sw_context =
+ intel_execlists_ctx_id((tmpctx)->engine[ring->id].state);
+
+ status = ((hw_context == sw_context) && hw_active) ?
+ CONTEXT_SUBMISSION_STATUS_OK :
+ CONTEXT_SUBMISSION_STATUS_INCONSISTENT;
+ } else {
+ /*
+ * If we don't have any queue entries and the
+ * EXECLIST_STATUS register points to zero we are
+ * clearly not processing any context right now
+ */
+ WARN((hw_context || hw_active), "hw_context=%x, hardware %s!\n",
+ hw_context, hw_active ? "not idle":"idle");
+
+ status = (hw_context || hw_active) ?
+ CONTEXT_SUBMISSION_STATUS_INCONSISTENT :
+ CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED;
+ }
+
+ if (req)
+ *req = tmpreq;
+
+ spin_unlock_irqrestore(&ring->execlist_lock, flags);
+ intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+
+ return status;
+}
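The status classification at the end of intel_execlists_TDR_get_current_request() reduces to a small decision table. A standalone sketch (names simplified, hypothetical helper): with a queued request, OK requires both a context-ID match and an active element; without one, any non-zero context ID or activity is inconsistent.

```c
enum submission_status { STATUS_OK, STATUS_INCONSISTENT, STATUS_NONE };

/*
 * have_req:   a request sits at the head of the software execlist queue
 * ctx_match:  EXECLIST_STATUS context ID equals that request's ID
 * hw_ctx:     raw context ID read from EXECLIST_STATUS
 * hw_active:  current-active-element bits set in EXECLIST_STATUS
 */
static enum submission_status classify(int have_req, int ctx_match,
				       unsigned int hw_ctx, int hw_active)
{
	if (have_req)
		return (ctx_match && hw_active) ?
			STATUS_OK : STATUS_INCONSISTENT;

	/* No queued request: any sign of hardware activity is bad. */
	return (hw_ctx || hw_active) ?
		STATUS_INCONSISTENT : STATUS_NONE;
}
```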
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 04d3a6d..d2f497c 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -29,6 +29,8 @@
/* Execlists regs */
#define RING_ELSP(ring) ((ring)->mmio_base+0x230)
#define RING_EXECLIST_STATUS(ring) ((ring)->mmio_base+0x234)
+#define EXECLIST_STATUS_CURRENT_ACTIVE_ELEMENT_STATUS (0x3 << 14)
+#define RING_EXECLIST_STATUS_CTX_ID(ring) (RING_EXECLIST_STATUS(ring)+4)
#define RING_CONTEXT_CONTROL(ring) ((ring)->mmio_base+0x244)
#define CTX_CTRL_INHIBIT_SYN_CTX_SWITCH (1 << 3)
#define CTX_CTRL_ENGINE_CTX_RESTORE_INHIBIT (1 << 0)
@@ -89,4 +91,16 @@ u32 intel_execlists_ctx_id(struct drm_i915_gem_object *ctx_obj);
void intel_lrc_irq_handler(struct intel_engine_cs *ring);
void intel_execlists_retire_requests(struct intel_engine_cs *ring);
+int intel_execlists_read_tail(struct intel_engine_cs *ring,
+ struct intel_context *ctx,
+ u32 *tail);
+
+int intel_execlists_write_head(struct intel_engine_cs *ring,
+ struct intel_context *ctx,
+ u32 head);
+
+int intel_execlists_read_head(struct intel_engine_cs *ring,
+ struct intel_context *ctx,
+ u32 *head);
+
#endif /* _INTEL_LRC_H_ */
diff --git a/drivers/gpu/drm/i915/intel_lrc_tdr.h b/drivers/gpu/drm/i915/intel_lrc_tdr.h
new file mode 100644
index 0000000..4520753
--- /dev/null
+++ b/drivers/gpu/drm/i915/intel_lrc_tdr.h
@@ -0,0 +1,36 @@
+/*
+ * Copyright © 2015 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ */
+
+#ifndef _INTEL_LRC_TDR_H_
+#define _INTEL_LRC_TDR_H_
+
+/* Privileged execlist API used exclusively by TDR */
+
+void intel_execlists_TDR_context_resubmission(struct intel_engine_cs *ring);
+
+enum context_submission_status
+intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
+ struct drm_i915_gem_request **req);
+
+#endif /* _INTEL_LRC_TDR_H_ */
+
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index f949583..0fdf983 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -442,6 +442,88 @@ static void ring_write_tail(struct intel_engine_cs *ring,
I915_WRITE_TAIL(ring, value);
}
+int intel_ring_disable(struct intel_engine_cs *ring)
+{
+ WARN_ON(!ring);
+
+ if (ring && ring->disable)
+ return ring->disable(ring);
+ else {
+ DRM_ERROR("Ring disable not supported on %s\n", ring->name);
+ return -EINVAL;
+ }
+}
+
+int intel_ring_enable(struct intel_engine_cs *ring)
+{
+ WARN_ON(!ring);
+
+ if (ring && ring->enable)
+ return ring->enable(ring);
+ else {
+ DRM_ERROR("Ring enable not supported on %s\n", ring->name);
+ return -EINVAL;
+ }
+}
+
+int intel_ring_save(struct intel_engine_cs *ring,
+ struct drm_i915_gem_request *req,
+ bool force_advance)
+{
+ WARN_ON(!ring);
+
+ if (ring && ring->save)
+ return ring->save(ring, req, force_advance);
+ else {
+ DRM_ERROR("Ring save not supported on %s\n", ring->name);
+ return -EINVAL;
+ }
+}
+
+int intel_ring_restore(struct intel_engine_cs *ring,
+ struct drm_i915_gem_request *req)
+{
+ WARN_ON(!ring);
+
+ if (ring && ring->restore)
+ return ring->restore(ring, req);
+ else {
+ DRM_ERROR("Ring restore not supported on %s\n", ring->name);
+ return -EINVAL;
+ }
+}
+
+void intel_gpu_engine_reset_resample(struct intel_engine_cs *ring,
+ struct drm_i915_gem_request *req)
+{
+ struct intel_ringbuffer *ringbuf;
+ struct drm_i915_private *dev_priv;
+
+ if (WARN_ON(!ring))
+ return;
+
+ dev_priv = ring->dev->dev_private;
+
+ if (i915.enable_execlists) {
+ struct intel_context *ctx;
+
+ if (WARN_ON(!req))
+ return;
+
+ ctx = req->ctx;
+ ringbuf = ctx->engine[ring->id].ringbuf;
+
+ /*
+ * In gen8+ context head is restored during reset and
+ * we can use it as a reference to set up the new
+ * driver state.
+ */
+ I915_READ_HEAD_CTX(ring, ctx, ringbuf->head);
+ ringbuf->last_retired_head = -1;
+ intel_ring_update_space(ringbuf);
+ }
+}
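After the head is resampled from the context image, intel_ring_update_space() recomputes the free space between tail and head. As an illustration only (the reserve size here is a made-up placeholder, not the driver's constant), with a power-of-two ring the computation can be done with a single masked subtraction:

```c
#include <stdint.h>

#define RING_RESERVE 8u	/* illustrative reserve so tail never meets head */

/* Free bytes from tail up to head, modulo a power-of-two ring size,
 * minus a small reserve. Handles head < tail (wrapped) implicitly. */
static uint32_t ring_space(uint32_t head, uint32_t tail, uint32_t size)
{
	return (head - tail - RING_RESERVE) & (size - 1);
}
```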
+
u64 intel_ring_get_active_head(struct intel_engine_cs *ring)
{
struct drm_i915_private *dev_priv = ring->dev->dev_private;
@@ -637,7 +719,7 @@ static int init_ring_common(struct intel_engine_cs *ring)
ringbuf->tail = I915_READ_TAIL(ring) & TAIL_ADDR;
intel_ring_update_space(ringbuf);
- memset(&ring->hangcheck, 0, sizeof(ring->hangcheck));
+ i915_hangcheck_reinit(ring);
out:
intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 39f6dfc..35360a4 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -48,6 +48,22 @@ struct intel_hw_status_page {
#define I915_READ_MODE(ring) I915_READ(RING_MI_MODE((ring)->mmio_base))
#define I915_WRITE_MODE(ring, val) I915_WRITE(RING_MI_MODE((ring)->mmio_base), val)
+
+#define I915_READ_TAIL_CTX(engine, ctx, outval) \
+ intel_execlists_read_tail((engine), \
+ (ctx), \
+ &(outval));
+
+#define I915_READ_HEAD_CTX(engine, ctx, outval) \
+ intel_execlists_read_head((engine), \
+ (ctx), \
+ &(outval));
+
+#define I915_WRITE_HEAD_CTX(engine, ctx, val) \
+ intel_execlists_write_head((engine), \
+ (ctx), \
+ (val));
+
/* seqno size is actually only a uint32, but since we plan to use MI_FLUSH_DW to
* do the writes, and that must have qw aligned offsets, simply pretend it's 8b.
*/
@@ -92,6 +108,34 @@ struct intel_ring_hangcheck {
int score;
enum intel_ring_hangcheck_action action;
int deadlock;
+
+ /*
+ * Last recorded ring head index.
+ * This is only ever a ring index, whereas the active
+ * head may be a graphics address in a ring buffer.
+ */
+ u32 last_head;
+
+ /* Flag to indicate if engine reset required */
+ atomic_t flags;
+
+ /* Indicates request to reset this engine */
+#define I915_ENGINE_RESET_IN_PROGRESS (1<<0)
+
+ /*
+ * Timestamp (seconds) of the last time
+ * this engine was reset.
+ */
+ u32 last_engine_reset_time;
+
+ /*
+ * Number of times this engine has been
+ * reset since boot
+ */
+ u32 reset_count;
+
+ /* Number of TDR hang detections */
+ u32 tdr_count;
};
struct intel_ringbuffer {
@@ -177,6 +221,14 @@ struct intel_engine_cs {
#define I915_DISPATCH_PINNED 0x2
void (*cleanup)(struct intel_engine_cs *ring);
+ int (*enable)(struct intel_engine_cs *ring);
+ int (*disable)(struct intel_engine_cs *ring);
+ int (*save)(struct intel_engine_cs *ring,
+ struct drm_i915_gem_request *req,
+ bool force_advance);
+ int (*restore)(struct intel_engine_cs *ring,
+ struct drm_i915_gem_request *req);
+
/* GEN8 signal/wait table - never trust comments!
* signal to signal to signal to signal to signal to
* RCS VCS BCS VECS VCS2
@@ -283,6 +335,9 @@ struct intel_engine_cs {
struct intel_ring_hangcheck hangcheck;
+ /* Saved head value to be restored after reset */
+ u32 saved_head;
+
struct {
struct drm_i915_gem_object *obj;
u32 gtt_offset;
@@ -420,6 +475,15 @@ int intel_ring_space(struct intel_ringbuffer *ringbuf);
bool intel_ring_stopped(struct intel_engine_cs *ring);
void __intel_ring_advance(struct intel_engine_cs *ring);
+void intel_gpu_engine_reset_resample(struct intel_engine_cs *ring,
+ struct drm_i915_gem_request *req);
+int intel_ring_disable(struct intel_engine_cs *ring);
+int intel_ring_enable(struct intel_engine_cs *ring);
+int intel_ring_save(struct intel_engine_cs *ring,
+ struct drm_i915_gem_request *req, bool force_advance);
+int intel_ring_restore(struct intel_engine_cs *ring,
+ struct drm_i915_gem_request *req);
+
int __must_check intel_ring_idle(struct intel_engine_cs *ring);
void intel_ring_init_seqno(struct intel_engine_cs *ring, u32 seqno);
int intel_ring_flush_all_caches(struct intel_engine_cs *ring);
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index d96d15f..91427ac 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1463,6 +1463,205 @@ int intel_gpu_reset(struct drm_device *dev)
return -ENODEV;
}
+static inline int wait_for_engine_reset(struct drm_i915_private *dev_priv,
+ unsigned int grdom)
+{
+#define _CND ((__raw_i915_read32(dev_priv, GEN6_GDRST) & grdom) == 0)
+
+ /*
+ * Spin waiting for the device to ack the reset request.
+ * Times out after 500 us.
+ */
+ return wait_for_atomic_us(_CND, 500);
+
+#undef _CND
+}
+
+static int do_engine_reset_nolock(struct intel_engine_cs *engine)
+{
+ int ret = -ENODEV;
+ struct drm_i915_private *dev_priv = engine->dev->dev_private;
+
+ assert_spin_locked(&dev_priv->uncore.lock);
+
+ switch (engine->id) {
+ case RCS:
+ __raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_RENDER);
+ engine->hangcheck.reset_count++;
+ ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_RENDER);
+ break;
+
+ case BCS:
+ __raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_BLT);
+ engine->hangcheck.reset_count++;
+ ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_BLT);
+ break;
+
+ case VCS:
+ __raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_MEDIA);
+ engine->hangcheck.reset_count++;
+ ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_MEDIA);
+ break;
+
+ case VECS:
+ __raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_VECS);
+ engine->hangcheck.reset_count++;
+ ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_VECS);
+ break;
+
+ case VCS2:
+ __raw_i915_write32(dev_priv, GEN6_GDRST, GEN8_GRDOM_MEDIA2);
+ engine->hangcheck.reset_count++;
+ ret = wait_for_engine_reset(dev_priv, GEN8_GRDOM_MEDIA2);
+ break;
+
+ default:
+ DRM_ERROR("Unexpected engine: %d\n", engine->id);
+ break;
+ }
+
+ return ret;
+}
+
+static int gen8_do_engine_reset(struct intel_engine_cs *engine)
+{
+ struct drm_device *dev = engine->dev;
+ struct drm_i915_private *dev_priv = dev->dev_private;
+ int ret = -ENODEV;
+ unsigned long irqflags;
+
+ spin_lock_irqsave(&dev_priv->uncore.lock, irqflags);
+ ret = do_engine_reset_nolock(engine);
+ spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags);
+
+ if (!ret) {
+ u32 reset_ctl = 0;
+
+ /*
+ * Confirm that the reset control register is back to normal
+ * following the reset.
+ */
+ reset_ctl = I915_READ(RING_RESET_CTL(engine));
+ WARN(reset_ctl & 0x3, "Reset control still active after reset! (0x%08x)\n",
+ reset_ctl);
+ } else {
+ DRM_ERROR("Engine reset failed! (%d)\n", ret);
+ }
+
+ return ret;
+}
+
+int intel_gpu_engine_reset(struct intel_engine_cs *engine)
+{
+ /* Reset an individual engine */
+ int ret = -ENODEV;
+ struct drm_device *dev = engine->dev;
+
+ switch (INTEL_INFO(dev)->gen) {
+ case 8:
+ ret = gen8_do_engine_reset(engine);
+ break;
+ default:
+ DRM_ERROR("Per Engine Reset not supported on Gen%d\n",
+ INTEL_INFO(dev)->gen);
+ ret = -ENODEV;
+ break;
+ }
+
+ return ret;
+}
+
+static int gen8_request_engine_reset(struct intel_engine_cs *engine)
+{
+ int ret = 0;
+ unsigned long irqflags;
+ u32 reset_ctl = 0;
+ struct drm_i915_private *dev_priv = engine->dev->dev_private;
+
+ spin_lock_irqsave(&dev_priv->uncore.lock, irqflags);
+
+ /*
+ * Initiate reset handshake by requesting reset from the
+ * reset control register.
+ */
+ __raw_i915_write32(dev_priv, RING_RESET_CTL(engine),
+ _MASKED_BIT_ENABLE(REQUEST_RESET));
+
+ /*
+ * Wait for ready to reset ack.
+ */
+ ret = wait_for_atomic_us((__raw_i915_read32(dev_priv,
+ RING_RESET_CTL(engine)) & READY_FOR_RESET) ==
+ READY_FOR_RESET, 500);
+
+ reset_ctl = __raw_i915_read32(dev_priv, RING_RESET_CTL(engine));
+
+ spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags);
+
+ WARN(ret, "Reset request failed! (err=%d, reset control=0x%08x)\n",
+ ret, reset_ctl);
+
+ return ret;
+}
+
+static int gen8_unrequest_engine_reset(struct intel_engine_cs *engine)
+{
+ struct drm_i915_private *dev_priv = engine->dev->dev_private;
+
+ I915_WRITE(RING_RESET_CTL(engine), _MASKED_BIT_DISABLE(REQUEST_RESET));
+ return 0;
+}
+
+/*
+ * On gen8+ a reset request has to be issued via the reset control register
+ * before a GPU engine can be reset in order to stop the command streamer
+ * and idle the engine. This replaces the legacy way of stopping an engine
+ * by writing to the stop ring bit in the MI_MODE register.
+ */
+int intel_request_gpu_engine_reset(struct intel_engine_cs *engine)
+{
+ /* Request reset for an individual engine */
+ int ret = -ENODEV;
+ struct drm_device *dev;
+
+ if (WARN_ON(!engine))
+ return -EINVAL;
+
+ dev = engine->dev;
+
+ if (INTEL_INFO(dev)->gen >= 8)
+ ret = gen8_request_engine_reset(engine);
+ else
+ DRM_ERROR("Reset request not supported on Gen%d\n",
+ INTEL_INFO(dev)->gen);
+
+ return ret;
+}
+
+/*
+ * It is possible to back off from a previously issued reset request by simply
+ * clearing the reset request bit in the reset control register.
+ */
+int intel_unrequest_gpu_engine_reset(struct intel_engine_cs *engine)
+{
+ /* Clear a previously requested reset for an individual engine */
+ int ret = -ENODEV;
+ struct drm_device *dev;
+
+ if (WARN_ON(!engine))
+ return -EINVAL;
+
+ dev = engine->dev;
+
+ if (INTEL_INFO(dev)->gen >= 8)
+ ret = gen8_unrequest_engine_reset(engine);
+ else
+ DRM_ERROR("Reset unrequest not supported on Gen%d\n",
+ INTEL_INFO(dev)->gen);
+
+ return ret;
+}
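The RING_RESET_CTL writes above go through i915's masked-bit register convention: the top 16 bits of the written value select which of the low 16 bits the hardware actually latches, so enable and disable writes never disturb neighbouring bits. The macros below mirror the driver's _MASKED_BIT_ENABLE/_MASKED_BIT_DISABLE expansions; masked_write() is a hypothetical model of how such a register applies a write, included only to make the semantics testable.

```c
#include <stdint.h>

#define MASKED_BIT_ENABLE(a)  (((uint32_t)(a) << 16) | (uint32_t)(a))
#define MASKED_BIT_DISABLE(a) ((uint32_t)(a) << 16)

/* Model of a masked register: only bits whose mask (upper half of the
 * written value) is set are updated from the lower half. */
static uint32_t masked_write(uint32_t cur, uint32_t val)
{
	uint32_t mask = val >> 16;

	return (cur & ~mask) | (val & mask & 0xffffu);
}
```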
+
void intel_uncore_check_errors(struct drm_device *dev)
{
struct drm_i915_private *dev_priv = dev->dev_private;
--
1.9.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
* [RFCv2 04/12] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress
2015-07-21 13:58 [RFCv2 00/12] TDR/watchdog timeout support for gen8 Tomas Elf
` (2 preceding siblings ...)
2015-07-21 13:58 ` [RFCv2 03/12] drm/i915: Adding TDR / per-engine reset support for gen8 Tomas Elf
@ 2015-07-21 13:58 ` Tomas Elf
2015-07-21 13:58 ` [RFCv2 05/12] drm/i915: Reinstate hang recovery work queue Tomas Elf
` (8 subsequent siblings)
12 siblings, 0 replies; 15+ messages in thread
From: Tomas Elf @ 2015-07-21 13:58 UTC (permalink / raw)
To: Intel-GFX; +Cc: Ian Lister
i915_gem_check_wedge now returns a non-zero result in three different cases:
1. Legacy: A hang has been detected and full GPU reset has been promoted.
2. Per-engine recovery:
a. A single engine reference can be passed to the function, in which
case only that engine will be checked. If that particular engine is
detected to be hung and is to be reset this will yield a positive
result but not if reset is in progress for any other engine.
b. No engine reference is passed to the function, in which case all
engines are checked for ongoing per-engine hang recovery.
Also, i915_wait_request was updated to take advantage of this new
functionality. This is important since the TDR hang recovery mechanism needs a
way to force waiting threads that hold the struct_mutex to give up the
struct_mutex and try again after the hang recovery has completed. If
i915_wait_request does not take per-engine hang recovery into account there is
no way for a waiting thread to know that a per-engine recovery is about to
happen and that it needs to back off.
* v2 (Chris Wilson):
Removed unwarranted changes made to i915_gem_ring_throttle().
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
drivers/gpu/drm/i915/i915_drv.h | 3 +-
drivers/gpu/drm/i915/i915_gem.c | 63 +++++++++++++++++++++++++++------
drivers/gpu/drm/i915/intel_lrc.c | 3 +-
drivers/gpu/drm/i915/intel_ringbuffer.c | 3 +-
4 files changed, 58 insertions(+), 14 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index be4c95c..f4d0c1f 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2807,7 +2807,8 @@ i915_gem_find_active_request(struct intel_engine_cs *ring);
bool i915_gem_retire_requests(struct drm_device *dev);
void i915_gem_retire_requests_ring(struct intel_engine_cs *ring);
-int __must_check i915_gem_check_wedge(struct i915_gpu_error *error,
+int __must_check i915_gem_check_wedge(struct drm_i915_private *dev_priv,
+ struct intel_engine_cs *engine,
bool interruptible);
int __must_check i915_gem_check_olr(struct drm_i915_gem_request *req);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 4f4fc3e0..036735b 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -97,12 +97,38 @@ static void i915_gem_info_remove_obj(struct drm_i915_private *dev_priv,
spin_unlock(&dev_priv->mm.object_stat_lock);
}
+static inline int
+i915_engine_reset_in_progress(struct drm_i915_private *dev_priv,
+ struct intel_engine_cs *engine)
+{
+ int ret = 0;
+
+ if (engine) {
+ ret = !!(atomic_read(&dev_priv->ring[engine->id].hangcheck.flags)
+ & I915_ENGINE_RESET_IN_PROGRESS);
+ } else {
+ int i;
+
+ for (i = 0; i < I915_NUM_RINGS; i++)
+ if (atomic_read(&dev_priv->ring[i].hangcheck.flags)
+ & I915_ENGINE_RESET_IN_PROGRESS) {
+
+ ret = 1;
+ break;
+ }
+ }
+
+ return ret;
+}
+
static int
-i915_gem_wait_for_error(struct i915_gpu_error *error)
+i915_gem_wait_for_error(struct drm_i915_private *dev_priv)
{
int ret;
+ struct i915_gpu_error *error = &dev_priv->gpu_error;
#define EXIT_COND (!i915_reset_in_progress(error) || \
+ !i915_engine_reset_in_progress(dev_priv, NULL) || \
i915_terminally_wedged(error))
if (EXIT_COND)
return 0;
@@ -131,7 +157,7 @@ int i915_mutex_lock_interruptible(struct drm_device *dev)
struct drm_i915_private *dev_priv = dev->dev_private;
int ret;
- ret = i915_gem_wait_for_error(&dev_priv->gpu_error);
+ ret = i915_gem_wait_for_error(dev_priv);
if (ret)
return ret;
@@ -1128,10 +1154,15 @@ put_rpm:
}
int
-i915_gem_check_wedge(struct i915_gpu_error *error,
+i915_gem_check_wedge(struct drm_i915_private *dev_priv,
+ struct intel_engine_cs *engine,
bool interruptible)
{
- if (i915_reset_in_progress(error)) {
+ struct i915_gpu_error *error = &dev_priv->gpu_error;
+
+ if (i915_reset_in_progress(error) ||
+ i915_engine_reset_in_progress(dev_priv, engine)) {
+
/* Non-interruptible callers can't handle -EAGAIN, hence return
* -EIO unconditionally for these. */
if (!interruptible)
@@ -1213,6 +1244,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
unsigned long timeout_expire;
s64 before, now;
int ret;
+ int reset_in_progress = 0;
WARN(!intel_irqs_enabled(dev_priv), "IRQs disabled");
@@ -1239,11 +1271,17 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
/* We need to check whether any gpu reset happened in between
* the caller grabbing the seqno and now ... */
- if (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) {
+ reset_in_progress =
+ i915_gem_check_wedge(ring->dev->dev_private, NULL, interruptible);
+
+ if ((reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) ||
+ reset_in_progress) {
+
/* ... but upgrade the -EAGAIN to an -EIO if the gpu
* is truely gone. */
- ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
- if (ret == 0)
+ if (reset_in_progress)
+ ret = reset_in_progress;
+ else
ret = -EAGAIN;
break;
}
@@ -1327,7 +1365,7 @@ i915_wait_request(struct drm_i915_gem_request *req)
BUG_ON(!mutex_is_locked(&dev->struct_mutex));
- ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
+ ret = i915_gem_check_wedge(dev_priv, NULL, interruptible);
if (ret)
return ret;
@@ -1396,6 +1434,7 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
struct drm_i915_private *dev_priv = dev->dev_private;
unsigned reset_counter;
int ret;
+ struct intel_engine_cs *ring = NULL;
BUG_ON(!mutex_is_locked(&dev->struct_mutex));
BUG_ON(!dev_priv->mm.interruptible);
@@ -1404,7 +1443,9 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
if (!req)
return 0;
- ret = i915_gem_check_wedge(&dev_priv->gpu_error, true);
+ ring = i915_gem_request_get_ring(req);
+
+ ret = i915_gem_check_wedge(dev_priv, ring, true);
if (ret)
return ret;
@@ -4075,11 +4116,11 @@ i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
unsigned reset_counter;
int ret;
- ret = i915_gem_wait_for_error(&dev_priv->gpu_error);
+ ret = i915_gem_wait_for_error(dev_priv);
if (ret)
return ret;
- ret = i915_gem_check_wedge(&dev_priv->gpu_error, false);
+ ret = i915_gem_check_wedge(dev_priv, NULL, false);
if (ret)
return ret;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index e02abec..4a19385 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -986,7 +986,8 @@ static int intel_logical_ring_begin(struct intel_ringbuffer *ringbuf,
struct drm_i915_private *dev_priv = dev->dev_private;
int ret;
- ret = i915_gem_check_wedge(&dev_priv->gpu_error,
+ ret = i915_gem_check_wedge(dev_priv,
+ ring,
dev_priv->mm.interruptible);
if (ret)
return ret;
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 0fdf983..fc82942 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -2259,7 +2259,8 @@ int intel_ring_begin(struct intel_engine_cs *ring,
struct drm_i915_private *dev_priv = ring->dev->dev_private;
int ret;
- ret = i915_gem_check_wedge(&dev_priv->gpu_error,
+ ret = i915_gem_check_wedge(dev_priv,
+ ring,
dev_priv->mm.interruptible);
if (ret)
return ret;
--
1.9.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [RFCv2 05/12] drm/i915: Reinstate hang recovery work queue.
From: Tomas Elf @ 2015-07-21 13:58 UTC (permalink / raw)
To: Intel-GFX; +Cc: Mika Kuoppala
There used to be a work queue separating the error handler from the hang
recovery path, which was removed a while back in this commit:
commit b8d24a06568368076ebd5a858a011699a97bfa42
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date: Wed Jan 28 17:03:14 2015 +0200
drm/i915: Remove nested work in gpu error handling
Most of that commit now needs to be reverted, since the work queue separating
hang detection from hang recovery is required by the upcoming watchdog timeout
feature. The watchdog interrupt service routine will be a second call site of
the error handler, alongside the periodic hang checker (which runs in work
queue context). Since the error handler will then be serving a caller in hard
interrupt execution context, it must never end up in a situation where it
needs to grab the struct_mutex. Unfortunately, grabbing the struct_mutex is
exactly the first thing the hang recovery path does, and that may sleep if the
mutex is already held by another thread, which is not acceptable in hard
interrupt context.
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
drivers/gpu/drm/i915/i915_dma.c | 1 +
drivers/gpu/drm/i915/i915_drv.h | 1 +
drivers/gpu/drm/i915/i915_irq.c | 28 +++++++++++++++++++++++-----
3 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index cf01e84..39b8f5f 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1116,6 +1116,7 @@ int i915_driver_unload(struct drm_device *dev)
/* Free error state after interrupts are fully disabled. */
cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work);
i915_destroy_error_state(dev);
+ cancel_work_sync(&dev_priv->gpu_error.work);
if (dev->pdev->msi_enabled)
pci_disable_msi(dev->pdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index f4d0c1f..ef7c129 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1245,6 +1245,7 @@ struct i915_gpu_error {
spinlock_t lock;
/* Protected by the above dev->gpu_error.lock. */
struct drm_i915_error_state *first_error;
+ struct work_struct work;
unsigned long missed_irq_rings;
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index e869823..4973826 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2300,15 +2300,18 @@ static void i915_error_wake_up(struct drm_i915_private *dev_priv,
}
/**
- * i915_reset_and_wakeup - do process context error handling work
+ * i915_error_work_func - do process context error handling work
*
* Fire an error uevent so userspace can see that a hang or error
* was detected.
*/
-static void i915_reset_and_wakeup(struct drm_device *dev)
+static void i915_error_work_func(struct work_struct *work)
{
- struct drm_i915_private *dev_priv = to_i915(dev);
- struct i915_gpu_error *error = &dev_priv->gpu_error;
+ struct i915_gpu_error *error = container_of(work, struct i915_gpu_error,
+ work);
+ struct drm_i915_private *dev_priv =
+ container_of(error, struct drm_i915_private, gpu_error);
+ struct drm_device *dev = dev_priv->dev;
char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
@@ -2668,7 +2671,21 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
i915_error_wake_up(dev_priv, false);
}
- i915_reset_and_wakeup(dev);
+ /*
+ * Gen 7:
+ *
+ * Our reset work can grab modeset locks (since it needs to reset the
+ * state of outstanding pageflips). Hence it must not be run on our own
+ * dev_priv->wq work queue, for otherwise the flush_work in the pageflip
+ * code will deadlock.
+ * If error_work is already in the work queue then it will not be added
+ * again. It hasn't yet executed so it will see the reset flags when
+ * it is scheduled. If it isn't in the queue or it is currently
+ * executing then this call will add it to the queue again so that
+ * even if it misses the reset flags during the current call it is
+ * guaranteed to see them on the next call.
+ */
+ schedule_work(&dev_priv->gpu_error.work);
}
/* Called from drm generic code, passed 'crtc' which
@@ -4389,6 +4406,7 @@ void intel_irq_init(struct drm_i915_private *dev_priv)
struct drm_device *dev = dev_priv->dev;
INIT_WORK(&dev_priv->hotplug_work, i915_hotplug_work_func);
+ INIT_WORK(&dev_priv->gpu_error.work, i915_error_work_func);
INIT_WORK(&dev_priv->dig_port_work, i915_digport_work_func);
INIT_WORK(&dev_priv->rps.work, gen6_pm_rps_work);
INIT_WORK(&dev_priv->l3_parity.error_work, ivybridge_parity_work);
--
1.9.1
* [RFCv2 06/12] drm/i915: Watchdog timeout support for gen8.
From: Tomas Elf @ 2015-07-21 13:58 UTC (permalink / raw)
To: Intel-GFX; +Cc: Ian Lister
Watchdog timeout (or "media engine reset" as it is sometimes called, even
though the render engine is also supported) is a feature that allows userland
applications to enable hang detection on individual batch buffers. The
detection mechanism itself is mostly handled by the hardware; to support this
form of hang detection the driver only needs to implement the interrupt
handling and inject watchdog start and cancellation instructions before and
after the emitted batch buffer start instruction in the ring buffer.
The principle of this hang detection mechanism is as follows:
1. Once the decision has been made to enable watchdog timeout for a particular
batch buffer and the driver is in the process of emitting the batch buffer
start instruction into the ring buffer it also emits a watchdog timer start
instruction before and a watchdog timer cancellation instruction after the
batch buffer instruction in the ring buffer.
2. Once the GPU execution reaches the watchdog timer start instruction the
hardware watchdog counter is started by the hardware. The counter keeps
counting until it reaches a previously configured threshold value.
2a. If the counter reaches the threshold value the hardware fires a watchdog
interrupt, which is picked up by the watchdog interrupt service routine added in
this commit. This means that a hang has been detected and the driver needs to deal
with it the same way it would deal with an engine hang detected by the periodic
hang checker. The only difference between the two is that a watchdog timeout is
never promoted to a full GPU reset, even if a per-engine reset was attempted
too recently. Thus, the watchdog interrupt handler calls the error handler
directly, passing the engine mask of the hung engine in question, which
immediately results in a per-engine hang recovery being scheduled.
2b. If the batch buffer finishes executing and the execution reaches the
watchdog cancellation instruction before the watchdog counter reaches its
threshold value the watchdog is cancelled and nothing more comes of it. No hang
was detected.
Watchdog timeout is currently supported for the render engine and all
available media engines. The specifications allude to the VECS engine also
being supported, but this commit does not support it.
The current default watchdog threshold value is 60 ms, since this has been
empirically determined to be a good compromise between low-latency requirements
and a low rate of false positives.
NOTE: I don't know if Ben Widawsky had any part in this code from 3 years
ago. There have been so many people involved in this already that I am in no
position to know. If I've missed anyone's sob line please let me know.
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
---
drivers/gpu/drm/i915/i915_debugfs.c | 2 +-
drivers/gpu/drm/i915/i915_dma.c | 59 ++++++++++++++++++++
drivers/gpu/drm/i915/i915_drv.h | 7 ++-
drivers/gpu/drm/i915/i915_irq.c | 84 ++++++++++++++++++++++------
drivers/gpu/drm/i915/i915_reg.h | 7 +++
drivers/gpu/drm/i915/intel_lrc.c | 99 +++++++++++++++++++++++++++++++++
drivers/gpu/drm/i915/intel_ringbuffer.h | 31 +++++++++++
include/uapi/drm/i915_drm.h | 5 +-
8 files changed, 271 insertions(+), 23 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index e33e105..a89da48 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -4183,7 +4183,7 @@ i915_wedged_set(void *data, u64 val)
intel_runtime_pm_get(dev_priv);
- i915_handle_error(dev, 0x0, val,
+ i915_handle_error(dev, 0x0, false, val,
"Manually setting wedged to %llu", val);
intel_runtime_pm_put(dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 39b8f5f..bf1d45a 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -792,6 +792,64 @@ i915_hangcheck_init(struct drm_device *dev)
}
}
+void i915_watchdog_init(struct drm_device *dev)
+{
+ struct drm_i915_private *dev_priv = dev->dev_private;
+ int freq;
+ int i;
+
+ /*
+ * Based on pre-defined time out value (60ms or 30ms) calculate
+ * timer count thresholds needed based on core frequency.
+ *
+ * For RCS.
+ * The timestamp resolution changed in Gen7 and beyond to 80ns
+ * for all pipes. Before that it was 640ns.
+ */
+
+#define KM_RCS_ENGINE_TIMEOUT_VALUE_IN_MS 60
+#define KM_BSD_ENGINE_TIMEOUT_VALUE_IN_MS 60
+#define KM_TIMER_MILLISECOND 1000
+
+ /*
+ * Timestamp timer resolution = 0.080 uSec,
+ * or 12500000 counts per second
+ */
+#define KM_TIMESTAMP_CNTS_PER_SEC_80NS 12500000
+
+ /*
+ * Timestamp timer resolution = 0.640 uSec,
+ * or 1562500 counts per second
+ */
+#define KM_TIMESTAMP_CNTS_PER_SEC_640NS 1562500
+
+ if (INTEL_INFO(dev)->gen >= 7)
+ freq = KM_TIMESTAMP_CNTS_PER_SEC_80NS;
+ else
+ freq = KM_TIMESTAMP_CNTS_PER_SEC_640NS;
+
+ dev_priv->ring[RCS].watchdog_threshold =
+ ((KM_RCS_ENGINE_TIMEOUT_VALUE_IN_MS) *
+ (freq / KM_TIMER_MILLISECOND));
+
+ dev_priv->ring[VCS].watchdog_threshold =
+ ((KM_BSD_ENGINE_TIMEOUT_VALUE_IN_MS) *
+ (freq / KM_TIMER_MILLISECOND));
+
+ dev_priv->ring[VCS2].watchdog_threshold =
+ ((KM_BSD_ENGINE_TIMEOUT_VALUE_IN_MS) *
+ (freq / KM_TIMER_MILLISECOND));
+
+ for (i = 0; i < I915_NUM_RINGS; i++)
+ dev_priv->ring[i].hangcheck.watchdog_count = 0;
+
+ DRM_INFO("Watchdog Timeout [ms], " \
+ "RCS: 0x%08X, VCS: 0x%08X, VCS2: 0x%08X\n", \
+ KM_RCS_ENGINE_TIMEOUT_VALUE_IN_MS,
+ KM_BSD_ENGINE_TIMEOUT_VALUE_IN_MS,
+ KM_BSD_ENGINE_TIMEOUT_VALUE_IN_MS);
+}
+
/**
* i915_driver_load - setup chip and create an initial config
* @dev: DRM device
@@ -973,6 +1031,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
i915_gem_load(dev);
i915_hangcheck_init(dev);
+ i915_watchdog_init(dev);
/* On the 945G/GM, the chipset reports the MSI capability on the
* integrated graphics even though the support isn't actually there
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ef7c129..3d31872 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2563,6 +2563,7 @@ extern unsigned long i915_gfx_val(struct drm_i915_private *dev_priv);
extern void i915_update_gfx_val(struct drm_i915_private *dev_priv);
int vlv_force_gfx_clock(struct drm_i915_private *dev_priv, bool on);
void intel_hpd_cancel_work(struct drm_i915_private *dev_priv);
+void i915_watchdog_init(struct drm_device *dev);
static inline void i915_hangcheck_reinit(struct intel_engine_cs *engine)
{
struct intel_ring_hangcheck *hc = &engine->hangcheck;
@@ -2578,9 +2579,9 @@ static inline void i915_hangcheck_reinit(struct intel_engine_cs *engine)
/* i915_irq.c */
void i915_queue_hangcheck(struct drm_device *dev);
-__printf(4, 5)
-void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
- const char *fmt, ...);
+__printf(5, 6)
+void i915_handle_error(struct drm_device *dev, u32 engine_mask,
+ bool watchdog, bool wedged, const char *fmt, ...);
extern void intel_irq_init(struct drm_i915_private *dev_priv);
extern void intel_hpd_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 4973826..5672a2c 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1289,6 +1289,18 @@ static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
intel_lrc_irq_handler(&dev_priv->ring[RCS]);
if (tmp & (GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT))
notify_ring(&dev_priv->ring[RCS]);
+ if (tmp & (GT_GEN8_RCS_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT)) {
+ struct intel_engine_cs *ring;
+
+ /* Stop the counter to prevent further interrupts */
+ ring = &dev_priv->ring[RCS];
+ I915_WRITE(RING_CNTR(ring->mmio_base),
+ GEN6_RCS_WATCHDOG_DISABLE);
+
+ ring->hangcheck.watchdog_count++;
+ i915_handle_error(ring->dev, intel_ring_flag(ring), true, true,
+ "Render engine watchdog timed out");
+ }
if (tmp & (GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT))
intel_lrc_irq_handler(&dev_priv->ring[BCS]);
@@ -1308,11 +1320,35 @@ static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
intel_lrc_irq_handler(&dev_priv->ring[VCS]);
if (tmp & (GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT))
notify_ring(&dev_priv->ring[VCS]);
+ if (tmp & (GT_GEN8_VCS_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT)) {
+ struct intel_engine_cs *ring;
+
+ /* Stop the counter to prevent further interrupts */
+ ring = &dev_priv->ring[VCS];
+ I915_WRITE(RING_CNTR(ring->mmio_base),
+ GEN8_VCS_WATCHDOG_DISABLE);
+
+ ring->hangcheck.watchdog_count++;
+ i915_handle_error(ring->dev, intel_ring_flag(ring), true, true,
+ "Media engine watchdog timed out");
+ }
if (tmp & (GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT))
intel_lrc_irq_handler(&dev_priv->ring[VCS2]);
if (tmp & (GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT))
notify_ring(&dev_priv->ring[VCS2]);
+ if (tmp & (GT_GEN8_VCS_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT)) {
+ struct intel_engine_cs *ring;
+
+ /* Stop the counter to prevent further interrupts */
+ ring = &dev_priv->ring[VCS2];
+ I915_WRITE(RING_CNTR(ring->mmio_base),
+ GEN8_VCS_WATCHDOG_DISABLE);
+
+ ring->hangcheck.watchdog_count++;
+ i915_handle_error(ring->dev, intel_ring_flag(ring), true, true,
+ "Media engine 2 watchdog timed out");
+ }
} else
DRM_ERROR("The master control interrupt lied (GT1)!\n");
}
@@ -2573,6 +2609,7 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
* or if one of the current engine resets fails we fall
* back to legacy full GPU reset.
*
+ * @watchdog: true = Engine hang detected by hardware watchdog.
* @wedged: true = Hang detected, invoke hang recovery.
* @fmt, ...: Error message describing reason for error.
*
@@ -2584,8 +2621,8 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
* reset the associated engine. Failing that, try to fall back to legacy
* full GPU reset recovery mode.
*/
-void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
- const char *fmt, ...)
+void i915_handle_error(struct drm_device *dev, u32 engine_mask,
+ bool watchdog, bool wedged, const char *fmt, ...)
{
struct drm_i915_private *dev_priv = dev->dev_private;
va_list args;
@@ -2617,20 +2654,27 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
u32 i;
for_each_ring(engine, dev_priv, i) {
- u32 now, last_engine_reset_timediff;
if (!(intel_ring_flag(engine) & engine_mask))
continue;
- /* Measure the time since this engine was last reset */
- now = get_seconds();
- last_engine_reset_timediff =
- now - engine->hangcheck.last_engine_reset_time;
-
- full_reset = last_engine_reset_timediff <
- i915.gpu_reset_promotion_time;
-
- engine->hangcheck.last_engine_reset_time = now;
+ if (!watchdog) {
+ /* Measure the time since this engine was last reset */
+ u32 now = get_seconds();
+ u32 last_engine_reset_timediff =
+ now - engine->hangcheck.last_engine_reset_time;
+
+ full_reset = last_engine_reset_timediff <
+ i915.gpu_reset_promotion_time;
+
+ engine->hangcheck.last_engine_reset_time = now;
+ } else {
+ /*
+ * Watchdog timeout always results
+ * in engine reset.
+ */
+ full_reset = false;
+ }
/*
* This engine was not reset too recently - go ahead
@@ -2641,10 +2685,11 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
* This can still be overridden by a global
* reset e.g. if per-engine reset fails.
*/
- if (!full_reset)
+ if (watchdog || !full_reset)
atomic_set_mask(I915_ENGINE_RESET_IN_PROGRESS,
&engine->hangcheck.flags);
- else
+
+ if (full_reset)
break;
} /* for_each_ring */
@@ -2652,7 +2697,7 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
if (full_reset) {
atomic_set_mask(I915_RESET_IN_PROGRESS_FLAG,
- &dev_priv->gpu_error.reset_counter);
+ &dev_priv->gpu_error.reset_counter);
}
/*
@@ -2990,7 +3035,7 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd)
*/
tmp = I915_READ_CTL(ring);
if (tmp & RING_WAIT) {
- i915_handle_error(dev, intel_ring_flag(ring), false,
+ i915_handle_error(dev, intel_ring_flag(ring), false, false,
"Kicking stuck wait on %s",
ring->name);
I915_WRITE_CTL(ring, tmp);
@@ -3002,7 +3047,7 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd)
default:
return HANGCHECK_HUNG;
case 1:
- i915_handle_error(dev, intel_ring_flag(ring), false,
+ i915_handle_error(dev, intel_ring_flag(ring), false, false,
"Kicking stuck semaphore on %s",
ring->name);
I915_WRITE_CTL(ring, tmp);
@@ -3135,7 +3180,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
}
if (engine_mask)
- i915_handle_error(dev, engine_mask, true, "Ring hung (0x%02x)", engine_mask);
+ i915_handle_error(dev, engine_mask, false, true, "Ring hung (0x%02x)", engine_mask);
if (busy_count)
/* Reset timer case chip hangs without another request
@@ -3589,11 +3634,14 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
{
/* These are interrupts we'll toggle with the ring mask register */
uint32_t gt_interrupts[] = {
+ GT_GEN8_RCS_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
GT_RENDER_L3_PARITY_ERROR_INTERRUPT |
GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT,
+ GT_GEN8_VCS_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
+ GT_GEN8_VCS_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index af9f0ad..d2adb9b 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1181,6 +1181,8 @@ enum skl_disp_power_wells {
#define RING_HEAD(base) ((base)+0x34)
#define RING_START(base) ((base)+0x38)
#define RING_CTL(base) ((base)+0x3c)
+#define RING_CNTR(base) ((base)+0x178)
+#define RING_THRESH(base) ((base)+0x17C)
#define RING_SYNC_0(base) ((base)+0x40)
#define RING_SYNC_1(base) ((base)+0x44)
#define RING_SYNC_2(base) ((base)+0x48)
@@ -1584,6 +1586,11 @@ enum skl_disp_power_wells {
#define GT_BSD_USER_INTERRUPT (1 << 12)
#define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1 (1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
#define GT_CONTEXT_SWITCH_INTERRUPT (1 << 8)
+#define GT_GEN6_RENDER_WATCHDOG_INTERRUPT (1 << 6)
+#define GT_GEN8_RCS_WATCHDOG_INTERRUPT (1 << 6)
+#define GEN6_RCS_WATCHDOG_DISABLE 1
+#define GT_GEN8_VCS_WATCHDOG_INTERRUPT (1 << 6)
+#define GEN8_VCS_WATCHDOG_DISABLE 0xFFFFFFFF
#define GT_RENDER_L3_PARITY_ERROR_INTERRUPT (1 << 5) /* !snb */
#define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT (1 << 4)
#define GT_RENDER_CS_MASTER_ERROR_INTERRUPT (1 << 3)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 4a19385..ff9d27cb 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1005,6 +1005,78 @@ static int intel_logical_ring_begin(struct intel_ringbuffer *ringbuf,
return 0;
}
+static int
+gen8_ring_start_watchdog(struct intel_ringbuffer *ringbuf, struct intel_context *ctx)
+{
+ int ret;
+ struct intel_engine_cs *ring = ringbuf->ring;
+
+ ret = intel_logical_ring_begin(ringbuf, ctx, 10);
+ if (ret)
+ return ret;
+
+ /*
+ * i915_reg.h includes a warning to place a MI_NOOP
+ * before a MI_LOAD_REGISTER_IMM
+ */
+ intel_logical_ring_emit(ringbuf, MI_NOOP);
+ intel_logical_ring_emit(ringbuf, MI_NOOP);
+
+ /* Set counter period */
+ intel_logical_ring_emit(ringbuf, MI_LOAD_REGISTER_IMM(1));
+ intel_logical_ring_emit(ringbuf, RING_THRESH(ring->mmio_base));
+ intel_logical_ring_emit(ringbuf, ring->watchdog_threshold);
+ intel_logical_ring_emit(ringbuf, MI_NOOP);
+
+ /* Start counter */
+ intel_logical_ring_emit(ringbuf, MI_LOAD_REGISTER_IMM(1));
+ intel_logical_ring_emit(ringbuf, RING_CNTR(ring->mmio_base));
+ intel_logical_ring_emit(ringbuf, I915_WATCHDOG_ENABLE);
+ intel_logical_ring_emit(ringbuf, MI_NOOP);
+ intel_logical_ring_advance(ringbuf);
+
+ return 0;
+}
+
+static int
+gen8_ring_stop_watchdog(struct intel_ringbuffer *ringbuf, struct intel_context *ctx)
+{
+ int ret;
+ struct intel_engine_cs *ring = ringbuf->ring;
+
+ ret = intel_logical_ring_begin(ringbuf, ctx, 6);
+ if (ret)
+ return ret;
+
+ /*
+ * i915_reg.h includes a warning to place a MI_NOOP
+ * before a MI_LOAD_REGISTER_IMM
+ */
+ intel_logical_ring_emit(ringbuf, MI_NOOP);
+ intel_logical_ring_emit(ringbuf, MI_NOOP);
+
+ intel_logical_ring_emit(ringbuf, MI_LOAD_REGISTER_IMM(1));
+ intel_logical_ring_emit(ringbuf, RING_CNTR(ring->mmio_base));
+
+ switch (ring->id) {
+ default:
+ WARN(1, "%s does not support watchdog timeout! " \
+ "Defaulting to render engine.\n", ring->name);
+ case RCS:
+ intel_logical_ring_emit(ringbuf, GEN6_RCS_WATCHDOG_DISABLE);
+ break;
+ case VCS:
+ case VCS2:
+ intel_logical_ring_emit(ringbuf, GEN8_VCS_WATCHDOG_DISABLE);
+ break;
+ }
+
+ intel_logical_ring_emit(ringbuf, MI_NOOP);
+ intel_logical_ring_advance(ringbuf);
+
+ return 0;
+}
+
/**
* execlists_submission() - submit a batchbuffer for execution, Execlists style
* @dev: DRM device.
@@ -1035,6 +1107,7 @@ int intel_execlists_submission(struct drm_device *dev, struct drm_file *file,
int instp_mode;
u32 instp_mask;
int ret;
+ bool watchdog_running = false;
instp_mode = args->flags & I915_EXEC_CONSTANTS_MASK;
instp_mask = I915_EXEC_CONSTANTS_MASK;
@@ -1086,6 +1159,18 @@ int intel_execlists_submission(struct drm_device *dev, struct drm_file *file,
if (ret)
return ret;
+ /* Start watchdog timer */
+ if (args->flags & I915_EXEC_ENABLE_WATCHDOG) {
+ if (!intel_ring_supports_watchdog(ring))
+ return -EINVAL;
+
+ ret = gen8_ring_start_watchdog(ringbuf, ctx);
+ if (ret)
+ return ret;
+
+ watchdog_running = true;
+ }
+
if (ring == &dev_priv->ring[RCS] &&
instp_mode != dev_priv->relative_constants_mode) {
ret = intel_logical_ring_begin(ringbuf, ctx, 4);
@@ -1107,6 +1192,13 @@ int intel_execlists_submission(struct drm_device *dev, struct drm_file *file,
trace_i915_gem_ring_dispatch(intel_ring_get_request(ring), dispatch_flags);
+ /* Cancel watchdog timer */
+ if (watchdog_running) {
+ ret = gen8_ring_stop_watchdog(ringbuf, ctx);
+ if (ret)
+ return ret;
+ }
+
i915_gem_execbuffer_move_to_active(vmas, ring);
i915_gem_execbuffer_retire_commands(dev, file, ring, batch_obj);
@@ -1775,6 +1867,9 @@ static int logical_render_ring_init(struct drm_device *dev)
if (HAS_L3_DPF(dev))
ring->irq_keep_mask |= GT_RENDER_L3_PARITY_ERROR_INTERRUPT;
+ ring->irq_keep_mask |=
+ (GT_GEN8_RCS_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT);
+
if (INTEL_INFO(dev)->gen >= 9)
ring->init_hw = gen9_init_render_ring;
else
@@ -1813,6 +1908,8 @@ static int logical_bsd_ring_init(struct drm_device *dev)
GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT;
ring->irq_keep_mask =
GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT;
+ ring->irq_keep_mask |=
+ (GT_GEN8_VCS_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT);
ring->init_hw = gen8_init_common_ring;
ring->get_seqno = gen8_get_seqno;
@@ -1842,6 +1939,8 @@ static int logical_bsd2_ring_init(struct drm_device *dev)
GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT;
ring->irq_keep_mask =
GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT;
+ ring->irq_keep_mask |=
+ (GT_GEN8_VCS_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT);
ring->init_hw = gen8_init_common_ring;
ring->get_seqno = gen8_get_seqno;
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 35360a4..9058789 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -30,6 +30,8 @@ struct intel_hw_status_page {
struct drm_i915_gem_object *obj;
};
+#define I915_WATCHDOG_ENABLE 0
+
#define I915_READ_TAIL(ring) I915_READ(RING_TAIL((ring)->mmio_base))
#define I915_WRITE_TAIL(ring, val) I915_WRITE(RING_TAIL((ring)->mmio_base), val)
@@ -136,6 +138,9 @@ struct intel_ring_hangcheck {
/* Number of TDR hang detections */
u32 tdr_count;
+
+ /* Number of watchdog hang detections for this ring */
+ u32 watchdog_count;
};
struct intel_ringbuffer {
@@ -338,6 +343,12 @@ struct intel_engine_cs {
/* Saved head value to be restored after reset */
u32 saved_head;
+ /*
+ * Watchdog timer threshold values
+ * only RCS, VCS, VCS2 rings have watchdog timeout support
+ */
+ uint32_t watchdog_threshold;
+
struct {
struct drm_i915_gem_object *obj;
u32 gtt_offset;
@@ -484,6 +495,26 @@ int intel_ring_save(struct intel_engine_cs *ring,
int intel_ring_restore(struct intel_engine_cs *ring,
struct drm_i915_gem_request *req);
+static inline bool intel_ring_supports_watchdog(struct intel_engine_cs *ring)
+{
+ bool ret = false;
+
+ if (WARN_ON(!ring))
+ goto exit;
+
+ ret = ( ring->id == RCS ||
+ ring->id == VCS ||
+ ring->id == VCS2);
+
+ if (!ret)
+ DRM_ERROR("%s does not support watchdog timeout!\n", ring->name);
+
+exit:
+ return ret;
+}
+int intel_ring_start_watchdog(struct intel_engine_cs *ring);
+int intel_ring_stop_watchdog(struct intel_engine_cs *ring);
+
int __must_check intel_ring_idle(struct intel_engine_cs *ring);
void intel_ring_init_seqno(struct intel_engine_cs *ring, u32 seqno);
int intel_ring_flush_all_caches(struct intel_engine_cs *ring);
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 4851d66..f8af7d2 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -760,7 +760,10 @@ struct drm_i915_gem_execbuffer2 {
#define I915_EXEC_BSD_RING1 (1<<13)
#define I915_EXEC_BSD_RING2 (2<<13)
-#define __I915_EXEC_UNKNOWN_FLAGS -(1<<15)
+/* Enable watchdog timer for this batch buffer */
+#define I915_EXEC_ENABLE_WATCHDOG (1<<15)
+
+#define __I915_EXEC_UNKNOWN_FLAGS -(1<<16)
#define I915_EXEC_CONTEXT_ID_MASK (0xffffffff)
#define i915_execbuffer2_set_context_id(eb2, context) \
--
1.9.1
* [RFCv2 07/12] drm/i915: Fake lost context interrupts through forced CSB check.
From: Tomas Elf @ 2015-07-21 13:58 UTC (permalink / raw)
To: Intel-GFX
A recurring issue during long-duration operations testing of concurrent
rendering tasks with intermittent hangs is that context completion interrupts
following engine resets are sometimes lost. This is a real problem: the
hardware may have completed a previously hung context following a per-engine
hang recovery and then gone idle without sending an interrupt to tell the
driver about it. At that point the driver is stuck waiting for context
completion, believing that the context is still active, even though the
hardware is idle and waiting for more work.
The way this is solved is by periodically checking for context submission
status inconsistencies. What this means is that the ID of the currently running
context on a given engine is compared against the context ID in the
EXECLIST_STATUS register of the respective engine. If the two do not match and
if the state does not change over time it is assumed that an interrupt was
missed and that the driver is now stuck in an inconsistent state.
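The comparison described above can be sketched outside the driver as follows. The struct and helper names are illustrative, not the driver's actual API, and the hardware/driver state is passed in explicitly instead of being read from the EXECLIST_STATUS register:

```c
#include <assert.h>
#include <stdint.h>

enum submission_state { STATE_OK, STATE_INCONSISTENT };

struct engine_sample {
	uint32_t hw_ctx_id;	/* context ID from EXECLIST_STATUS */
	uint32_t sw_ctx_id;	/* context the driver believes is running */
};

/*
 * The two IDs must disagree on every sample before we conclude that an
 * interrupt was lost; a single mismatch may just be a context switch
 * in flight, so the state has to be stable over time.
 */
enum submission_state
check_submission_state(const struct engine_sample *samples, int n)
{
	int mismatches = 0;

	for (int i = 0; i < n; i++)
		if (samples[i].hw_ctx_id != samples[i].sw_ctx_id)
			mismatches++;

	return (n > 0 && mismatches == n) ? STATE_INCONSISTENT : STATE_OK;
}
```

A transient mismatch on one sample followed by agreement on the next is treated as a normal context transition, not a lost interrupt.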
Following the decision that the driver and the hardware are irreversibly stuck
in an inconsistent state on a certain engine, the presumably lost interrupt is
faked by simply calling the execlist interrupt handler from a non-interrupt
context. Even though an interrupt might be lost, the hardware still updates
the context status buffer (CSB) when appropriate, so any context state
transitions are captured there regardless of whether the interrupt was sent.
By faking the lost interrupt the interrupt handler can act on the outstanding
context status transition events in the CSB, e.g. a context completion event. In the case
where the hardware would be idle but the driver would be waiting for
completion, faking an interrupt and finding a context completion status event
would cause the driver to remove the currently active request from the execlist
queue and go idle - thereby reestablishing a consistent context submission
status between the hardware and the driver.
The way this is implemented is that the hang checker always stays alive as
long as there is outstanding work. Even if the enable_hangcheck flag is
disabled, one part of the hang checker stays alive and reschedules
itself, only to scan for inconsistent context submission states on all engines.
As long as the context submission status of the currently running context on a
given engine is consistent the hang checker works as normal and schedules hang
recoveries as expected. If the status is not consistent no hang recoveries will
be scheduled since no context resubmission will be possible anyway, so there is
no point in trying until the status becomes consistent again. Of course, if
enough hangs on the same engine are detected without any change in consistency
the hang checker will go straight for the full GPU reset so there is no chance
of getting stuck in this state.
It's worth keeping in mind that the watchdog timeout hang detection mechanism
relies entirely on the per-engine hang recovery path. So if we have an
inconsistent context submission status on an engine where the watchdog timeout
has detected a hang, there is no way to recover from that hang if the periodic
hang checker is turned off, since the per-engine hang recovery cannot do its
final context resubmission while the context submission status is inconsistent.
That's why we need to make sure that there is always a thread alive that keeps
an eye out for inconsistent context submission states, not only for the
periodic hang checker but also for watchdog timeout.
Finally, since a non-interrupt context thread could end up in the interrupt
handler as part of the forced CSB checking there's the chance of a race
condition between the interrupt handler and the ring init code since both
update ring->next_context_status_buffer. Therefore we've had to update the
interrupt handler so that it grabs the execlist spinlock before updating
the variable. We've also had to make sure that the ring init code
grabs the execlist spinlock before initializing this variable.
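The locking rule in the last paragraph (the handler takes the execlist lock itself only when asked, so a caller that already holds it can invoke it safely) can be sketched with a toy non-recursive lock. All names here are illustrative, not the driver's API:

```c
#include <assert.h>

/*
 * Toy non-recursive lock: taking it twice would deadlock in the kernel;
 * here we just trap the error with an assert on the depth counter.
 */
static int lock_depth;

static void execlist_lock(void)   { assert(lock_depth == 0); lock_depth = 1; }
static void execlist_unlock(void) { assert(lock_depth == 1); lock_depth = 0; }

static int next_csb;	/* shared with the ring-init path */

/*
 * The handler takes the lock itself only when do_lock is set, so a
 * caller that already holds it (the forced CSB check) passes 0.
 * Returns the number of events handled, mirroring the new int return.
 */
int csb_handler(int events, int do_lock)
{
	if (do_lock)
		execlist_lock();
	next_csb = (next_csb + events) % 6;	/* six CSB slots per engine */
	if (do_lock)
		execlist_unlock();
	return events;
}

int forced_csb_check(int events)
{
	int handled;

	execlist_lock();		/* serialize against ring init */
	handled = csb_handler(events, 0);	/* lock already held */
	execlist_unlock();
	return handled;
}
```

Had the forced check called the handler with do_lock set, the non-recursive lock would be taken twice, which is exactly the deadlock the do_lock parameter avoids.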
* v2: (Chris Wilson)
Remove context submission status consistency pre-check from
i915_hangcheck_elapsed() and turn it into a pre-check to per-engine reset.
The following describes the change in philosophy in how context submission state
inconsistencies are detected:
Previously we would let the periodic hang checker ensure that there were no
context submission status inconsistencies on any engine, at any point. If an
inconsistency was detected in the per-engine hang recovery path we would back
off and defer to the next hang check since per-engine hang recovery is not
effective during inconsistent context submission states.
What we do in this new version is to move the consistency pre-check from the
hang checker to the earliest point in the per-engine hang recovery path. If we
detect an inconsistency at that point we fake a potentially lost context event
interrupt by forcing a CSB check. If there are outstanding events in the CSB
these will be acted upon and hopefully that will bring the driver up to speed
with the hardware. If the CSB check did not amount to anything it is concluded
that the inconsistency is unresolvable and the per-engine hang recovery fails
and promotes to full GPU reset instead.
In the hang checker-based consistency checking we would check the inconsistency
for a number of times to make sure the detected state was stable before
attempting to rectify the situation. This is possible since hang checking
is a recurring event. Having moved the consistency checking to the recovery
path instead (i.e. a one-time, fire & forget-style event) it is assumed
that the hang detection that brought on the hang recovery has detected a
stable hang and therefore, if an inconsistency is detected at that point,
the inconsistency must be stable and not the result of a momentary context
state transition. Therefore, unlike in the hang checker case, at the very
first indication of an inconsistent context submission status the interrupt
is faked speculatively. If outstanding CSB events are found it is
determined that the hang was in fact just a context submission status
inconsistency and no hang recovery is done. If the inconsistency cannot be
resolved the per-engine hang recovery is failed and the hang is promoted to
full GPU reset instead.
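As a rough sketch, the v2 pre-check decision flow described above reduces to three outcomes. The enum values mirror the ones used in the patch, while the result enum and function names below are illustrative stand-ins for the driver code:

```c
#include <assert.h>
#include <stdbool.h>

enum context_submission_status {
	CONTEXT_SUBMISSION_STATUS_OK,
	CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED,
	CONTEXT_SUBMISSION_STATUS_INCONSISTENT,
};

/*
 * Outcome of the pre-check: proceed with the engine reset, bail out
 * without resetting, or fail so the error handler promotes the hang
 * to a full GPU reset.
 */
enum precheck_result { PROCEED, SKIP_RESET, PROMOTE_TO_FULL_RESET };

/*
 * csb_check_effective stands in for the forced CSB check: true means
 * acting on pending CSB events resolved the inconsistency.
 */
enum precheck_result
cssc_precheck(enum context_submission_status status, bool csb_check_effective)
{
	switch (status) {
	case CONTEXT_SUBMISSION_STATUS_OK:
		return PROCEED;
	case CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED:
		return SKIP_RESET;	/* nothing in flight to recover */
	case CONTEXT_SUBMISSION_STATUS_INCONSISTENT:
		return csb_check_effective ? SKIP_RESET
					   : PROMOTE_TO_FULL_RESET;
	}
	return PROMOTE_TO_FULL_RESET;	/* unexpected status */
}
```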
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
drivers/gpu/drm/i915/i915_drv.c | 118 ++++++++++++++++++++++++++---------
drivers/gpu/drm/i915/i915_irq.c | 32 ++--------
drivers/gpu/drm/i915/intel_lrc.c | 67 ++++++++++++++++++--
drivers/gpu/drm/i915/intel_lrc.h | 2 +-
drivers/gpu/drm/i915/intel_lrc_tdr.h | 3 +
5 files changed, 159 insertions(+), 63 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index c7ba64e..2ccb2e8 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -917,6 +917,71 @@ int i915_reset(struct drm_device *dev)
return 0;
}
+static bool i915_gem_reset_engine_CSSC_precheck(
+ struct drm_i915_private *dev_priv,
+ struct intel_engine_cs *engine,
+ struct drm_i915_gem_request **req,
+ int *ret)
+{
+ bool precheck_ok = true;
+ enum context_submission_status status;
+
+ WARN_ON(!ret);
+
+ *ret = 0;
+
+ status = intel_execlists_TDR_get_current_request(engine, req);
+
+ /*
+ * If the hardware and driver states do not coincide
+ * or if there for some reason is no current context
+ * in the process of being submitted then bail out and
+ * try again. Do not proceed unless we have reliable
+ * current context state information.
+ */
+ if (status == CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED) {
+ /*
+ * No work in flight. This state is possible to get
+ * into if calling the error handler directly from
+ * debugfs. Just do early exit and forget it happened.
+ */
+ WARN(1, "No work in flight! Aborting recovery on %s\n",
+ engine->name);
+
+ precheck_ok = false;
+ *ret = 0;
+
+ } else if (status == CONTEXT_SUBMISSION_STATUS_INCONSISTENT) {
+ if (!intel_execlists_TDR_force_CSB_check(dev_priv, engine)) {
+ /*
+ * Context submission state is inconsistent and
+ * faking a context event IRQ did not help.
+ * Fail and promote to higher level of
+ * recovery!
+ */
+ precheck_ok = false;
+ *ret = -EINVAL;
+ } else {
+ /*
+ * Rectifying the inconsistent context
+ * submission status helped! No reset required,
+ * just exit and move on!
+ */
+ precheck_ok = false;
+ *ret = 0;
+ }
+
+ } else if (status != CONTEXT_SUBMISSION_STATUS_OK) {
+ WARN(1, "Unexpected context submission status (%u) on %s\n",
+ status, engine->name);
+
+ precheck_ok = false;
+ *ret = -EINVAL;
+ }
+
+ return precheck_ok;
+}
+
/**
* i915_reset_engine - reset GPU engine after a hang
* @engine: engine to reset
@@ -958,20 +1023,17 @@ int i915_reset_engine(struct intel_engine_cs *engine)
i915_gem_reset_ring_status(dev_priv, engine);
if (i915.enable_execlists) {
- enum context_submission_status status =
- intel_execlists_TDR_get_current_request(engine, NULL);
-
/*
- * If the hardware and driver states do not coincide
- * or if there for some reason is no current context
- * in the process of being submitted then bail out and
- * try again. Do not proceed unless we have reliable
- * current context state information.
+ * Check context submission status consistency (CSSC) before
+ * moving on. If the driver and hardware have different
+ * opinions about what is going on just fail and escalate to a
+ * higher form of hang recovery.
*/
- if (status != CONTEXT_SUBMISSION_STATUS_OK) {
- ret = -EAGAIN;
+ if (!i915_gem_reset_engine_CSSC_precheck(dev_priv,
+ engine,
+ NULL,
+ &ret))
goto reset_engine_error;
- }
}
ret = intel_ring_disable(engine);
@@ -981,27 +1043,21 @@ int i915_reset_engine(struct intel_engine_cs *engine)
}
if (i915.enable_execlists) {
- enum context_submission_status status;
- bool inconsistent;
-
- status = intel_execlists_TDR_get_current_request(engine,
- &current_request);
-
- inconsistent = (status != CONTEXT_SUBMISSION_STATUS_OK);
- if (inconsistent) {
- /*
- * If we somehow have reached this point with
- * an inconsistent context submission status then
- * back out of the previously requested reset and
- * retry later.
- */
- WARN(inconsistent,
- "Inconsistent context status on %s: %u\n",
- engine->name, status);
-
- ret = -EAGAIN;
+ /*
+ * Get a hold of the currently executing context.
+ *
+ * Context submission status consistency is done implicitly so
+ * we might as well check it post-engine disablement since we
+ * get that option for free. Also, it's conceivable that the
+ * context submission state might have changed as part of the
+ * reset request on gen8+ so it's not completely devoid of
+ * value to do this.
+ */
+ if (!i915_gem_reset_engine_CSSC_precheck(dev_priv,
+ engine,
+ &current_request,
+ &ret))
goto reenable_reset_engine_error;
- }
}
/* Sample the current ring head position */
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 5672a2c..22dd96c 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -36,6 +36,7 @@
#include "i915_drv.h"
#include "i915_trace.h"
#include "intel_drv.h"
+#include "intel_lrc_tdr.h"
/**
* DOC: interrupt handling
@@ -1286,7 +1287,7 @@ static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
ret = IRQ_HANDLED;
if (tmp & (GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT))
- intel_lrc_irq_handler(&dev_priv->ring[RCS]);
+ intel_lrc_irq_handler(&dev_priv->ring[RCS], true);
if (tmp & (GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT))
notify_ring(&dev_priv->ring[RCS]);
if (tmp & (GT_GEN8_RCS_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT)) {
@@ -1303,7 +1304,7 @@ static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
}
if (tmp & (GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT))
- intel_lrc_irq_handler(&dev_priv->ring[BCS]);
+ intel_lrc_irq_handler(&dev_priv->ring[BCS], true);
if (tmp & (GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT))
notify_ring(&dev_priv->ring[BCS]);
} else
@@ -1317,7 +1318,7 @@ static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
ret = IRQ_HANDLED;
if (tmp & (GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT))
- intel_lrc_irq_handler(&dev_priv->ring[VCS]);
+ intel_lrc_irq_handler(&dev_priv->ring[VCS], true);
if (tmp & (GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT))
notify_ring(&dev_priv->ring[VCS]);
if (tmp & (GT_GEN8_VCS_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT)) {
@@ -1334,7 +1335,7 @@ static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
}
if (tmp & (GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT))
- intel_lrc_irq_handler(&dev_priv->ring[VCS2]);
+ intel_lrc_irq_handler(&dev_priv->ring[VCS2], true);
if (tmp & (GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT))
notify_ring(&dev_priv->ring[VCS2]);
if (tmp & (GT_GEN8_VCS_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT)) {
@@ -1360,7 +1361,7 @@ static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
ret = IRQ_HANDLED;
if (tmp & (GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT))
- intel_lrc_irq_handler(&dev_priv->ring[VECS]);
+ intel_lrc_irq_handler(&dev_priv->ring[VECS], true);
if (tmp & (GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT))
notify_ring(&dev_priv->ring[VECS]);
} else
@@ -2381,27 +2382,6 @@ static void i915_error_work_func(struct work_struct *work)
ret = i915_reset_engine(ring);
- /*
- * Execlist mode only:
- *
- * -EAGAIN means that between detecting a hang (and
- * also determining that the currently submitted
- * context is stable and valid) and trying to recover
- * from the hang the current context changed state.
- * This means that we are probably not completely hung
- * after all. Just fail and retry by exiting all the
- * way back and wait for the next hang detection. If we
- * have a true hang on our hands then we will detect it
- * again, otherwise we will continue like nothing
- * happened.
- */
- if (ret == -EAGAIN) {
- DRM_ERROR("Reset of %s aborted due to " \
- "change in context submission " \
- "state - retrying!", ring->name);
- ret = 0;
- }
-
if (ret) {
DRM_ERROR("Reset of %s failed! (%d)", ring->name, ret);
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ff9d27cb..476eff0 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -672,7 +672,7 @@ static bool execlists_check_remove_request(struct intel_engine_cs *ring,
* Check the unread Context Status Buffers and manage the submission of new
* contexts to the ELSP accordingly.
*/
-void intel_lrc_irq_handler(struct intel_engine_cs *ring)
+int intel_lrc_irq_handler(struct intel_engine_cs *ring, bool do_lock)
{
struct drm_i915_private *dev_priv = ring->dev->dev_private;
u32 status_pointer;
@@ -684,13 +684,14 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
status_pointer = I915_READ(RING_CONTEXT_STATUS_PTR(ring));
+ if (do_lock)
+ spin_lock(&ring->execlist_lock);
+
read_pointer = ring->next_context_status_buffer;
write_pointer = status_pointer & 0x07;
if (read_pointer > write_pointer)
write_pointer += 6;
- spin_lock(&ring->execlist_lock);
-
while (read_pointer < write_pointer) {
read_pointer++;
status = I915_READ(RING_CONTEXT_STATUS_BUF(ring) +
@@ -716,13 +717,16 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
if (submit_contexts != 0)
execlists_context_unqueue(ring, false);
- spin_unlock(&ring->execlist_lock);
-
WARN(submit_contexts > 2, "More than two context complete events?\n");
ring->next_context_status_buffer = write_pointer % 6;
+ if (do_lock)
+ spin_unlock(&ring->execlist_lock);
+
I915_WRITE(RING_CONTEXT_STATUS_PTR(ring),
((u32)ring->next_context_status_buffer & 0x07) << 8);
+
+ return submit_contexts;
}
static int execlists_context_queue(struct intel_engine_cs *ring,
@@ -1356,6 +1360,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *ring)
{
struct drm_device *dev = ring->dev;
struct drm_i915_private *dev_priv = dev->dev_private;
+ unsigned long flags;
I915_WRITE_IMR(ring, ~(ring->irq_enable_mask | ring->irq_keep_mask));
I915_WRITE(RING_HWSTAM(ring->mmio_base), 0xffffffff);
@@ -1364,7 +1369,11 @@ static int gen8_init_common_ring(struct intel_engine_cs *ring)
_MASKED_BIT_DISABLE(GFX_REPLAY_MODE) |
_MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
POSTING_READ(RING_MODE_GEN7(ring));
+
+ spin_lock_irqsave(&ring->execlist_lock, flags);
ring->next_context_status_buffer = 0;
+ spin_unlock_irqrestore(&ring->execlist_lock, flags);
+
DRM_DEBUG_DRIVER("Execlists enabled for %s\n", ring->name);
i915_hangcheck_reinit(ring);
@@ -2586,3 +2595,51 @@ intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
return status;
}
+
+/**
+ * intel_execlists_TDR_force_CSB_check() - check CSB manually to act on pending
+ * context status events.
+ *
+ * @dev_priv: ...
+ * @engine: engine whose CSB is to be checked.
+ *
+ * Return:
+ * True: Consistency restored.
+ * False: Failed trying to restore consistency.
+ *
+ * In case we missed a context event interrupt we can fake this interrupt by
+ * acting on pending CSB events manually by calling this function. This is
+ * normally what would happen in interrupt context but that does not prevent us
+ * from calling it from a user thread.
+ */
+bool intel_execlists_TDR_force_CSB_check(struct drm_i915_private *dev_priv,
+ struct intel_engine_cs *engine)
+{
+ unsigned long flags;
+ bool hw_active;
+ int was_effective;
+
+ hw_active =
+ (I915_READ(RING_EXECLIST_STATUS(engine)) &
+ EXECLIST_STATUS_CURRENT_ACTIVE_ELEMENT_STATUS) ?
+ true : false;
+ if (hw_active) {
+ u32 hw_context;
+
+ hw_context = I915_READ(RING_EXECLIST_STATUS_CTX_ID(engine));
+ WARN(hw_active, "Context (%x) executing on %s - " \
+ "No need for faked IRQ!\n",
+ hw_context, engine->name);
+ return false;
+ }
+
+ spin_lock_irqsave(&engine->execlist_lock, flags);
+ if (!(was_effective = intel_lrc_irq_handler(engine, false)))
+ DRM_ERROR("Forced CSB check of %s ineffective!\n", engine->name);
+ spin_unlock_irqrestore(&engine->execlist_lock, flags);
+
+ wake_up_all(&engine->irq_queue);
+
+ return !!was_effective;
+}
+
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index d2f497c..6fae3c8 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -88,7 +88,7 @@ int intel_execlists_submission(struct drm_device *dev, struct drm_file *file,
u64 exec_start, u32 dispatch_flags);
u32 intel_execlists_ctx_id(struct drm_i915_gem_object *ctx_obj);
-void intel_lrc_irq_handler(struct intel_engine_cs *ring);
+int intel_lrc_irq_handler(struct intel_engine_cs *ring, bool do_lock);
void intel_execlists_retire_requests(struct intel_engine_cs *ring);
int intel_execlists_read_tail(struct intel_engine_cs *ring,
diff --git a/drivers/gpu/drm/i915/intel_lrc_tdr.h b/drivers/gpu/drm/i915/intel_lrc_tdr.h
index 4520753..041c808 100644
--- a/drivers/gpu/drm/i915/intel_lrc_tdr.h
+++ b/drivers/gpu/drm/i915/intel_lrc_tdr.h
@@ -32,5 +32,8 @@ enum context_submission_status
intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
struct drm_i915_gem_request **req);
+bool intel_execlists_TDR_force_CSB_check(struct drm_i915_private *dev_priv,
+ struct intel_engine_cs *engine);
+
#endif /* _INTEL_LRC_TDR_H_ */
--
1.9.1
* [RFCv2 08/12] drm/i915: Debugfs interface for per-engine hang recovery.
2015-07-21 13:58 [RFCv2 00/12] TDR/watchdog timeout support for gen8 Tomas Elf
` (6 preceding siblings ...)
2015-07-21 13:58 ` [RFCv2 07/12] drm/i915: Fake lost context interrupts through forced CSB check Tomas Elf
@ 2015-07-21 13:58 ` Tomas Elf
2015-07-21 13:58 ` [RFCv2 09/12] drm/i915: TDR/watchdog trace points Tomas Elf
` (4 subsequent siblings)
12 siblings, 0 replies; 15+ messages in thread
From: Tomas Elf @ 2015-07-21 13:58 UTC (permalink / raw)
To: Intel-GFX; +Cc: Ian Lister
1. The i915_wedged_set() function allows us to schedule three forms of hang recovery:
a) Legacy hang recovery: By passing e.g. -1 we trigger the legacy full
GPU reset recovery path.
b) Single engine hang recovery: By passing an engine ID in the interval
of [0, I915_NUM_RINGS) we can schedule hang recovery of any single
engine assuming that the context submission consistency requirements
are met (otherwise the hang recovery path will simply exit early and
wait for another hang detection). The values are assumed to use up bits
3:0 only since we certainly do not support as many as 16 engines.
This mode is supported since there are several legacy test applications
that rely on this interface.
c) Multiple engine hang recovery: By passing in an engine flag mask in
bits 31:8 (bit 8 corresponds to engine 0 = RCS, bit 9 corresponds to
engine 1 = VCS etc) we can schedule any combination of engine hang
recoveries as we please. For example, by passing in the value 0x3 << 8
we would schedule hang recovery for engines 0 and 1 (RCS and VCS) at
the same time.
If bits in fields 3:0 and 31:8 are both used then single engine hang
recovery mode takes precedence and bits 31:8 are ignored.
2. The i915_hangcheck_read() function produces a set of statistics related to:
a) Number of engine hangs detected by periodic hang checker.
b) Number of watchdog timeout hangs detected.
c) Number of full GPU resets carried out.
d) Number of engine resets carried out.
These statistics are presented in a very parser-friendly way and are
used by the TDR ULT to poll system behaviour to validate test outcomes.
* v2: (Chris Wilson)
- After review comments by Chris Wilson we're dropping the dual-mode parameter
value interpretation in i915_wedged_set(). In this version we only accept
engine id flag masks that contain the engine id flags of all currently hung
engines. Full GPU reset is most easily requested by passing an all zero
engine id flag mask.
- Moved TDR-specific engine metrics like number of detected engine hangs and
number of per-engine resets into i915_hangcheck_info() from
i915_hangcheck_read().
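A minimal sketch of the v2 value interpretation follows: a val of zero, or any bit set at or above I915_NUM_RINGS, requests a full GPU reset, while otherwise each low bit selects one engine (bit 0 is RCS). The names and the fixed engine count are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_RINGS 5	/* RCS, VCS, BCS, VECS, VCS2 on gen8 */

/*
 * Full GPU reset is implied when no engine bit is set or when bits
 * beyond the supported engines are present (v2 semantics described
 * in the commit message above).
 */
int is_full_reset(uint64_t val)
{
	return val == 0 || val >= (1ULL << NUM_RINGS);
}

/* Per-engine reset mask: one bit per engine, bit 0 == RCS. */
uint32_t engine_reset_mask(uint64_t val)
{
	if (is_full_reset(val))
		return 0;	/* no individual engines selected */
	return (uint32_t)val;
}
```

For example, writing 0x3 to the debugfs entry would schedule recovery for RCS and VCS together, while writing 0 would schedule a full GPU reset.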
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
drivers/gpu/drm/i915/i915_debugfs.c | 76 ++++++++++++++++++++++++++++++++++---
1 file changed, 71 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index a89da48..d99c152 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1302,6 +1302,8 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
} else
seq_printf(m, "Hangcheck inactive\n");
+ seq_printf(m, "Full GPU resets = %u\n", i915_reset_count(&dev_priv->gpu_error));
+
for_each_ring(ring, dev_priv, i) {
seq_printf(m, "%s:\n", ring->name);
seq_printf(m, "\tseqno = %x [current %x]\n",
@@ -1313,6 +1315,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
(long long)ring->hangcheck.max_acthd);
seq_printf(m, "\tscore = %d\n", ring->hangcheck.score);
seq_printf(m, "\taction = %d\n", ring->hangcheck.action);
+ seq_printf(m, "\tengine resets = %u\n",
+ ring->hangcheck.reset_count);
+ seq_printf(m, "\tengine hang detections = %u\n",
+ ring->hangcheck.tdr_count);
+ seq_printf(m, "\tengine watchdog timeout detections = %u\n",
+ ring->hangcheck.watchdog_count);
}
return 0;
@@ -2030,7 +2038,7 @@ static int i915_execlists(struct seq_file *m, void *data)
seq_printf(m, "%s\n", ring->name);
status = I915_READ(RING_EXECLIST_STATUS(ring));
- ctx_id = I915_READ(RING_EXECLIST_STATUS(ring) + 4);
+ ctx_id = I915_READ(RING_EXECLIST_STATUS_CTX_ID(ring));
seq_printf(m, "\tExeclist status: 0x%08X, context: %u\n",
status, ctx_id);
@@ -4164,11 +4172,47 @@ i915_wedged_get(void *data, u64 *val)
return 0;
}
+static const char *ringid_to_str(enum intel_ring_id ring_id)
+{
+ switch (ring_id) {
+ case RCS:
+ return "RCS";
+ case VCS:
+ return "VCS";
+ case BCS:
+ return "BCS";
+ case VECS:
+ return "VECS";
+ case VCS2:
+ return "VCS2";
+ }
+
+ return "unknown";
+}
+
static int
i915_wedged_set(void *data, u64 val)
{
struct drm_device *dev = data;
struct drm_i915_private *dev_priv = dev->dev_private;
+ struct intel_engine_cs *engine;
+ u32 i;
+#define ENGINE_MSGLEN 64
+ char msg[ENGINE_MSGLEN];
+
+ /*
+ * Val contains the engine flag mask of engines to be reset.
+ *
+ * Full GPU reset is implied in the following two cases:
+ * 1. val == 0x0
+ * 2. val >= (1 << I915_NUM_RINGS)
+ *
+ * Bit 0: RCS engine
+ * Bit 1: VCS engine
+ * Bit 2: BCS engine
+ * Bit 3: VECS engine
+ * Bit 4: VCS2 engine (if available)
+ */
/*
* There is no safeguard against this debugfs entry colliding
@@ -4177,14 +4221,36 @@ i915_wedged_set(void *data, u64 val)
* test harness is responsible enough not to inject gpu hangs
* while it is writing to 'i915_wedged'
*/
-
- if (i915_reset_in_progress(&dev_priv->gpu_error))
+ if (i915_gem_check_wedge(dev_priv, NULL, true))
return -EAGAIN;
intel_runtime_pm_get(dev_priv);
- i915_handle_error(dev, 0x0, false, val,
- "Manually setting wedged to %llu", val);
+ memset(msg, 0, sizeof(msg));
+
+ if (val) {
+ scnprintf(msg, sizeof(msg), "Manual reset:");
+
+ /* Assemble message string */
+ for_each_ring(engine, dev_priv, i)
+ if (intel_ring_flag(engine) & val) {
+ DRM_INFO("Manual reset: %s\n", engine->name);
+
+ scnprintf(msg, sizeof(msg),
+ "%s [%s]",
+ msg,
+ ringid_to_str(i));
+ }
+
+ } else {
+ scnprintf(msg, sizeof(msg), "Manual global reset");
+ }
+
+ i915_handle_error(dev,
+ val,
+ false,
+ true,
+ msg);
intel_runtime_pm_put(dev_priv);
--
1.9.1
* [RFCv2 09/12] drm/i915: TDR/watchdog trace points.
2015-07-21 13:58 [RFCv2 00/12] TDR/watchdog timeout support for gen8 Tomas Elf
` (7 preceding siblings ...)
2015-07-21 13:58 ` [RFCv2 08/12] drm/i915: Debugfs interface for per-engine hang recovery Tomas Elf
@ 2015-07-21 13:58 ` Tomas Elf
2015-07-21 13:58 ` [RFCv2 10/12] drm/i915: Port of Added scheduler support to __wait_request() calls Tomas Elf
` (3 subsequent siblings)
12 siblings, 0 replies; 15+ messages in thread
From: Tomas Elf @ 2015-07-21 13:58 UTC (permalink / raw)
To: Intel-GFX
Defined trace points and sprinkled the usage of these throughout the
TDR/watchdog implementation.
The following trace points are supported:
1. trace_i915_tdr_gpu_recovery:
Called at the onset of the full GPU reset recovery path.
2. trace_i915_tdr_engine_recovery:
Called at the onset of the per-engine recovery path.
3. i915_tdr_recovery_start:
Called at the onset of hang recovery before recovery mode has been
decided.
4. i915_tdr_recovery_complete:
Called at the point of hang recovery completion.
5. i915_tdr_recovery_queued:
Called once the error handler decides to schedule the actual hang
recovery, which marks the end of the hang detection path.
6. i915_tdr_engine_save:
Called at the point of saving the engine state during per-engine hang
recovery.
7. i915_tdr_gpu_reset_complete:
Called at the point of full GPU reset recovery completion.
8. i915_tdr_engine_reset_complete:
Called at the point of per-engine recovery completion.
9. i915_tdr_forced_csb_check:
Called at the completion of a forced CSB check.
10. i915_tdr_hang_check:
Called for every engine in the periodic hang checker loop before moving
on to the next engine. Provides an overview of all hang check stats in
real-time. The collected stats are:
a. Engine name.
b. Current engine seqno.
c. Seqno of previous hang check iteration for that engine.
d. ACTHD register value of given engine.
e. Current hang check score of given engine (and whether or not
the engine has been detected as hung).
f. Current action for given engine.
g. Busyness of given engine.
h. Submission status of currently running context on given engine.
* v2:
As a consequence of the faking context event interrupt commit being moved from
the hang checker to the per-engine recovery path we no longer check context
submission status from the hang checker. Therefore there is no need to provide
submission status of the currently running context to the
trace_i915_tdr_hang_check() event.
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
drivers/gpu/drm/i915/i915_drv.c | 3 +
drivers/gpu/drm/i915/i915_drv.h | 1 +
drivers/gpu/drm/i915/i915_gpu_error.c | 2 +-
drivers/gpu/drm/i915/i915_irq.c | 9 +-
drivers/gpu/drm/i915/i915_trace.h | 293 ++++++++++++++++++++++++++++++++++
drivers/gpu/drm/i915/intel_lrc.c | 8 +-
drivers/gpu/drm/i915/intel_uncore.c | 4 +
7 files changed, 316 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 2ccb2e8..f923036 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -845,6 +845,7 @@ int i915_reset(struct drm_device *dev)
if (!i915.reset)
return 0;
+ trace_i915_tdr_gpu_recovery(dev);
intel_reset_gt_powersave(dev);
mutex_lock(&dev->struct_mutex);
@@ -1017,6 +1018,8 @@ int i915_reset_engine(struct intel_engine_cs *engine)
WARN_ON(!mutex_is_locked(&dev->struct_mutex));
+ trace_i915_tdr_engine_recovery(engine);
+
/* Take wake lock to prevent power saving mode */
intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3d31872..9f37cba 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3129,6 +3129,7 @@ void i915_destroy_error_state(struct drm_device *dev);
void i915_get_extra_instdone(struct drm_device *dev, uint32_t *instdone);
const char *i915_cache_level_str(struct drm_i915_private *i915, int type);
+const char *hangcheck_action_to_str(enum intel_ring_hangcheck_action a);
/* i915_cmd_parser.c */
int i915_cmd_parser_get_version(void);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index ac22614..cee1825 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -220,7 +220,7 @@ static void print_error_buffers(struct drm_i915_error_state_buf *m,
}
}
-static const char *hangcheck_action_to_str(enum intel_ring_hangcheck_action a)
+const char *hangcheck_action_to_str(enum intel_ring_hangcheck_action a)
{
switch (a) {
case HANGCHECK_IDLE:
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 22dd96c..beee624 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2359,6 +2359,7 @@ static void i915_error_work_func(struct work_struct *work)
mutex_lock(&dev->struct_mutex);
+ trace_i915_tdr_recovery_start(dev);
kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, error_event);
for_each_ring(ring, dev_priv, i) {
@@ -2477,6 +2478,7 @@ static void i915_error_work_func(struct work_struct *work)
kobject_uevent_env(&dev->primary->kdev->kobj,
KOBJ_CHANGE, reset_done_event);
}
+ trace_i915_tdr_recovery_complete(dev);
}
static void i915_report_and_clear_eir(struct drm_device *dev)
@@ -2607,6 +2609,7 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask,
struct drm_i915_private *dev_priv = dev->dev_private;
va_list args;
char error_msg[80];
+ bool full_reset = true;
struct intel_engine_cs *engine;
@@ -2624,7 +2627,6 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask,
* 2. The hardware does not support it (pre-gen7).
* 3. We already tried per-engine reset recently.
*/
- bool full_reset = true;
/*
* TBD: We currently only support per-engine reset for gen8+.
@@ -2648,6 +2650,7 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask,
i915.gpu_reset_promotion_time;
engine->hangcheck.last_engine_reset_time = now;
+
} else {
/*
* Watchdog timeout always results
@@ -2696,6 +2699,8 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask,
i915_error_wake_up(dev_priv, false);
}
+ trace_i915_tdr_recovery_queued(dev, engine_mask, watchdog, full_reset);
+
/*
* Gen 7:
*
@@ -3146,6 +3151,8 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
ring->hangcheck.seqno = seqno;
ring->hangcheck.acthd = acthd;
busy_count += busy;
+
+ trace_i915_tdr_hang_check(ring, seqno, acthd, busy);
}
for_each_ring(ring, dev_priv, i) {
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index 2aa140e..b970259 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -775,6 +775,299 @@ TRACE_EVENT(switch_mm,
__entry->dev, __entry->ring, __entry->to, __entry->vm)
);
+/**
+ * DOC: i915_tdr_gpu_recovery
+ *
+ * This tracepoint tracks the onset of the full GPU recovery path
+ */
+TRACE_EVENT(i915_tdr_gpu_recovery,
+ TP_PROTO(struct drm_device *dev),
+
+ TP_ARGS(dev),
+
+ TP_STRUCT__entry(
+ __field(u32, dev)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = dev->primary->index;
+ ),
+
+ TP_printk("dev=%u, full GPU recovery started",
+ __entry->dev)
+);
+
+/**
+ * DOC: i915_tdr_engine_recovery
+ *
+ * This tracepoint tracks the onset of the engine recovery path
+ */
+TRACE_EVENT(i915_tdr_engine_recovery,
+ TP_PROTO(struct intel_engine_cs *engine),
+
+ TP_ARGS(engine),
+
+ TP_STRUCT__entry(
+ __field(struct intel_engine_cs *, engine)
+ ),
+
+ TP_fast_assign(
+ __entry->engine = engine;
+ ),
+
+ TP_printk("dev=%u, engine=%u, recovery of %s started",
+ __entry->engine->dev->primary->index,
+ __entry->engine->id,
+ __entry->engine->name)
+);
+
+/**
+ * DOC: i915_tdr_recovery_start
+ *
+ * This tracepoint tracks hang recovery start
+ */
+TRACE_EVENT(i915_tdr_recovery_start,
+ TP_PROTO(struct drm_device *dev),
+
+ TP_ARGS(dev),
+
+ TP_STRUCT__entry(
+ __field(u32, dev)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = dev->primary->index;
+ ),
+
+ TP_printk("dev=%u, hang recovery started",
+ __entry->dev)
+);
+
+/**
+ * DOC: i915_tdr_recovery_complete
+ *
+ * This tracepoint tracks hang recovery completion
+ */
+TRACE_EVENT(i915_tdr_recovery_complete,
+ TP_PROTO(struct drm_device *dev),
+
+ TP_ARGS(dev),
+
+ TP_STRUCT__entry(
+ __field(u32, dev)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = dev->primary->index;
+ ),
+
+ TP_printk("dev=%u, hang recovery completed",
+ __entry->dev)
+);
+
+/**
+ * DOC: i915_tdr_recovery_queued
+ *
+ * This tracepoint tracks the point of queuing recovery from hang check.
+ * If engine recovery is requested engine name will be displayed, otherwise
+ * it will be set to "none". If too many engine reset was attempted in the
+ * previous history we promote to full GPU reset, which is remarked by appending
+ * the "[PROMOTED]" flag.
+ */
+TRACE_EVENT(i915_tdr_recovery_queued,
+ TP_PROTO(struct drm_device *dev,
+ u32 hung_engines,
+ bool watchdog,
+ bool full_reset),
+
+ TP_ARGS(dev, hung_engines, watchdog, full_reset),
+
+ TP_STRUCT__entry(
+ __field(u32, dev)
+ __field(u32, hung_engines)
+ __field(bool, watchdog)
+ __field(bool, full_reset)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = dev->primary->index;
+ __entry->hung_engines = hung_engines;
+ __entry->watchdog = watchdog;
+ __entry->full_reset = full_reset;
+ ),
+
+ TP_printk("dev=%u, hung_engines=0x%02x%s%s%s%s%s%s%s, watchdog=%s, full_reset=%s",
+ __entry->dev,
+ __entry->hung_engines,
+ __entry->hung_engines ? " (":"",
+ __entry->hung_engines & RENDER_RING ? " [RCS] " : "",
+ __entry->hung_engines & BSD_RING ? " [VCS] " : "",
+ __entry->hung_engines & BLT_RING ? " [BCS] " : "",
+ __entry->hung_engines & VEBOX_RING ? " [VECS] " : "",
+ __entry->hung_engines & BSD2_RING ? " [VCS2] " : "",
+ __entry->hung_engines ? ")":"",
+ __entry->watchdog ? "true" : "false",
+ __entry->full_reset ?
+ (__entry->hung_engines ? "true [PROMOTED]" : "true") :
+ "false")
+);
+
+/**
+ * DOC: i915_tdr_engine_save
+ *
+ * This tracepoint tracks the point of engine state save during the engine
+ * recovery path. Logs the head pointer position at point of hang, the position
+ * after recovering and whether or not we forced a head pointer advancement or
+ * rounded up to an aligned QWORD position.
+ */
+TRACE_EVENT(i915_tdr_engine_save,
+ TP_PROTO(struct intel_engine_cs *engine,
+ u32 old_head,
+ u32 new_head,
+ bool forced_advance),
+
+ TP_ARGS(engine, old_head, new_head, forced_advance),
+
+ TP_STRUCT__entry(
+ __field(struct intel_engine_cs *, engine)
+ __field(u32, old_head)
+ __field(u32, new_head)
+ __field(bool, forced_advance)
+ ),
+
+ TP_fast_assign(
+ __entry->engine = engine;
+ __entry->old_head = old_head;
+ __entry->new_head = new_head;
+ __entry->forced_advance = forced_advance;
+ ),
+
+ TP_printk("dev=%u, engine=%s, old_head=%u, new_head=%u, forced_advance=%s",
+ __entry->engine->dev->primary->index,
+ __entry->engine->name,
+ __entry->old_head,
+ __entry->new_head,
+ __entry->forced_advance ? "true" : "false")
+);
+
+/**
+ * DOC: i915_tdr_gpu_reset_complete
+ *
+ * This tracepoint tracks the point of full GPU reset completion
+ */
+TRACE_EVENT(i915_tdr_gpu_reset_complete,
+ TP_PROTO(struct drm_device *dev),
+
+ TP_ARGS(dev),
+
+ TP_STRUCT__entry(
+ __field(struct drm_device *, dev)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = dev;
+ ),
+
+ TP_printk("dev=%u, resets=%u",
+ __entry->dev->primary->index,
+ i915_reset_count(&((struct drm_i915_private *)
+ (__entry->dev)->dev_private)->gpu_error))
+);
+
+/**
+ * DOC: i915_tdr_engine_reset_complete
+ *
+ * This tracepoint tracks the point of engine reset completion
+ */
+TRACE_EVENT(i915_tdr_engine_reset_complete,
+ TP_PROTO(struct intel_engine_cs *engine),
+
+ TP_ARGS(engine),
+
+ TP_STRUCT__entry(
+ __field(struct intel_engine_cs *, engine)
+ ),
+
+ TP_fast_assign(
+ __entry->engine = engine;
+ ),
+
+ TP_printk("dev=%u, engine=%s, resets=%u",
+ __entry->engine->dev->primary->index,
+ __entry->engine->name,
+ __entry->engine->hangcheck.reset_count)
+);
+
+/**
+ * DOC: i915_tdr_forced_csb_check
+ *
+ * This tracepoint tracks the occurrences of forced CSB checks
+ * that the driver does when detecting an inconsistent context
+ * submission state between the driver state and the current
+ * hardware engine state.
+ */
+TRACE_EVENT(i915_tdr_forced_csb_check,
+ TP_PROTO(struct intel_engine_cs *engine,
+ bool was_effective),
+
+ TP_ARGS(engine, was_effective),
+
+ TP_STRUCT__entry(
+ __field(struct intel_engine_cs *, engine)
+ __field(bool, was_effective)
+ ),
+
+ TP_fast_assign(
+ __entry->engine = engine;
+ __entry->was_effective = was_effective;
+ ),
+
+ TP_printk("dev=%u, engine=%s, was_effective=%s",
+ __entry->engine->dev->primary->index,
+ __entry->engine->name,
+ __entry->was_effective ? "yes" : "no")
+);
+
+/**
+ * DOC: i915_tdr_hang_check
+ *
+ * This tracepoint tracks hang checks on each engine.
+ */
+TRACE_EVENT(i915_tdr_hang_check,
+ TP_PROTO(struct intel_engine_cs *engine,
+ u32 seqno,
+ u64 acthd,
+ bool busy),
+
+ TP_ARGS(engine, seqno, acthd, busy),
+
+ TP_STRUCT__entry(
+ __field(struct intel_engine_cs *, engine)
+ __field(u32, seqno)
+ __field(u64, acthd)
+ __field(bool, busy)
+ ),
+
+ TP_fast_assign(
+ __entry->engine = engine;
+ __entry->seqno = seqno;
+ __entry->acthd = acthd;
+ __entry->busy = busy;
+ ),
+
+ TP_printk("dev=%u, engine=%s, seqno=%u, last seqno=%u, acthd=%llu, score=%d%s, action=%u [%s], busy=%s",
+ __entry->engine->dev->primary->index,
+ __entry->engine->name,
+ __entry->seqno,
+ __entry->engine->hangcheck.seqno,
+ (unsigned long long) __entry->acthd,
+ __entry->engine->hangcheck.score,
+ (__entry->engine->hangcheck.score >= HANGCHECK_SCORE_RING_HUNG) ? " [HUNG]" : "",
+ (unsigned int) __entry->engine->hangcheck.action,
+ hangcheck_action_to_str(__entry->engine->hangcheck.action),
+ __entry->busy ? "yes" : "no")
+);
+
#endif /* _I915_TRACE_H_ */
/* This part must be outside protection */
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 476eff0..a8f2954 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1663,7 +1663,7 @@ gen8_ring_save(struct intel_engine_cs *ring, struct drm_i915_gem_request *req,
struct intel_context *ctx;
int ret = 0;
int clamp_to_tail = 0;
- uint32_t head;
+ uint32_t head, old_head;
uint32_t tail;
uint32_t head_addr;
uint32_t tail_addr;
@@ -1678,7 +1678,7 @@ gen8_ring_save(struct intel_engine_cs *ring, struct drm_i915_gem_request *req,
* Read head from MMIO register since it contains the
* most up to date value of head at this point.
*/
- head = I915_READ_HEAD(ring);
+ old_head = head = I915_READ_HEAD(ring);
/*
* Read tail from the context because the execlist queue
@@ -1735,6 +1735,9 @@ gen8_ring_save(struct intel_engine_cs *ring, struct drm_i915_gem_request *req,
head |= (head_addr & HEAD_ADDR);
ring->saved_head = head;
+ trace_i915_tdr_engine_save(ring, old_head,
+ head, force_advance);
+
return 0;
}
@@ -2638,6 +2641,7 @@ bool intel_execlists_TDR_force_CSB_check(struct drm_i915_private *dev_priv,
DRM_ERROR("Forced CSB check of %s ineffective!\n", engine->name);
spin_unlock_irqrestore(&engine->execlist_lock, flags);
+ trace_i915_tdr_forced_csb_check(engine, !!was_effective);
wake_up_all(&engine->irq_queue);
return !!was_effective;
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 91427ac..67a34aa 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1442,6 +1442,8 @@ static int gen6_do_reset(struct drm_device *dev)
/* Spin waiting for the device to ack the reset request */
ret = wait_for((__raw_i915_read32(dev_priv, GEN6_GDRST) & GEN6_GRDOM_FULL) == 0, 500);
+ trace_i915_tdr_gpu_reset_complete(dev);
+
intel_uncore_forcewake_reset(dev, true);
return ret;
@@ -1520,6 +1522,8 @@ static int do_engine_reset_nolock(struct intel_engine_cs *engine)
break;
}
+ trace_i915_tdr_engine_reset_complete(engine);
+
return ret;
}
--
1.9.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [RFCv2 10/12] drm/i915: Port of Added scheduler support to __wait_request() calls
2015-07-21 13:58 [RFCv2 00/12] TDR/watchdog timeout support for gen8 Tomas Elf
` (8 preceding siblings ...)
2015-07-21 13:58 ` [RFCv2 09/12] drm/i915: TDR/watchdog trace points Tomas Elf
@ 2015-07-21 13:58 ` Tomas Elf
2015-07-21 13:58 ` [RFCv2 11/12] drm/i915: Fix __i915_wait_request() behaviour during hang detection Tomas Elf
` (2 subsequent siblings)
12 siblings, 0 replies; 15+ messages in thread
From: Tomas Elf @ 2015-07-21 13:58 UTC (permalink / raw)
To: Intel-GFX
This is a partial port of the following patch from John Harrison's GPU
scheduler patch series: (patch sent to Intel-GFX with the subject line
"[Intel-gfx] [RFC 19/39] drm/i915: Added scheduler support to __wait_request()
calls" on Fri 17 July 2015)
Author: John Harrison <John.C.Harrison@Intel.com>
Date: Thu Apr 10 10:48:55 2014 +0100
Subject: drm/i915: Added scheduler support to __wait_request() calls
Removed all scheduler references and backported it to this baseline. The reason
we need this is that Chris Wilson has pointed out that threads that don't
hold the struct_mutex should not be thrown out of __i915_wait_request during
TDR hang recovery. Therefore we need a way to determine which threads are
holding the mutex and which are not.
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: John Harrison <john.c.harrison@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
drivers/gpu/drm/i915/i915_drv.h | 3 ++-
drivers/gpu/drm/i915/i915_gem.c | 11 ++++++-----
drivers/gpu/drm/i915/intel_display.c | 2 +-
3 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 9f37cba..2cf8b38 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2864,7 +2864,8 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
unsigned reset_counter,
bool interruptible,
s64 *timeout,
- struct drm_i915_file_private *file_priv);
+ struct drm_i915_file_private *file_priv,
+ bool is_locked);
int __must_check i915_wait_request(struct drm_i915_gem_request *req);
int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
int __must_check
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 036735b..340418b 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1233,7 +1233,8 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
unsigned reset_counter,
bool interruptible,
s64 *timeout,
- struct drm_i915_file_private *file_priv)
+ struct drm_i915_file_private *file_priv,
+ bool is_locked)
{
struct intel_engine_cs *ring = i915_gem_request_get_ring(req);
struct drm_device *dev = ring->dev;
@@ -1376,7 +1377,7 @@ i915_wait_request(struct drm_i915_gem_request *req)
reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
i915_gem_request_reference(req);
ret = __i915_wait_request(req, reset_counter,
- interruptible, NULL, NULL);
+ interruptible, NULL, NULL, true);
i915_gem_request_unreference(req);
return ret;
}
@@ -1456,7 +1457,7 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
i915_gem_request_reference(req);
mutex_unlock(&dev->struct_mutex);
- ret = __i915_wait_request(req, reset_counter, true, NULL, file_priv);
+ ret = __i915_wait_request(req, reset_counter, true, NULL, file_priv, false);
mutex_lock(&dev->struct_mutex);
i915_gem_request_unreference(req);
if (ret)
@@ -2950,7 +2951,7 @@ i915_gem_wait_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
ret = __i915_wait_request(req, reset_counter, true,
args->timeout_ns > 0 ? &args->timeout_ns : NULL,
- file->driver_priv);
+ file->driver_priv, false);
i915_gem_request_unreference__unlocked(req);
return ret;
@@ -4139,7 +4140,7 @@ i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
if (target == NULL)
return 0;
- ret = __i915_wait_request(target, reset_counter, true, NULL, NULL);
+ ret = __i915_wait_request(target, reset_counter, true, NULL, NULL, false);
if (ret == 0)
queue_delayed_work(dev_priv->wq, &dev_priv->mm.retire_work, 0);
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 97922fb..3e1db90 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -10358,7 +10358,7 @@ static void intel_mmio_flip_work_func(struct work_struct *work)
if (mmio_flip->req)
WARN_ON(__i915_wait_request(mmio_flip->req,
crtc->reset_counter,
- false, NULL, NULL) != 0);
+ false, NULL, NULL, false) != 0);
intel_do_mmio_flip(crtc);
if (mmio_flip->req) {
--
1.9.1
* [RFCv2 11/12] drm/i915: Fix __i915_wait_request() behaviour during hang detection.
2015-07-21 13:58 [RFCv2 00/12] TDR/watchdog timeout support for gen8 Tomas Elf
` (9 preceding siblings ...)
2015-07-21 13:58 ` [RFCv2 10/12] drm/i915: Port of Added scheduler support to __wait_request() calls Tomas Elf
@ 2015-07-21 13:58 ` Tomas Elf
2015-07-21 13:58 ` [RFCv2 12/12] drm/i915: Extended error state with TDR count, watchdog count and engine reset count Tomas Elf
2015-07-21 14:51 ` [RFCv2 00/12] TDR/watchdog timeout support for gen8 (edit: fixed coverletter) Tomas Elf
12 siblings, 0 replies; 15+ messages in thread
From: Tomas Elf @ 2015-07-21 13:58 UTC (permalink / raw)
To: Intel-GFX
Use is_locked parameter in __i915_wait_request() to determine if a thread
should be forced to back off and retry or if it can continue sleeping. Don't
return -EIO from __i915_wait_request() since the upper layers cannot handle
it; return only -EAGAIN to signify a reset in progress.
Also, use is_locked in trace_i915_gem_request_wait_begin() trace point for more
accurate reflection of current thread's lock state.
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
drivers/gpu/drm/i915/i915_gem.c | 66 ++++++++++++++++++++++++++++++---------
drivers/gpu/drm/i915/i915_trace.h | 15 +++------
2 files changed, 56 insertions(+), 25 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 340418b..3339365 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1245,7 +1245,6 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
unsigned long timeout_expire;
s64 before, now;
int ret;
- int reset_in_progress = 0;
WARN(!intel_irqs_enabled(dev_priv), "IRQs disabled");
@@ -1262,28 +1261,67 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
return -ENODEV;
/* Record current time in case interrupted by signal, or wedged */
- trace_i915_gem_request_wait_begin(req);
+ trace_i915_gem_request_wait_begin(req, is_locked);
before = ktime_get_raw_ns();
for (;;) {
struct timer_list timer;
+ bool full_gpu_reset_completed_unlocked = false;
+ bool reset_in_progress_locked = false;
prepare_to_wait(&ring->irq_queue, &wait,
interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE);
- /* We need to check whether any gpu reset happened in between
- * the caller grabbing the seqno and now ... */
- reset_in_progress =
- i915_gem_check_wedge(ring->dev->dev_private, NULL, interruptible);
+ /*
+ * Rules for waiting with/without struct_mutex held in the face
+ * of asynchronous TDR hang detection/recovery:
+ *
+ * ("reset in progress" = TDR has detected a hang, hang
+ * recovery may or may not have commenced)
+ *
+ * 1. Is this thread holding the struct_mutex and is any of the
+ * following true?
+ *
+ * a) Is any kind of reset in progress?
+ * b) Has a full GPU reset happened while this thread was
+ * sleeping?
+ *
+ * If so:
+ * Return -EAGAIN. The caller should interpret this as:
+ * Release struct_mutex, try to acquire struct_mutex
+ * (through i915_mutex_lock_interruptible(), which will
+ * fail as long as any reset is in progress) and retry the
+ * call to __i915_wait_request(), which hopefully will go
+ * better as soon as the hang has been resolved.
+ *
+ * 2. Is this thread not holding the struct_mutex and has a
+ * full GPU reset completed?
+ *
+ * If so:
+ * Return 0. Since the request has been purged there are no
+ * requests left to wait for. Just go home.
+ *
+ * 3. Is this thread not holding the struct_mutex and is any
+ * kind of reset in progress?
+ *
+ * If so:
+ * This thread may keep on waiting.
+ */
- if ((reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) ||
- reset_in_progress) {
+ reset_in_progress_locked =
+ (((i915_gem_check_wedge(ring->dev->dev_private, NULL, interruptible)) ||
+ (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter))) &&
+ is_locked);
- /* ... but upgrade the -EAGAIN to an -EIO if the gpu
- * is truely gone. */
- if (reset_in_progress)
- ret = reset_in_progress;
- else
- ret = -EAGAIN;
+ if (reset_in_progress_locked) {
+ ret = -EAGAIN;
+ break;
+ }
+
+ full_gpu_reset_completed_unlocked =
+ (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter));
+
+ if (full_gpu_reset_completed_unlocked) {
+ ret = 0;
break;
}
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index b970259..fe0524e 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -556,8 +556,8 @@ DEFINE_EVENT(i915_gem_request, i915_gem_request_complete,
);
TRACE_EVENT(i915_gem_request_wait_begin,
- TP_PROTO(struct drm_i915_gem_request *req),
- TP_ARGS(req),
+ TP_PROTO(struct drm_i915_gem_request *req, bool blocking),
+ TP_ARGS(req, blocking),
TP_STRUCT__entry(
__field(u32, dev)
@@ -566,25 +566,18 @@ TRACE_EVENT(i915_gem_request_wait_begin,
__field(bool, blocking)
),
- /* NB: the blocking information is racy since mutex_is_locked
- * doesn't check that the current thread holds the lock. The only
- * other option would be to pass the boolean information of whether
- * or not the class was blocking down through the stack which is
- * less desirable.
- */
TP_fast_assign(
struct intel_engine_cs *ring =
i915_gem_request_get_ring(req);
__entry->dev = ring->dev->primary->index;
__entry->ring = ring->id;
__entry->seqno = i915_gem_request_get_seqno(req);
- __entry->blocking =
- mutex_is_locked(&ring->dev->struct_mutex);
+ __entry->blocking = blocking;
),
TP_printk("dev=%u, ring=%u, seqno=%u, blocking=%s",
__entry->dev, __entry->ring,
- __entry->seqno, __entry->blocking ? "yes (NB)" : "no")
+ __entry->seqno, __entry->blocking ? "yes" : "no")
);
DEFINE_EVENT(i915_gem_request, i915_gem_request_wait_end,
--
1.9.1
* [RFCv2 12/12] drm/i915: Extended error state with TDR count, watchdog count and engine reset count
2015-07-21 13:58 [RFCv2 00/12] TDR/watchdog timeout support for gen8 Tomas Elf
` (10 preceding siblings ...)
2015-07-21 13:58 ` [RFCv2 11/12] drm/i915: Fix __i915_wait_request() behaviour during hang detection Tomas Elf
@ 2015-07-21 13:58 ` Tomas Elf
2015-07-21 14:51 ` [RFCv2 00/12] TDR/watchdog timeout support for gen8 (edit: fixed coverletter) Tomas Elf
12 siblings, 0 replies; 15+ messages in thread
From: Tomas Elf @ 2015-07-21 13:58 UTC (permalink / raw)
To: Intel-GFX
These new TDR-specific metrics have previously been added to
i915_hangcheck_info() in debugfs. During design review Chris Wilson asked for
these metrics to be added to the error state as well.
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
drivers/gpu/drm/i915/i915_drv.h | 3 +++
drivers/gpu/drm/i915/i915_gpu_error.c | 6 ++++++
2 files changed, 9 insertions(+)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 2cf8b38..862e995 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -447,6 +447,9 @@ struct drm_i915_error_state {
int hangcheck_score;
enum intel_ring_hangcheck_action hangcheck_action;
int num_requests;
+ int hangcheck_tdr_count;
+ int hangcheck_watchdog_count;
+ int hangcheck_reset_count;
/* our own tracking of ring head and tail */
u32 cpu_ring_head;
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index cee1825..f4d89df 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -303,6 +303,9 @@ static void i915_ring_error_state(struct drm_i915_error_state_buf *m,
err_printf(m, " hangcheck: %s [%d]\n",
hangcheck_action_to_str(ring->hangcheck_action),
ring->hangcheck_score);
+ err_printf(m, " TDR count: %d\n", ring->hangcheck_tdr_count);
+ err_printf(m, " Watchdog count: %d\n", ring->hangcheck_watchdog_count);
+ err_printf(m, " Engine reset count: %d\n", ring->hangcheck_reset_count);
}
void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...)
@@ -920,6 +923,9 @@ static void i915_record_ring_state(struct drm_device *dev,
ering->hangcheck_score = ring->hangcheck.score;
ering->hangcheck_action = ring->hangcheck.action;
+ ering->hangcheck_tdr_count = ring->hangcheck.tdr_count;
+ ering->hangcheck_watchdog_count = ring->hangcheck.watchdog_count;
+ ering->hangcheck_reset_count = ring->hangcheck.reset_count;
if (USES_PPGTT(dev)) {
int i;
--
1.9.1
* [RFCv2 00/12] TDR/watchdog timeout support for gen8 (edit: fixed coverletter)
2015-07-21 13:58 [RFCv2 00/12] TDR/watchdog timeout support for gen8 Tomas Elf
` (11 preceding siblings ...)
2015-07-21 13:58 ` [RFCv2 12/12] drm/i915: Extended error state with TDR count, watchdog count and engine reset count Tomas Elf
@ 2015-07-21 14:51 ` Tomas Elf
2015-07-24 11:13 ` Tomas Elf
12 siblings, 1 reply; 15+ messages in thread
From: Tomas Elf @ 2015-07-21 14:51 UTC (permalink / raw)
To: Intel-GFX
This patch series introduces the following features:
* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.
TDR is an umbrella term for anything that goes into detecting and recovering from GPU hangs and is a term more widely used outside of the upstream driver.
This feature introduces an extensible framework that currently supports gen8 but that can be easily extended to support gen7 as well (which is already the case in GMIN but unfortunately in a not quite upstreamable form). The code contained in this submission represents the essentials of what is currently in GMIN merged with what is currently in upstream (as of the time when this work commenced a few months back).
This feature adds a new hang recovery path alongside the legacy GPU reset path, which takes care of engine recovery only. Aside from adding support for per-engine recovery this feature also introduces rules for how to promote a potential per-engine reset to a legacy, full GPU reset.
The hang checker now integrates with the error handler in a slightly different way: it allows hang recovery on multiple engines at the same time by passing the error handler an engine flag mask in which the flags of all hung engines are set. This allows us to schedule hang recovery once for all currently hung engines instead of once per detected engine hang. Previously, when only full GPU reset was supported, this made no difference: whether one or four engines were hung at any given point, the outcome was the same - the GPU got reset. As it stands now the behaviour depends on which engines are hung, since each engine is reset separately from the others, so we have to think about this in terms of scheduling cost and recovery latency. (see open question below)
OPEN QUESTIONS:
1. Do we want to investigate the possibility of per-engine hang
detection? In the current upstream driver there is only one work queue
that handles the hang checker and everything from initial hang
detection to final hang recovery runs in this thread. This makes sense
if you're only supporting one form of hang recovery - using full GPU
reset and nothing tied to any particular engine. However, as part
of this patch series we're changing that by introducing per-engine
hang recovery. It could make sense to introduce multiple work
queues - one per engine - to run multiple hang checking threads in
parallel.
This would potentially allow savings in terms of recovery latency since
we don't have to scan all engines every time the hang checker is
scheduled and the error handler does not have to scan all engines every
time it is scheduled. Instead, we could implement one work queue per
engine that would invoke the hang checker that only checks _that_
particular engine and then the error handler is invoked for _that_
particular engine. If one engine has hung the latency for getting to
the hang recovery path for that particular engine would be (Time For
Hang Checking One Engine) + (Time For Error Handling One Engine) rather
than the time it takes to do hang checking for all engines + the time
it takes to do error handling for all engines that have been detected
as hung (which in the worst case would be all engines). There would
potentially be as many hang checker and error handling threads going on
concurrently as there are engines in the hardware but they would all be
running in parallel without any significant locking. The first time
where any thread needs exclusive access to the driver is at the point
of the actual hang recovery but the time it takes to get there would
theoretically be lower and the time it actually takes to do per-engine
hang recovery is quite a lot lower than the time it takes to actually
detect a hang reliably.
How much we would save by such a change still needs to be analysed and
compared against the current single-thread model but it makes sense
from a theoretical design point of view.
* Feature 2: Watchdog Timeout (a.k.a "media engine reset") for gen8.
This feature allows userland applications to control whether or not individual batch buffers should have a first-level, fine-grained, hardware-based hang detection mechanism on top of the ordinary, software-based periodic hang checker that is already in the driver. The advantage over relying solely on the current software-based hang checker is that the watchdog timeout mechanism is about 1000x quicker and more precise. Since it's not a full driver-level hang detection mechanism but only targeting one individual batch buffer at a time it can afford to be that quick without risking an increase in false positive hang detection.
This feature includes the following changes:
a) Watchdog timeout interrupt service routine for handling watchdog interrupts and connecting these to per-engine hang recovery.
b) Injection of watchdog timer enablement/cancellation instructions before/after the batch buffer start instruction in the ring buffer so that watchdog timeout is connected to the submission of an individual batch buffer.
c) Extension of the DRM batch buffer interface, exposing the watchdog timeout feature to userland. We've got two open source groups in VPG currently in the process of integrating support for this feature, which should make it possible in principle to upstream this extension.
There is currently full watchdog timeout support for gen7 in GMIN and it is quite similar to the gen8 implementation so there is nothing obvious that prevents us from upstreaming that code along with the gen8 code. However, watchdog timeout is fully dependent on the per-engine hang recovery path and that is not part of this code submission for gen7. Therefore watchdog timeout support for gen7 has been excluded until per-engine hang recovery support for
gen7 has landed upstream.
As part of this submission we've had to reinstate the work queue that was previously in place between the error handler and the hang recovery path. The reason for this is that the per-engine recovery path is called directly from the interrupt handler in the case of watchdog timeout. In that situation there's no way of grabbing the struct_mutex, which is a requirement for the hang recovery path. Therefore, by reinstating the work queue we provide a unified execution context for the hang recovery code that allows the hang recovery code to grab whatever locks it needs without sacrificing interrupt latency too much or sleeping indefinitely in hard interrupt context.
* Feature 3. Context Submission Status Consistency checking
Something that becomes apparent when you run long-duration operations tests with concurrent rendering processes with intermittently injected hangs is that it seems like the GPU forgets to send context completion interrupts to the driver under some circumstances. What this means is that the driver sometimes gets stuck on a context that never seems to finish, all the while the hardware has completed and is waiting for more work.
The problem with this is that the per-engine hang recovery path relies on context resubmission to kick off the hardware again following an engine reset.
This can only be done safely if the hardware and driver share the same opinion about the current state. Therefore we've extended the periodic hang checker to check for context submission state inconsistencies aside from the hang checking it already does.
If such a state is detected it is assumed (based on experience) that a context completion interrupt has been lost somehow. If this state persists for some time an attempt to correct it is made by faking the presumably lost context completion interrupt by manually calling the execlist interrupt handler, which is normally called from the main interrupt handler cued by a received context event interrupt. Just because an interrupt goes missing does not mean that the context status buffer (CSB) does not get appropriately updated by the hardware, which means that we can expect to find all the recent changes to the context states for each engine captured there.
If there are outstanding context status changes stored there, the faked context event interrupt allows the interrupt handler to act on them. In the case of a lost context completion interrupt this prompts the driver to remove the already completed context from the execlist queue and move on to the next pending piece of work, thereby eliminating the inconsistency.
* Feature 4. Debugfs extensions for per-engine hang recovery and TDR/watchdog trace points.
GITHUB REPOSITORY:
https://github.com/telf/TDR_watchdog_RFC_1.git
RFC VERSION 1 BRANCH:
20150608_TDR_upstream_adaptation_single-thread_hangchecking_RFC_delivery_sendmail_1
RFC VERSION 2 BRANCH:
20150604_TDR_upstream_adaptation_single-thread_hangchecking_RFCv2_delivery_sendmail_1
CHANGES IN VERSION 2
--------------------
Version 2 of this RFC series addresses design concerns that Chris Wilson, Daniel Vetter et al. had with the first version. Below is a summary of all the changes made between versions:
* [1/12] drm/i915: Early exit from semaphore_waits_for for execlist mode:
Turned the execlist mode check into a ringbuffer NULL check to make it more submission mode agnostic and less of a layering violation.
* [2/12] drm/i915: Make i915_gem_reset_ring_status() public
Replaces the old patch in RFCv1: "drm/i915: Add reset stats entry point for per-engine reset"
* [3/12] drm/i915: Adding TDR / per-engine reset support for gen8:
1. Simply use the previously private function
i915_gem_reset_ring_status() from the engine hang recovery path to set
active/pending context status. This replicates the same behaviour as in
full GPU reset but for a single, targeted engine.
2. Remove all additional uevents for both full GPU reset and per-engine
reset. Adapted uevent behaviour to the new per-engine hang recovery
mode in that it will only send one uevent regardless of which form of
recovery is employed. If a per-engine reset is attempted first then
one uevent will be dispatched. If that recovery mode fails and the
hang is promoted to a full GPU reset no further uevents will be
dispatched at that point.
3. Removed the 2*HUNG hang threshold from i915_hangcheck_elapsed in
order to not make the hang detection algorithm too complicated. This
threshold was introduced to compensate for the possibility that hang
recovery might be delayed due to inconsistent context submission status
that would prevent per-engine hang recovery from happening. In a later
patch we introduce faked context event interrupts and inconsistency
rectification at the onset of per-engine hang recovery instead of
relying on the hang checker to do this for us. Therefore, since we no
longer delay and defer to future hang detections, hangs never go
unaddressed beyond the HUNG threshold and there is no need for any
further thresholds.
4. Tidied up the TDR context resubmission path in intel_lrc.c. Reduced
the amount of duplication by relying entirely on the normal unqueue
function. Added a new parameter to the unqueue function that indicates
whether the call is a first-time context submission or a resubmission
and adapts the handling of elsp_submitted accordingly. The reason for
this is that for context resubmission we
don't expect any further interrupts for the submission or the following
context completion. A more elegant way of handling this would be to
phase out elsp_submitted altogether, however that's part of a
LRC/execlist cleanup effort that is happening independently of this
RFC. For now we make this change as simple as possible with as few
non-TDR-related side-effects as possible.
* [4/12] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress:
Removed unwarranted changes made to i915_gem_ring_throttle()
* [7/12] drm/i915: Fake lost context interrupts through forced CSB check
Remove context submission status consistency pre-check from
i915_hangcheck_elapsed() and turn it into a pre-check to per-engine
reset.
The following describes the change of philosophy in how context
submission state inconsistencies are detected:
Previously we would let the periodic hang checker ensure that there
were no context submission status inconsistencies on any engine, at any
point. If an inconsistency was detected in the per-engine hang recovery
path we would back off and defer to the next hang check since
per-engine hang recovery is not effective during inconsistent context
submission states.
What we do in this new version is to move the consistency pre-check
from the hang checker to the earliest point in the per-engine hang
recovery path. If we detect an inconsistency at that point we fake a
potentially lost context event interrupt by forcing a CSB check. If
there are outstanding events in the CSB these will be acted upon and
hopefully that will bring the driver up to speed with the hardware. If
the CSB check did not amount to anything it is concluded that the
inconsistency is unresolvable and the per-engine hang recovery fails
and promotes to full GPU reset instead.
In the hang checker-based consistency checking we would observe the
inconsistency a number of times to make sure the detected state was
stable before attempting to rectify the situation. This is possible
since hang checking is a recurring event. Having moved the consistency
checking to the recovery path instead (i.e. a one-time, fire &
forget-style event) it is assumed that the hang detection that brought
on the hang recovery has detected a stable hang and therefore, if an
inconsistency is detected at that point, the inconsistency must be
stable and not the result of a momentary context state transition.
Therefore, unlike in the hang checker case, at the very first
indication of an inconsistent context submission status the interrupt
is faked speculatively. If outstanding CSB events are found it is
determined that the hang was in fact just a context submission status
inconsistency and no hang recovery is done. If the inconsistency cannot
be resolved the per-engine hang recovery is failed and the hang is
promoted to full GPU reset instead.
* [8/12] drm/i915: Debugfs interface for per-engine hang recovery
1. After review comments by Chris Wilson we're dropping the dual-mode parameter
value interpretation in i915_wedged_set(). In this version we only accept
engine id flag masks that contain the engine id flags of all currently hung
engines. Full GPU reset is most easily requested by passing an all zero
engine id flag mask.
2. Moved TDR-specific engine metrics like number of detected engine hangs and
number of per-engine resets into i915_hangcheck_info() from
i915_hangcheck_read().
* [9/12] drm/i915: TDR/watchdog trace points
As a consequence of the faked context event interrupt commit being moved from
the hang checker to the per-engine recovery path we no longer check context
submission status from the hang checker. Therefore there is no need to provide
submission status of the currently running context to the
trace_i915_tdr_hang_check() event.
* [10/12] drm/i915: Fix __i915_wait_request() behaviour during hang detection
* [11/12] drm/i915: Port of Added scheduler support to __wait_request() calls
NEW: Added to address the way that __i915_wait_request()
behaves in the face of hang detections and hang recovery.
* [12/12] drm/i915: Extended error state with TDR count, watchdog count and engine reset count
NEW: Adds per-engine TDR statistics to captured error state.
Tomas Elf (12):
drm/i915: Early exit from semaphore_waits_for for execlist mode.
drm/i915: Make i915_gem_reset_ring_status() public
drm/i915: Adding TDR / per-engine reset support for gen8.
drm/i915: Extending i915_gem_check_wedge to check engine reset in
progress
drm/i915: Reinstate hang recovery work queue.
drm/i915: Watchdog timeout support for gen8.
drm/i915: Fake lost context interrupts through forced CSB check.
drm/i915: Debugfs interface for per-engine hang recovery.
drm/i915: TDR/watchdog trace points.
drm/i915: Port of Added scheduler support to __wait_request() calls
drm/i915: Fix __i915_wait_request() behaviour during hang detection.
drm/i915: Extended error state with TDR count, watchdog count and
engine reset count
drivers/gpu/drm/i915/i915_debugfs.c | 76 +++-
drivers/gpu/drm/i915/i915_dma.c | 78 ++++
drivers/gpu/drm/i915/i915_drv.c | 257 +++++++++++
drivers/gpu/drm/i915/i915_drv.h | 79 +++-
drivers/gpu/drm/i915/i915_gem.c | 128 ++++--
drivers/gpu/drm/i915/i915_gpu_error.c | 8 +-
drivers/gpu/drm/i915/i915_irq.c | 292 +++++++++++--
drivers/gpu/drm/i915/i915_params.c | 10 +
drivers/gpu/drm/i915/i915_reg.h | 13 +
drivers/gpu/drm/i915/i915_trace.h | 308 +++++++++++++-
drivers/gpu/drm/i915/intel_display.c | 2 +-
drivers/gpu/drm/i915/intel_lrc.c | 729 ++++++++++++++++++++++++++++++--
drivers/gpu/drm/i915/intel_lrc.h | 16 +-
drivers/gpu/drm/i915/intel_lrc_tdr.h | 39 ++
drivers/gpu/drm/i915/intel_ringbuffer.c | 87 +++-
drivers/gpu/drm/i915/intel_ringbuffer.h | 95 +++++
drivers/gpu/drm/i915/intel_uncore.c | 203 +++++++++
include/uapi/drm/i915_drm.h | 5 +-
18 files changed, 2313 insertions(+), 112 deletions(-)
create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h
--
1.9.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
* Re: [RFCv2 00/12] TDR/watchdog timeout support for gen8 (edit: fixed coverletter)
2015-07-21 14:51 ` [RFCv2 00/12] TDR/watchdog timeout support for gen8 (edit: fixed coverletter) Tomas Elf
@ 2015-07-24 11:13 ` Tomas Elf
0 siblings, 0 replies; 15+ messages in thread
From: Tomas Elf @ 2015-07-24 11:13 UTC (permalink / raw)
To: Intel-GFX@Lists.FreeDesktop.Org
On 21/07/2015 15:51, Tomas Elf wrote:
> [full cover letter quoted in the parent message snipped]
Summarising follow-up on IRC:
Chris Wilson is ok with RFCv2 from a design point-of-view. We're moving
on to the detailed code review which will commence once I get back from
my holiday in three weeks and once I've prepared a properly split up
patch series that is suitable for code reviewing.
Thanks,
Tomas
Thread overview: 15+ messages
2015-07-21 13:58 [RFCv2 00/12] TDR/watchdog timeout support for gen8 Tomas Elf
2015-07-21 13:58 ` [RFCv2 01/12] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
2015-07-21 13:58 ` [RFCv2 02/12] drm/i915: Make i915_gem_reset_ring_status() public Tomas Elf
2015-07-21 13:58 ` [RFCv2 03/12] drm/i915: Adding TDR / per-engine reset support for gen8 Tomas Elf
2015-07-21 13:58 ` [RFCv2 04/12] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress Tomas Elf
2015-07-21 13:58 ` [RFCv2 05/12] drm/i915: Reinstate hang recovery work queue Tomas Elf
2015-07-21 13:58 ` [RFCv2 06/12] drm/i915: Watchdog timeout support for gen8 Tomas Elf
2015-07-21 13:58 ` [RFCv2 07/12] drm/i915: Fake lost context interrupts through forced CSB check Tomas Elf
2015-07-21 13:58 ` [RFCv2 08/12] drm/i915: Debugfs interface for per-engine hang recovery Tomas Elf
2015-07-21 13:58 ` [RFCv2 09/12] drm/i915: TDR/watchdog trace points Tomas Elf
2015-07-21 13:58 ` [RFCv2 10/12] drm/i915: Port of Added scheduler support to __wait_request() calls Tomas Elf
2015-07-21 13:58 ` [RFCv2 11/12] drm/i915: Fix __i915_wait_request() behaviour during hang detection Tomas Elf
2015-07-21 13:58 ` [RFCv2 12/12] drm/i915: Extended error state with TDR count, watchdog count and engine reset count Tomas Elf
2015-07-21 14:51 ` [RFCv2 00/12] TDR/watchdog timeout support for gen8 (edit: fixed coverletter) Tomas Elf
2015-07-24 11:13 ` Tomas Elf