From: Daniel Vetter <daniel@ffwll.ch>
To: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: intel-gfx@lists.freedesktop.org, David Woodhouse <dwmw2@infradead.org>
Subject: Re: [PATCH 6/9] drm/i915: driver based PASID handling
Date: Thu, 8 Oct 2015 10:12:08 +0200 [thread overview]
Message-ID: <20151008081208.GY3383@phenom.ffwll.local> (raw)
In-Reply-To: <561555BC.7070700@virtuousgeek.org>
On Wed, Oct 07, 2015 at 10:26:20AM -0700, Jesse Barnes wrote:
> On 10/07/2015 10:17 AM, David Woodhouse wrote:
> > On Wed, 2015-10-07 at 09:28 -0700, Jesse Barnes wrote:
> >> On 10/07/2015 09:14 AM, Daniel Vetter wrote:
> >>> On Wed, Oct 07, 2015 at 08:16:42AM -0700, Jesse Barnes wrote:
> >>>> On 10/07/2015 06:00 AM, David Woodhouse wrote:
> >>>>> On Fri, 2015-09-04 at 09:59 -0700, Jesse Barnes wrote:
> >>>>>> +
> >>>>>> + ret = handle_mm_fault(mm, vma, address,
> >>>>>> + desc.wr_req ? FAULT_FLAG_WRITE : 0);
> >>>>>> + if (ret & VM_FAULT_ERROR) {
> >>>>>> + gpu_mm_segv(tsk, address, SEGV_ACCERR); /* ? */
> >>>>>> + goto out_unlock;
> >>>>>> + }
> >>>>>> +
> >>>>>
> >>>>> Hm, do you need to force the SEGV there, in what ought to be generic
> >>>>> IOMMU code?
> >>>>>
> >>>>> Can you instead just let the fault handler return an appropriate
> >>>>> failure code to the IOMMU request queue and then deal with the
> >>>>> resulting error on the i915 device side?
> >>>>
> >>>> I'm not sure if we get enough info on the i915 side to handle it
> >>>> reasonably, we'll have to test that out.
> >>>
> >>> We do know precisely which context blew up, but without the TDR work we
> >>> can't yet just kill the offender selective without affecting the other
> >>> active gpu contexts.
> >>
> >> How? The notification from the IOMMU queue is asynchronous...
> >
> > The page request, and the response, include 'private data' which an
> > endpoint can use to carry that kind of information.
> >
> > In $7.5.1.1 of the VT-d specification it tells us:
> >
> > "Private Data: The Private Data field can be used by
> > Root-Complex integrated endpoints to uniquely identify
> > device-specific private information associated with an
> > individual page request.
> >
> > "For Intel ® Processor Graphics device, the Private Data field
> > specifies the identity of the GPU advanced-context (see
> > Section 3.10) sending the page request."
>
> Ah right so we could put our private context ID in there if the PASID
> doesn't end up being per-context. That would work fine (though as
> Daniel said we still need fancier reset to handle things more
> gracefully).
I'd hope we can be even more lazy: If we complete the page request with a
failure then hopefully the gpu read/write transaction never completes in
the EU/ff-unit which means it'll be stuck forever and our hangcheck will
get around to clean up the mess. That should work with 0 code changes (but
needs a testcase ofc).
Later on we can get fancy and try to immediate preempt the ctx and kill it
if it faults. But there's a bit of infrastructure missing for that, and I
think it'd be better to not stall svm on that.
> >>> But besides that I really don't see a reason why we need to kill the
> >>> process if the gpu faults. After all if a thread sigfaults then signal
> >>> goes to that thread and not some random one (or the one thread that forked
> >>> the thread that blew up). And we do have interfaces to tell userspace that
> >>> something bad happened with the gpu work it submitted.
> >
> > I certainly don't want the core IOMMU code killing things. I really
> > want to just complete the page request with an appropriate failure
> > code, and let the endpoint device deal with it from there.
>
> I think that will work, but I want to test and make sure. In the driver
> mode version I took advantage of the fact that I got an unambiguous page
> request failure from the IOMMU along with a unique PASID to send the
> signal. Getting it on the GPU side means looking at some of our
> existing error state bits, which is something I've been avoiding...
gem_reset_stats does this for your already, including test coverage and
all. What might be missing is getting reset events from a pollable fd,
since the current needs explicit polling. It works that way since that's
all arb_robustness wants.
And with an fd we can always use the generic SIG_IO stuff for userspace
that wants a signal, but by default I don't think we should use signals at
all for gpu page faults: The current SIG_SEGV can't be used (since it's
clearly for thread-local faults only), same for SIG_BUS, SIG_IO is for fd
polling and there's nothing else available.
But even the pollable fd is probably best done in a follow-up series, if
we even have a need for it.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
next prev parent reply other threads:[~2015-10-08 8:09 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-09-04 16:58 [RFC] Page table sharing and bufferless execbuf Jesse Barnes
2015-09-04 16:58 ` [PATCH 1/9] mm: move mmu_find_ops to mmu_notifier.c Jesse Barnes
2015-09-04 16:58 ` [PATCH 2/9] signal: export force_sig_info Jesse Barnes
2015-09-04 16:58 ` [PATCH 3/9] android/sync: add sync_fence_create_dma Jesse Barnes
2015-09-04 16:58 ` [PATCH 4/9] android/sync: hack: enable fence signaling in Android Native Sync implementation Jesse Barnes
2015-09-04 16:58 ` [PATCH 5/9] drm/i915: add create_context2 ioctl Jesse Barnes
2015-09-04 16:59 ` [PATCH 6/9] drm/i915: driver based PASID handling Jesse Barnes
2015-10-07 13:00 ` David Woodhouse
2015-10-07 15:16 ` Jesse Barnes
2015-10-07 16:14 ` Daniel Vetter
2015-10-07 16:28 ` Jesse Barnes
2015-10-07 17:17 ` David Woodhouse
2015-10-07 17:26 ` Jesse Barnes
2015-10-08 8:12 ` Daniel Vetter [this message]
2015-10-08 10:28 ` Tomas Elf
2015-10-08 11:29 ` Tomas Elf
2015-10-08 22:46 ` David Woodhouse
2015-10-09 7:28 ` Daniel Vetter
2015-10-09 7:52 ` Daniel Vetter
2015-10-09 7:56 ` David Woodhouse
2015-10-09 8:47 ` Daniel Vetter
2015-10-09 9:07 ` David Woodhouse
2015-10-09 16:20 ` Jesse Barnes
2015-10-08 15:57 ` Chris Wilson
2015-10-09 7:24 ` David Woodhouse
2015-09-04 16:59 ` [PATCH 7/9] drm/i915: add fences to the request struct Jesse Barnes
2015-10-09 13:29 ` David Woodhouse
2015-10-09 16:11 ` Jesse Barnes
2015-09-04 16:59 ` [PATCH 8/9] drm/i915: Android sync points for i915 v4 (obsolete) Jesse Barnes
2015-09-04 16:59 ` [PATCH 9/9] drm/i915: add bufferless execbuf ioctl Jesse Barnes
2015-09-04 17:37 ` Chris Wilson
2015-09-04 19:09 ` Jesse Barnes
2015-10-08 10:34 ` David Woodhouse
2015-09-04 17:23 ` [RFC] Page table sharing and bufferless execbuf Chris Wilson
2015-09-04 19:08 ` Jesse Barnes
2015-09-26 14:55 ` David Woodhouse
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20151008081208.GY3383@phenom.ffwll.local \
--to=daniel@ffwll.ch \
--cc=dwmw2@infradead.org \
--cc=intel-gfx@lists.freedesktop.org \
--cc=jbarnes@virtuousgeek.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox