Linux IOMMU Development
From: Alex Williamson <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Jiang Liu <jiang.liu-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Subject: Re: Hang (due to HW?) in qi_submit_sync()
Date: Thu, 08 Jan 2015 16:39:44 -0700
Message-ID: <1420760384.25367.101.camel@redhat.com>
In-Reply-To: <1420756610-20918-1-git-send-email-roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

On Thu, 2015-01-08 at 14:36 -0800, Roland Dreier wrote:
> Hi,
> 
> So we've managed to find one magic setup and workload that reproduces
> this reliably, and I've done a bit more debugging that leads me to
> believe we probably need Intel's help to really get to the bottom of
> this.

Can you share your test case?  My test case was also using vfio,
repeatedly booting a pair of VMs with assigned GPUs, but I haven't seen
any sign of the issue without NET_DMA.  I agree that it seems like the
hardware is getting somewhere that it shouldn't regardless of the
software, but are you able to reproduce on a more recent kernel?

I've also seen a report of the same problem without vfio by manipulating
IRQ affinity, but it was not a reliable test case.

> One question though: you mentioned that you saw this behavior until
> you turned off CONFIG_NET_DMA.  What platform was that on?  We see
> this on dual socket Xeon E5 v3 (Haswell EP / Grantley), and I don't
> really have any other setup I can try.  Did you see this on other
> platforms (Ivy Bridge / Romley maybe)?

I saw it on a dual socket Xeon E5 v2 (Ivy Bridge EP / Patsburg).  The
other report I mentioned above was on the same platform, though on
different systems.

> Anyway, I added the debugging patch at the end of this mail to our
> kernel to dump some status when the driver detects a hung queue.
> Below we see some example output.  Things I notice:
> 
>  - IQH == IQT, in other words the QI hardware thinks it has picked up
>    and validated all the descriptors submitted by software.
>  - The queue has successfully processed many operations (although
>    everything that we can see succeeded is type 2h, i.e. "IOTLB
>    invalidate", as opposed to the type 4h "Interrupt Entry Cache
>    invalidate" that we're stuck on).
>  - There haven't been any faults and FSTS is clear.
>  - The hardware hasn't executed the "Invalidation wait" descriptor.

This is all consistent with my observations as well.
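
For anyone following along, the checks in the list above can be sketched
roughly as follows.  This is a standalone user-space illustration under
stated assumptions, not the driver code: the struct and function names
(qi_snapshot, qi_looks_wedged, qi_desc_type, qi_index) are hypothetical,
the 2h and 4h type codes are the ones quoted above, the 5h code for the
invalidation wait descriptor and the 4-bit index shift (16-byte
descriptors) are my reading of the VT-d spec, and "FSTS is clear" is
simplified to "the raw register value is zero".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Descriptor type lives in the low 4 bits of the first qword. */
#define QI_TYPE_MASK   0xfULL
#define QI_TYPE_IOTLB  0x2ULL  /* "IOTLB invalidate" (from the thread) */
#define QI_TYPE_IEC    0x4ULL  /* "Interrupt Entry Cache invalidate" */
#define QI_TYPE_IWD    0x5ULL  /* "Invalidation wait" (assumed value) */

/* Assumption: 16-byte descriptors, so index = register value >> 4. */
#define QI_IDX_SHIFT   4

/* Hypothetical snapshot of the state dumped when a hang is detected. */
struct qi_snapshot {
	uint64_t iqh;       /* raw invalidation queue head register */
	uint64_t iqt;       /* raw invalidation queue tail register */
	uint32_t fsts;      /* raw fault status register */
	bool     wait_done; /* has hw written the wait descriptor status? */
};

/* Extract the descriptor type from the low qword of a descriptor. */
static inline unsigned int qi_desc_type(uint64_t desc_low)
{
	return (unsigned int)(desc_low & QI_TYPE_MASK);
}

/* Convert a raw head/tail register value to a descriptor index. */
static inline uint64_t qi_index(uint64_t reg)
{
	return reg >> QI_IDX_SHIFT;
}

/*
 * True when the snapshot matches the pattern described above: hardware
 * has fetched everything (head == tail), no fault is reported, and yet
 * the invalidation wait descriptor has not completed.
 */
static bool qi_looks_wedged(const struct qi_snapshot *s)
{
	bool drained  = qi_index(s->iqh) == qi_index(s->iqt);
	bool no_fault = s->fsts == 0;

	return drained && no_fault && !s->wait_done;
}
```

A snapshot with iqh == iqt, fsts == 0 and wait_done == false would be
flagged by qi_looks_wedged(); flipping wait_done to true would not.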

> Beyond that I'm not sure how to make much more progress without
> insight into the hardware -- it looks like the driver is doing
> everything right.  Anyone from Intel have any thoughts?

Agreed, the hardware appears to be getting wedged without any sort of
fault indication or recovery mechanism.  Thanks,

Alex

Thread overview: 6+ messages
2015-01-06  0:57 Hang (due to HW?) in qi_submit_sync() Roland Dreier
2015-01-06  2:54   ` Alex Williamson
2015-01-06  4:39       ` Roland Dreier
2015-01-06  6:48           ` Alex Williamson
2015-01-08 22:36               ` Roland Dreier
2015-01-08 23:39                   ` Alex Williamson [this message]
