From: Jason Gunthorpe <jgg@nvidia.com>
To: "Borah, Chaitanya Kumar" <chaitanya.kumar.borah@intel.com>
Cc: "intel-gfx@lists.freedesktop.org"
	<intel-gfx@lists.freedesktop.org>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	Lucas De Marchi <lucas.demarchi@intel.com>,
	"Kurmi, Suresh Kumar" <suresh.kumar.kurmi@intel.com>,
	"Saarinen, Jani" <jani.saarinen@intel.com>,
	matthew.auld@intel.com, baolu.lu@linux.intel.com,
	iommu@lists.linux.dev
Subject: Re: REGRESSION on linux-next (next-20251106)
Date: Mon, 17 Nov 2025 21:29:44 -0400	[thread overview]
Message-ID: <20251118012944.GA60885@nvidia.com> (raw)
In-Reply-To: <4f15cf3b-6fad-4cd8-87e5-6d86c0082673@intel.com>

On Mon, Nov 10, 2025 at 12:06:30PM +0530, Borah, Chaitanya Kumar wrote:
> Hello Jason,
> 
> Hope you are doing well. I am Chaitanya from the linux graphics team in
> Intel.
> 
> This mail is regarding a regression we are seeing in our CI runs[1] on
> linux-next repository.
> 
> Since the version next-20251106 [2], we are seeing our tests timing out
> presumably caused by a GPU Hang.
> 
> `````````````````````````````````````````````````````````````````````````````````
> <6> [490.872058] i915 0000:00:02.0: [drm] Got hung context on vcs0 with
> active request 939:2 [0x1004] not yet started
> <6> [490.875244] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:baffffff
> <7> [496.424189] i915 0000:00:02.0: [drm:intel_guc_context_reset_process_msg
> [i915]] GT1: GUC: Got context reset notification: 0x1004 on vcs0, exiting =
> no, banned = no
> <6> [496.921551] i915 0000:00:02.0: [drm] Got hung context on vcs0 with
> active request 939:2 [0x1004] not yet started
> <6> [496.924799] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:baffffff
> <4> [499.946641] [IGT] Per-test timeout exceeded. Killing the current test
> with SIGQUIT.
> `````````````````````````````````````````````````````````````````````````````````
> Detailed logs can be found in [3].

Chaitanya, can you check these two debugging patches:

https://github.com/jgunthorpe/linux/commits/for-borah

10635ad3ff26a0 DEBUGGING: Force flush the whole cpu cache for the page table on every map operation
2789602b882499 DEBUGGING: Force flush the whole iotlb on every map operation

Please run a test with each of them applied *individually* and report
back what changes. The "cpu cache" one may oops or something; we are
just looking to see whether it gets past the error Kevin pointed to:

<7>[   67.231149] [IGT] gem_exec_gttfill: starting subtest basic
[..]
<5>[   68.824598] i915 0000:00:02.0: Using 46-bit DMA addresses
<3>[   68.825482] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: CT: Failed to process request 6000 (-EOPNOTSUPP)

I could not test these patches so they may not work at all..

Also, I'd like to know if this is happening 100% reproducibly or if it
is flaky. And just to confirm, this is 68s after boot and right after
starting the first test, so it looks like the test is just not working
at all?

I'm still interested to know if there is an iommu error that is
somehow not getting into the log?

It would also help to collect the trace points:

int iommu_map_nosync(struct iommu_domain *domain, unsigned long iova,
{
[..]
		trace_map(orig_iova, orig_paddr, orig_size);

And

static size_t __iommu_unmap(struct iommu_domain *domain,
			    unsigned long iova, size_t size,
			    struct iommu_iotlb_gather *iotlb_gather)
{
[..]
	trace_unmap(orig_iova, size, unmapped);

As well as some instrumentation for the IOVA involved with the above
error for request 6000.
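
For instance, a blunt pr_info() next to the existing trace_map() call
would be enough to see where the i915 mappings are landing; untested
sketch, the 4G threshold is only because the interesting addresses are
the high ones:

int iommu_map_nosync(struct iommu_domain *domain, unsigned long iova,
{
[..]
		trace_map(orig_iova, orig_paddr, orig_size);
		/* DEBUG sketch: dump any mapping above 4G to dmesg so the
		 * IOVA behind the request 6000 failure shows up */
		if (orig_iova + orig_size > U32_MAX)
			pr_info("iommu high map: iova=%lx paddr=%pa size=%zx\n",
				orig_iova, &orig_paddr, orig_size);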

Finally, it is interesting that this test prints this:

<5>[   68.824598] i915 0000:00:02.0: Using 46-bit DMA addresses

Which comes from here:

        if (dma_limit > DMA_BIT_MASK(32) && dev->iommu->pci_32bit_workaround) {
                iova = alloc_iova_fast(iovad, iova_len,
                                       DMA_BIT_MASK(32) >> shift, false);
                if (iova)
                        goto done;

                dev->iommu->pci_32bit_workaround = false;
                dev_notice(dev, "Using %d-bit DMA addresses\n", bits_per(dma_limit));
        }

Which means dma-iommu has exceeded the 32 bit pool and is allocating
high addresses now? 

It prints that and then immediately fails? Seems like a clue!

Is there a failing map call, perhaps due to the driver setting up the
wrong IOVA range for the table? iommupt is strict about enforcing the
IOVA limitation. A failing map call might produce this outcome (though
I would expect an iommu error in the log).
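
If that is the theory it should be cheap to confirm, e.g. dump the
geometry the domain was created with and check it actually covers the
46-bit IOVAs being handed out. Illustration only, print it wherever is
convenient:

	pr_info("iommu domain aperture: %llx..%llx force_aperture=%d\n",
		(u64)domain->geometry.aperture_start,
		(u64)domain->geometry.aperture_end,
		domain->geometry.force_aperture);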

The map traces only log on success though, so please add a print on
failure too..
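
Something along these lines would do; an untested sketch assuming the
error code ends up in a local 'ret' at the bottom of the function:

int iommu_map_nosync(struct iommu_domain *domain, unsigned long iova,
{
[..]
	/* Sketch only: the existing trace_map() fires only on success,
	 * so also shout when the map fails */
	if (ret)
		pr_err("iommu map failed: iova=%lx size=%zx ret=%d\n",
		       iova, size, ret);
	return ret;
}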

46 bits is not particularly big... Hmm, I wonder if we have some issue
with the sign-extend? iommupt does that properly and IIRC the old code
did not. Which of the page table formats is this using, second stage or
first stage?

Kevin/Baolu any thoughts around the above?

Jason
