From: Robert Beckett <bob.beckett@collabora.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] ✗ Fi.CI.BAT: failure for drm/i915: ttm for stolen (rev5)
Date: Tue, 28 Jun 2022 17:22:05 +0100 [thread overview]
Message-ID: <b20da287-e7f4-ebad-a534-a129b57eeede@collabora.com> (raw)
In-Reply-To: <6d608010-a9b1-8182-a967-6ea435fd20a4@linux.intel.com>
On 28/06/2022 09:46, Tvrtko Ursulin wrote:
>
> On 27/06/2022 18:08, Robert Beckett wrote:
>>
>>
>> On 22/06/2022 10:05, Tvrtko Ursulin wrote:
>>>
>>> On 21/06/2022 20:11, Robert Beckett wrote:
>>>>
>>>>
>>>> On 21/06/2022 18:37, Patchwork wrote:
>>>>> *Patch Details*
>>>>> *Series:* drm/i915: ttm for stolen (rev5)
>>>>> *URL:* https://patchwork.freedesktop.org/series/101396/
>>>>> <https://patchwork.freedesktop.org/series/101396/>
>>>>> *State:* failure
>>>>> *Details:*
>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html>
>>>>>
>>>>>
>>>>>
>>>>> CI Bug Log - changes from CI_DRM_11790 -> Patchwork_101396v5
>>>>>
>>>>>
>>>>> Summary
>>>>>
>>>>> *FAILURE*
>>>>>
>>>>> Serious unknown changes coming with Patchwork_101396v5 absolutely
>>>>> need to be
>>>>> verified manually.
>>>>>
>>>>> If you think the reported changes have nothing to do with the changes
>>>>> introduced in Patchwork_101396v5, please notify your bug team to
>>>>> allow them
>>>>> to document this new failure mode, which will reduce false
>>>>> positives in CI.
>>>>>
>>>>> External URL:
>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
>>>>>
>>>>>
>>>>> Participating hosts (40 -> 41)
>>>>>
>>>>> Additional (2): fi-icl-u2 bat-dg2-9
>>>>> Missing (1): fi-bdw-samus
>>>>>
>>>>>
>>>>> Possible new issues
>>>>>
>>>>> Here are the unknown changes that may have been introduced in
>>>>> Patchwork_101396v5:
>>>>>
>>>>>
>>>>> IGT changes
>>>>>
>>>>>
>>>>> Possible regressions
>>>>>
>>>>> * igt@i915_selftest@live@reset:
>>>>> o bat-adlp-4: PASS
>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_selftest@live@reset.html>
>>>>>
>>>>> -> DMESG-FAIL
>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_selftest@live@reset.html>
>>>>>
>>>>>
>>>>
>>>> I keep hitting clobbered pages during engine resets on bat-adlp-4.
>>>> It seems to happen most of the time on that machine and occasionally
>>>> on bat-adlp-6.
>>>>
>>>> Should bat-adlp-4 be considered an unreliable machine like
>>>> bat-adlp-6 is for now?
>>>>
>>>> Alternatively, seeing the history of this in
>>>>
>>>> commit 3da3c5c1c9825c24168f27b021339e90af37e969 "drm/i915: Exclude
>>>> low pages (128KiB) of stolen from use"
>>>>
>>>> could this be an indication that maybe the original issue is worse
>>>> on adlp machines?
>>>> I have only ever seen page page 135 or 136 clobbered across many
>>>> runs via trybot, so it looks fairly consistent.
>>>> Though excluding the use of over 540K of stolen might be too severe.
>>>
>>> Don't know but I see that on the latest version you even hit pages
>>> 165/166.
>>>
>>> Any history of hitting this in CI without your series? If not, are
>>> there some other changes which could explain it? Are you touching the
>>> selftest itself?
>>>
>>> Hexdump of the clobbered page looks quite complex. Especially
>>> POISON_FREE. Any idea how that ends up there?
>>
>>
>> (see
>> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v4/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings702)
>>
>>
>> after lots of slow debug via CI, it looks like the issue is that a
>> ring buffer was allocated and taking up that page during the initial
>> crc capture in the test, but by the time it came to check for
>> corruption, it had been freed from that page.
>>
>> The test has a number of weaknesses:
>>
>> 1. the busy check is done twice, without taking in to account any
>> change in between. I assume previously this could be relied on never
>> to occur, but now it can for some reason (more on that later)
>
> You mean the stolen page used/unused test? Probably the premise is that
> the test controls the driver completely ie. is the sole user and the two
> checks are run at the time where nothing else could have changed the state.
>
> With the nerfed request (as with GuC) this actually should hold. In the
> generic case I am less sure, my working knowledge faded a bit, but
> perhaps there was something guaranteeing the spinner couldn't have been
> retired yet at the time of the second check. Would need clarifying at
> least in comments.
>>
>> 2. the engine reset returns early with an error for guc submission
>> engines, but it is silently ignored in the test. Perhaps it should
>> ignore guc submission engines as it is a largely useless test for
>> those situations.
>
> Yes looks dodgy indeed. You will need to summon the owners of the GuC
> backend to comment on this.
>
> However even if the test should be skipped with GuC it is extremely
> interesting that you are hitting this so I suspect there is a more
> serious issue at play.
indeed. That's why I am keen to get to the root cause instead of just
slapping in a fix.
>
>> A quick obvious fix is to have a busy bitmask that remembers each
>> page's busy state initially and only check for corruption if it was
>> busy during both checks.
>>
>> However, the main question is why this is occurring now with my changes.
>> I have added more debug to check where the stolen memory is being
>> freed, but the first run last night didn't hit the issue for once.
>> I am running again now, will report back if I figure out where it is
>> being freed.
>>
>> I am pretty sure the "corruption" (which isn't actually corruption) is
>> from a ring buffer.
>> The POISON_FREE is the only difference between the captured before and
>> after dumps:
>>
>> [0040] 00000000 02800000 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b
>> 6b6b6b6b
>>
>> with the 2nd dword being the MI_ARB_CHECK used for the spinner.
>> I think this is the request poisoning from i915_request_retire()
>>
>> The bit I don't know yet is why a ring buffer was freed between the
>> initial crc capture and the corruption check. The spinner should be
>> active across the entire test, maintaining a ref on the context and
>> it's ring.
>>
>> hopefully my latest debug will give more answers.
>
> Yeah if you can figure our whether the a) spinner is still active during
> the 2nd check (as I think it should be), and b) is the corruption
> detected in the same pages which were used in the 1st pass that would be
> interesting.
yep. The latest run is still stuck in the CI queue after 27 hours.
I think I have enough debug in there to catch it now.
Hopefully I can get a root cause once it gets chance to run.
>
> Regards,
>
> Tvrtko
>
>>
>>
>>>
>>> Btw what is the benefit of converting stolen to start with? It's not
>>> much of a backend since it just uses the drm range manager. So quite
>>> thin and uneventful. Diffstats for the series also do not look like
>>> you end up with much code reduction?
>>>
>>> Regards,
>>>
>>> Tvrtko
next prev parent reply other threads:[~2022-06-28 16:22 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-06-20 21:33 [Intel-gfx] [PATCH v7 00/10] drm/i915: ttm for stolen Robert Beckett
2022-06-20 21:33 ` [Intel-gfx] [PATCH v7 01/10] drm/i915/ttm: dont trample cache_level overrides during ttm move Robert Beckett
2022-06-20 21:33 ` [Intel-gfx] [PATCH v7 02/10] drm/i915: limit ttm to dma32 for i965G[M] Robert Beckett
2022-06-20 21:33 ` [Intel-gfx] [PATCH v7 03/10] drm/i915/ttm: only trust snooping for dgfx when deciding default cache_level Robert Beckett
2022-06-20 21:33 ` [Intel-gfx] [PATCH v7 04/10] drm/i915/gem: selftest should not attempt mmap of private regions Robert Beckett
2022-06-20 21:33 ` [Intel-gfx] [PATCH v7 05/10] drm/i915: instantiate ttm ranger manager for stolen memory Robert Beckett
2022-06-20 21:33 ` [Intel-gfx] [PATCH v7 06/10] drm/i915: sanitize mem_flags for stolen buffers Robert Beckett
2022-06-20 21:33 ` [Intel-gfx] [PATCH v7 07/10] drm/i915: ttm move/clear logic fix Robert Beckett
2022-06-20 21:33 ` [Intel-gfx] [PATCH v7 08/10] drm/i915: allow memory region creators to alloc and free the region Robert Beckett
2022-06-20 21:33 ` [Intel-gfx] [PATCH v7 09/10] drm/i915/ttm: add buffer pin on alloc flag Robert Beckett
2022-06-20 21:33 ` [Intel-gfx] [PATCH v7 10/10] drm/i915: stolen memory use ttm backend Robert Beckett
2022-06-21 1:09 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for drm/i915: ttm for stolen (rev3) Patchwork
2022-06-21 1:39 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2022-06-21 15:10 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for drm/i915: ttm for stolen (rev4) Patchwork
2022-06-21 15:32 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2022-06-21 17:16 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for drm/i915: ttm for stolen (rev5) Patchwork
2022-06-21 17:37 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2022-06-21 19:11 ` Robert Beckett
2022-06-22 9:05 ` Tvrtko Ursulin
2022-06-27 17:08 ` Robert Beckett
2022-06-28 8:46 ` Tvrtko Ursulin
2022-06-28 16:22 ` Robert Beckett [this message]
2022-06-29 12:51 ` Robert Beckett
2022-06-30 14:20 ` Robert Beckett
2022-06-30 14:52 ` Hellstrom, Thomas
2022-06-30 15:24 ` Robert Beckett
2022-07-01 7:34 ` Tvrtko Ursulin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=b20da287-e7f4-ebad-a534-a129b57eeede@collabora.com \
--to=bob.beckett@collabora.com \
--cc=intel-gfx@lists.freedesktop.org \
--cc=tvrtko.ursulin@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox