From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 23AACC433EF for ; Wed, 29 Jun 2022 12:52:00 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 64AB310EB69; Wed, 29 Jun 2022 12:51:59 +0000 (UTC) Received: from madras.collabora.co.uk (madras.collabora.co.uk [46.235.227.172]) by gabe.freedesktop.org (Postfix) with ESMTPS id F332F10EB69 for ; Wed, 29 Jun 2022 12:51:57 +0000 (UTC) Received: from [192.168.1.110] (82-69-11-12.dsl.in-addr.zen.co.uk [82.69.11.12]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) (Authenticated sender: bbeckett) by madras.collabora.co.uk (Postfix) with ESMTPSA id 6A83B660191F; Wed, 29 Jun 2022 13:51:56 +0100 (BST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=collabora.com; s=mail; t=1656507116; bh=iVKYfbgfLeQ1/S/9ByVv2GLg5D47Hdc+I6uPozqRM1g=; h=Date:Subject:From:To:References:In-Reply-To:From; b=j6M5lf0YiLWrFXQOmcvo5QPm3utv2hySi1fae2+xjvZPDmCFSYrncPdiCM3uDJmBK VKFn4IoaslIoaI3a6MTRLueltUiA/qmPgF04ZxLgQDVX7fBDd797NBnUYejujWTnqC K0Ldn/+oE6Lp2tV/TYjdx0uI97shTB45EHq3RSqHSinkgm8kOi4vylTImXnk7/mahO jeCviG5cxrywMwRSN9/b7l9AhJcSJ9FbXs4aJbf4owkGjvful/WmIHVnREOc8PTOfa 25MrrouAcmRpa2kYBAMTRnG6XwyJ4kFKbuBi6s6QMBE77MVhKsOYPm5iQ8RthLnE7/ O1DYRN6XDaDBg== Message-ID: Date: Wed, 29 Jun 2022 13:51:54 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.10.0 Content-Language: en-US From: Robert Beckett To: Tvrtko Ursulin , intel-gfx@lists.freedesktop.org, "Hellstrom, Thomas" References: <20220620213340.3199605-1-bob.beckett@collabora.com> <165583307119.13647.9219456323624399121@emeril.freedesktop.org> <20dedda6-534b-b39e-fb35-8becc5c27043@collabora.com> <2752282d-a28a-ef9b-8b27-2d84aee9c8bf@linux.intel.com> <4eee638e-ff08-0e45-ac79-ca6c77a0bb6f@collabora.com> <6d608010-a9b1-8182-a967-6ea435fd20a4@linux.intel.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Subject: Re: [Intel-gfx] =?utf-8?b?4pyXIEZpLkNJLkJBVDogZmFpbHVyZSBmb3IgZHJt?= =?utf-8?q?/i915=3A_ttm_for_stolen_=28rev5=29?= X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" On 28/06/2022 17:22, Robert Beckett wrote: > > > On 28/06/2022 09:46, Tvrtko Ursulin wrote: >> >> On 27/06/2022 18:08, Robert Beckett wrote: >>> >>> >>> On 22/06/2022 10:05, Tvrtko Ursulin wrote: >>>> >>>> On 21/06/2022 20:11, Robert Beckett wrote: >>>>> >>>>> >>>>> On 21/06/2022 18:37, Patchwork wrote: >>>>>> *Patch Details* >>>>>> *Series:*    drm/i915: ttm for stolen (rev5) >>>>>> *URL:*    https://patchwork.freedesktop.org/series/101396/ >>>>>> >>>>>> *State:*    failure >>>>>> *Details:* >>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>   CI Bug Log - changes from CI_DRM_11790 -> Patchwork_101396v5 >>>>>> >>>>>> >>>>>>     Summary >>>>>> >>>>>> *FAILURE* >>>>>> >>>>>> Serious unknown changes coming with Patchwork_101396v5 absolutely >>>>>> need to be >>>>>> verified manually. >>>>>> >>>>>> If you think the reported changes have nothing to do with the changes >>>>>> introduced in Patchwork_101396v5, please notify your bug team to >>>>>> allow them >>>>>> to document this new failure mode, which will reduce false >>>>>> positives in CI. >>>>>> >>>>>> External URL: >>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html >>>>>> >>>>>> >>>>>> >>>>>>     Participating hosts (40 -> 41) >>>>>> >>>>>> Additional (2): fi-icl-u2 bat-dg2-9 >>>>>> Missing (1): fi-bdw-samus >>>>>> >>>>>> >>>>>>     Possible new issues >>>>>> >>>>>> Here are the unknown changes that may have been introduced in >>>>>> Patchwork_101396v5: >>>>>> >>>>>> >>>>>>       IGT changes >>>>>> >>>>>> >>>>>>         Possible regressions >>>>>> >>>>>>   * igt@i915_selftest@live@reset: >>>>>>       o bat-adlp-4: PASS >>>>>> >>>>>> >>>>>>         -> DMESG-FAIL >>>>>> >>>>>> >>>>>> >>>>> >>>>> I keep hitting clobbered pages during engine resets on bat-adlp-4. >>>>> It seems to happen most of the time on that machine and >>>>> occasionally on bat-adlp-6. >>>>> >>>>> Should bat-adlp-4 be considered an unreliable machine like >>>>> bat-adlp-6 is for now? >>>>> >>>>> Alternatively, seeing the history of this in >>>>> >>>>> commit 3da3c5c1c9825c24168f27b021339e90af37e969 "drm/i915: Exclude >>>>> low pages (128KiB) of stolen from use" >>>>> >>>>> could this be an indication that maybe the original issue is worse >>>>> on adlp machines? >>>>> I have only ever seen page page 135 or 136 clobbered across many >>>>> runs via trybot, so it looks fairly consistent. >>>>> Though excluding the use of over 540K of stolen might be too severe. >>>> >>>> Don't know but I see that on the latest version you even hit pages >>>> 165/166. >>>> >>>> Any history of hitting this in CI without your series? If not, are >>>> there some other changes which could explain it? Are you touching >>>> the selftest itself? >>>> >>>> Hexdump of the clobbered page looks quite complex. Especially >>>> POISON_FREE. Any idea how that ends up there? >>> >>> >>> (see >>> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v4/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings702) >>> >>> >>> after lots of slow debug via CI, it looks like the issue is that a >>> ring buffer was allocated and taking up that page during the initial >>> crc capture in the test, but by the time it came to check for >>> corruption, it had been freed from that page. >>> >>> The test has a number of weaknesses: >>> >>> 1. the busy check is done twice, without taking in to account any >>> change in between. I assume previously this could be relied on never >>> to occur, but now it can for some reason (more on that later) >> >> You mean the stolen page used/unused test? Probably the premise is >> that the test controls the driver completely ie. is the sole user and >> the two checks are run at the time where nothing else could have >> changed the state. >> >> With the nerfed request (as with GuC) this actually should hold. In >> the generic case I am less sure, my working knowledge faded a bit, but >> perhaps there was something guaranteeing the spinner couldn't have >> been retired yet at the time of the second check. Would need >> clarifying at least in comments. >>> >>> 2. the engine reset returns early with an error for guc submission >>> engines, but it is silently ignored in the test. Perhaps it should >>> ignore guc submission engines as it is a largely useless test for >>> those situations. >> >> Yes looks dodgy indeed. You will need to summon the owners of the GuC >> backend to comment on this. >> >> However even if the test should be skipped with GuC it is extremely >> interesting that you are hitting this so I suspect there is a more >> serious issue at play. > > indeed. That's why I am keen to get to the root cause instead of just > slapping in a fix. > >> >>> A quick obvious fix is to have a busy bitmask that remembers each >>> page's busy state initially and only check for corruption if it was >>> busy during both checks. >>> >>> However, the main question is why this is occurring now with my changes. >>> I have added more debug to check where the stolen memory is being >>> freed, but the first run last night didn't hit the issue for once. >>> I am running again now, will report back if I figure out where it is >>> being freed. >>> >>> I am pretty sure the "corruption" (which isn't actually corruption) >>> is from a ring buffer. >>> The POISON_FREE is the only difference between the captured before >>> and after dumps: >>> >>> [0040] 00000000 02800000 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b >>> 6b6b6b6b >>> >>> with the 2nd dword being the MI_ARB_CHECK used for the spinner. >>> I think this is the request poisoning from i915_request_retire() >>> >>> The bit I don't know yet is why a ring buffer was freed between the >>> initial crc capture and the corruption check. The spinner should be >>> active across the entire test, maintaining a ref on the context and >>> it's ring. >>> >>> hopefully my latest debug will give more answers. >> >> Yeah if you can figure our whether the a) spinner is still active >> during the 2nd check (as I think it should be), and b) is the >> corruption detected in the same pages which were used in the 1st pass >> that would be interesting. > > yep. The latest run is still stuck in the CI queue after 27 hours. > I think I have enough debug in there to catch it now. > Hopefully I can get a root cause once it gets chance to run. > https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v7/fi-adl-ddr5/igt@i915_selftest@live@reset.html#dmesg-warnings496 well, the run finally happened. And it shows that the freed resource happens from a workqueue. Not helpful. I'll now add a saved stack traces to all objects that saves where it is allocated and freed/queued for free. > >> >> Regards, >> >> Tvrtko >> >>> >>> >>>> >>>> Btw what is the benefit of converting stolen to start with? It's not >>>> much of a backend since it just uses the drm range manager. So quite >>>> thin and uneventful. Diffstats for the series also do not look like >>>> you end up with much code reduction? >>>> >>>> Regards, >>>> >>>> Tvrtko