From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id D5F91C3DA7D for ; Tue, 3 Jan 2023 11:59:54 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 893D210E25B; Tue, 3 Jan 2023 11:59:47 +0000 (UTC) Received: from wout2-smtp.messagingengine.com (wout2-smtp.messagingengine.com [64.147.123.25]) by gabe.freedesktop.org (Postfix) with ESMTPS id 89CBD10E17C for ; Mon, 2 Jan 2023 01:18:04 +0000 (UTC) Received: from compute3.internal (compute3.nyi.internal [10.202.2.43]) by mailout.west.internal (Postfix) with ESMTP id 23C7832006F5; Sun, 1 Jan 2023 20:18:03 -0500 (EST) Received: from mailfrontend1 ([10.202.2.162]) by compute3.internal (MEProxy); Sun, 01 Jan 2023 20:18:03 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= invisiblethingslab.com; h=cc:cc:content-type:date:date:from:from :in-reply-to:in-reply-to:message-id:mime-version:references :reply-to:sender:subject:subject:to:to; s=fm2; t=1672622282; x= 1672708682; bh=yVoTWhV9NOKPo4v/n3sT/VMH9GHO2ZdiZBvPkH9zdC8=; b=B vMoYz9Mhi9YgxkkD3IIl5q0kG4dW8IaCqWvko7S1VMASUN7f88iKJRsEvT/bFxJ/ m26ip5/42bqXaNAtxLqWebGyk+qWgv6BjAeJWMxd6viebeIdSLT5pAIgN3wWiRyW XywzNeCGT6D2BXAJ8TXz5jXrH6v3UeRi1RxgcEpOfaz8QryIo9lsoqC+Pv1CfskK RVyCZkZrt0qoq0q+Dq77ZmDQImvafYJzFBZb4oszMVZ5cHLMekMUa892VVZ1M6mH 210ErWh7zE5Hlzl5w2m7/2d9rKPCTYlTp4AlA38pqJYDQTA4tSpdxbG0i+EFBm9J Rgo9biNVcWIvomPQTPC/A== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:date:date:feedback-id :feedback-id:from:from:in-reply-to:in-reply-to:message-id :mime-version:references:reply-to:sender:subject:subject:to:to :x-me-proxy:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s= fm2; t=1672622282; x=1672708682; bh=yVoTWhV9NOKPo4v/n3sT/VMH9GHO 2ZdiZBvPkH9zdC8=; b=laMRUKP7nV5k4DmVETxqR13hJI8RB0akL2mRwZWmzr47 bpqo/s/VgAhLqew5gfWIZL5j/09kJT2k2sY0/uThi6YqBXl5R93Q5mjy3tpam9iP hRFFCKe5o7HJNnL2b3D9XLlsez1o6NNFxtuSCPhP/SZTcxJ9r5iNMyWFsX5fEb3y RQ3gC1P8Z+qyeVPq07pm0cpfXw2z20ycksnF5Ul08czz6+b8M0H381KyqhEOOfNC w5J/KUYP7A6BdOSNd2eP6uP7/r6Je4lFmfcVE8INpfIrqxWb/HxcrDNmaB97+cUX uriLyiscHnjVs9GaDQ8V+RlbpJ2WVro13xGJfivOEg== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvhedrjedugdefudcutefuodetggdotefrodftvf curfhrohhfihhlvgemucfhrghsthforghilhdpqfgfvfdpuffrtefokffrpgfnqfghnecu uegrihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenuc fjughrpeffhffvvefukfhfgggtuggjsehgtderredttdejnecuhfhrohhmpeffvghmihcu ofgrrhhivgcuqfgsvghnohhurhcuoeguvghmihesihhnvhhishhisghlvghthhhinhhgsh hlrggsrdgtohhmqeenucggtffrrghtthgvrhhnpeetleffvdehleeffeffudffvedutdev geffvdffkeevfeevkefhjeegleduteffjeenucffohhmrghinhepqhhusggvshdqohhsrd horhhgpdhgihhthhhusgdrtghomhdpkhgvrhhnvghlrdhorhhgpdhithhlrdhsphgrtggv necuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomhepuggvmh hisehinhhvihhsihgslhgvthhhihhnghhslhgrsgdrtghomh X-ME-Proxy: Feedback-ID: iac594737:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Sun, 1 Jan 2023 20:18:01 -0500 (EST) Date: Sun, 1 Jan 2023 20:17:52 -0500 From: Demi Marie Obenour To: Marek =?utf-8?Q?Marczykowski-G=C3=B3recki?= Message-ID: References: <1c326e0c-5812-083a-0739-aa20fab3efc4@citrix.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="2tpZ6p+tZudaMPXj" Content-Disposition: inline In-Reply-To: X-Mailman-Approved-At: Tue, 03 Jan 2023 11:59:45 +0000 Subject: Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Andrew Cooper , "intel-gfx@lists.freedesktop.org" , the arch/x86 maintainers , Lucas De Marchi , Daniel Vetter , Rodrigo Vivi , xen-devel Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" --2tpZ6p+tZudaMPXj Content-Type: text/plain; protected-headers=v1; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Date: Sun, 1 Jan 2023 20:17:52 -0500 From: Demi Marie Obenour To: Marek =?utf-8?Q?Marczykowski-G=C3=B3recki?= Cc: Ville =?utf-8?B?U3lyasOkbMOk?= , Andrew Cooper , "intel-gfx@lists.freedesktop.org" , the arch/x86 maintainers , Lucas De Marchi , Daniel Vetter , Rodrigo Vivi , xen-devel Subject: Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes On Mon, Jan 02, 2023 at 02:00:51AM +0100, Marek Marczykowski-G=C3=B3recki w= rote: > On Sun, Jan 01, 2023 at 07:03:18PM -0500, Demi Marie Obenour wrote: > > On Mon, Jan 02, 2023 at 12:24:54AM +0100, Marek Marczykowski-G=C3=B3rec= ki wrote: > > > On Thu, Dec 22, 2022 at 10:29:57AM +0200, Ville Syrj=C3=A4l=C3=A4 wro= te: > > > > On Fri, Dec 16, 2022 at 03:30:13PM +0000, Andrew Cooper wrote: > > > > > On 08/12/2022 1:55 pm, Marek Marczykowski-G=C3=B3recki wrote: > > > > > > Hi, > > > > > > > > > > > > There is an issue with i915 on Xen PV (dom0). The end result is= a lot of > > > > > > glitches, like here: https://openqa.qubes-os.org/tests/54748#st= ep/startup/8 > > > > > > (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's usin= g Xorg > > > > > > with "modesetting" driver. > > > > > > > > > > > > After some iterations of debugging, we narrowed it down to i915= handling > > > > > > caching. The main difference is that PAT is setup differently o= n Xen PV > > > > > > than on native Linux. Normally, Linux does have appropriate abs= traction > > > > > > for that, but apparently something related to i915 doesn't play= well > > > > > > with it. The specific difference is: > > > > > > native linux: > > > > > > x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT > > > > > > xen pv: > > > > > > x86/PAT: Configuration [0-7]: WB WT UC- UC WC WP UC UC > > > > > > ~~ ~~ ~~ ~~ > > > > > > > > > > > > The specific impact depends on kernel version and the hardware.= The most > > > > > > severe issues I see on >=3DADL, but some older hardware is affe= cted too - > > > > > > sometimes only if composition is disabled in the window manager. > > > > > > Some more information is collected at > > > > > > https://github.com/QubesOS/qubes-issues/issues/4782 (and few li= nked > > > > > > duplicates...). > > > > > > > > > > > > Kind-of related commit is here: > > > > > > https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/= i915: > > > > > > replace X86_FEATURE_PAT with pat_enabled()") - it is the place = where > > > > > > i915 explicitly checks for PAT support, so I'm cc-ing people me= ntioned > > > > > > there too. > > > > > > > > > > > > Any ideas? > > > > > > > > > > > > The issue can be easily reproduced without Xen too, by adjustin= g PAT in > > > > > > Linux: > > > > > > -----8<----- > > > > > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtyp= e.c > > > > > > index 66a209f7eb86..319ab60c8d8c 100644 > > > > > > --- a/arch/x86/mm/pat/memtype.c > > > > > > +++ b/arch/x86/mm/pat/memtype.c > > > > > > @@ -400,8 +400,8 @@ void pat_init(void) > > > > > > * The reserved slots are unused, but mapped to their > > > > > > * corresponding types in the presence of PAT errata. > > > > > > */ > > > > > > - pat =3D PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, = UC) | > > > > > > - PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT= ); > > > > > > + pat =3D PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, = UC) | > > > > > > + PAT(4, WC) | PAT(5, WP) | PAT(6, UC) | PAT(7, UC= ); > > > > > > } > > > > > > =20 > > > > > > if (!pat_bp_initialized) { > > > > > > -----8<----- > > > > > > > > > > >=20 > > > > > Hello, can anyone help please? > > > > >=20 > > > > > Intel's CI has taken this reproducer of the bug, and confirmed the > > > > > regression.=C2=A0 > > > > > https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4= 480c15a0d117dce6210562eb542875e757647fb > > > > >=20 > > > > > We're reasonably confident that it is an i915 bug (given the repr= o with > > > > > no Xen in the mix), but we're out of any further ideas. > > > >=20 > > > > I don't think we have any code that assumes anything about the PAT, > > > > apart from WC being available (which seems like it should still be > > > > the case with your modified PAT). I suppose you'll just have to=20 > > > > start digging from pgprot_writecombine()/noncached() and make sure > > > > everything ends up using the correct PAT entry. > > >=20 > > > I tried several approach to this, without success. Here is an update = on > > > debugging (reported also on #intel-gfx live): > > >=20 > > > I did several tests with different PAT configuration (by modifying Xen > > > that sets the MSR). Full table is at https://pad.itl.space/sheet/#/2/= sheet/view/HD1qT2Zf44Ha36TJ3wj2YL+PchsTidyNTFepW5++ZKM/ > > > Some highlights: > > > - 1=3DWC, 4=3DWT - good > > > - 1=3DWT, 4=3DWC - bad > > > - 1=3DWT, 3=3DWC (4=3DWC too) - good > > > - 1=3DWT, 5=3DWC - good > > >=20 > > > So, for me it seems WC at index 4 is problematic for some reason. > > >=20 > > > Next, I tried to trap all the places in arch/x86/xen/mmu_pv.c that > > > write PTEs and verify requested cache attributes. There, it seems all > > > the requested WC are properly translated (using either index 1, 3, 4,= or > > > 5 according to PAT settings). And then after reading PTE back, it ind= eed > > > seems to be correctly set. I didn't added reading back after > > > HYPERVISOR_update_va_mapping, but verified it isn't used for setting = WC. > > >=20 > > > Using the same method, I also checked that indexes that aren't suppos= ed > > > to be used (for example index 4 when both 3 and 4 are WC) indeed are = not > > > used. So, the hypothesis that specific indexes are hardcoded somewhere > > > is unlikely. > > >=20 > > > This all looks very weird to me. Any ideas? > >=20 > > Old CPUs have had hardware errata that caused the top bit of the PAT > > entry to be ignored in certain cases. Could modern CPUs be ignoring > > this bit when accessing iGPU memory or registers? With WC at position > > 4, this would cause WC to be treated as WB, which is consistent with the > > observed behavior. WC at position 3 would not be impacted, and WC at > > position 5 would be treated as WT which I expect to be safe. One way to > > test this is to test 1=3DWB, 5=3DWC. If my hypothesis is correct, this > > should trigger the bug, even if entry 1 in the PAT is unused because > > entry 0 is also WB. >=20 > This looks like a very probable situation, indeed 1=3DWB, 5=3DWC does > trigger the bug! Specifically this layout: >=20 > WB WB UC- UC WP WC WT UC What about WB WT WB UC WB WP WC UC- and WB WT WT UC WB WP WC UC-? Those only differ in entry 2, which will not be used as it duplicates entry 0 or 1. Therefore, architecturally, these should behave identically. If I am correct, the second will work fine, but the first will trigger the bug. --=20 Sincerely, Demi Marie Obenour (she/her/hers) Invisible Things Lab --2tpZ6p+tZudaMPXj Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEdodNnxM2uiJZBxxxsoi1X/+cIsEFAmOyMMcACgkQsoi1X/+c IsFgvA/9FmiZPls9UyctSYLzjz5C03aRvLIN/oFqAo/laqzBcDPHVJeCLWr39qy6 hUoXgzqIGISsDyAmzAgX3+fdd7WkNRjqworEjRdtYRFu7hGkzbPkFd7oa7Ty+CCk QUy7oZo8u/UEjiBfTctO5P/YZaKFIqZyLFeVRRZp3kZUS+pk35IJ4EE7n7bm7duH HHGvrptmL8u2KpPNlPLKQamQYHDVemsf6q5MJQ7HdL6ouD3+bv/DFqxxeYECNhiP XiJ3lyZdbgrzoRGFGTM7MzxHZkG3agWcE4tz6s4owEYUxdkUb4YX4Kr8Nd9SK00F hy3l266Auwi1s6NZfXx3IA4zLwlII+7UmBhUai42dKSNVX0+8lIDt7wYArVoW2f+ wXlZk8fDxr/tqx/wzCj1thWIdcue10TB/xPFXmSdcW0zDz82rU6xg0aLg8OWlKR7 ssGUl28ApVPgD95u+xONhGvWegD7N1TjUGVTTPMwHh12308h8T9sze//53ws5Jjy yv6si59zBe2qN2RroSBtcAM9qX+yrTovpY38V8sde+sIPIXLOrLdRNekpQK9AbsE Gbxg+02/9Exq0KKgS/6KmbUUD/NRpK+Q/gHAWPAS5uU9QmOhIs6OGiQ3Ni1Z4uvy e0lpaiLd3TTfrY1U0Z/4dmBsqbZHCHw+L6B75y8ZsTcN4EIS2vE= =rum4 -----END PGP SIGNATURE----- --2tpZ6p+tZudaMPXj--