From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id BE9D0CAC5AE for ; Fri, 26 Sep 2025 07:42:03 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 7E73C10E9E2; Fri, 26 Sep 2025 07:42:03 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="eORws3tM"; dkim-atps=neutral Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.7]) by gabe.freedesktop.org (Postfix) with ESMTPS id AD7FE10E9E2 for ; Fri, 26 Sep 2025 07:42:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1758872522; x=1790408522; h=date:from:to:cc:subject:message-id:references: mime-version:content-transfer-encoding:in-reply-to; bh=k81HwlEeCFoDjAOZDV6d4LWKvyBMYDj2QLJNbeDjsns=; b=eORws3tMntVBzqbnADx36yfE57Ci4zlPCgSHb4+txOGY/glflnjRyI4q JIsFX2QdQAznJ7e1L45xLptk+rvSiC2j2ziF1OgadaskMDUhvg2wlcLH6 QLcbdEMCnw3vbnw5p6ahy8h+8mIp1jfgWsIMrSaZ+3QpG5MZJ1z6RRKVa Ca8DG8c4wCz5qYfM8CdIIHq1V9d96NwXzYWnJ3qhAW/4QHSK3ok9ivYS2 hs+ZXdUuV0bHSbIwWQrAdC5rVlKdvABwxBarbHN8fyI0WcXBFAJkcTtd9 tjmcVTUFz4+NTLhsTfvDTWJ115ijw29LdbalbZ6BNw7SISY/EKpidHYrh w==; X-CSE-ConnectionGUID: 6OWg2GVuRmGY8H7SmR8ioA== X-CSE-MsgGUID: b7plzzrxTIqhMgbH9YDssQ== X-IronPort-AV: E=McAfee;i="6800,10657,11564"; a="86645964" X-IronPort-AV: E=Sophos;i="6.18,294,1751266800"; d="scan'208";a="86645964" Received: from orviesa003.jf.intel.com ([10.64.159.143]) by fmvoesa101.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Sep 2025 00:42:01 -0700 X-CSE-ConnectionGUID: YqSJ0NvYRgqoNjVEZSIF+Q== X-CSE-MsgGUID: 45BvtTZIRbmdEuUT+/3QtQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.18,294,1751266800"; d="scan'208";a="181545749" Received: from abityuts-desk.ger.corp.intel.com (HELO localhost) ([10.245.244.175]) by ORVIESA003-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Sep 2025 00:42:00 -0700 Date: Fri, 26 Sep 2025 10:41:56 +0300 From: Ville =?iso-8859-1?Q?Syrj=E4l=E4?= To: Tvrtko Ursulin Cc: intel-xe@lists.freedesktop.org, kernel-dev@igalia.com Subject: Re: [PATCH v12 11/13] drm/xe: Force flush system memory AuxCCS framebuffers before scan out Message-ID: References: <8bec1692-1bec-45ff-a552-10ff8c7420b5@igalia.com> <7e07606b-d542-4407-a092-476f202cc8e2@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Patchwork-Hint: comment Organization: Intel Finland Oy - BIC 0357606-4 - Westendinkatu 7, 02160 Espoo X-BeenThere: intel-xe@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel Xe graphics driver List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-xe-bounces@lists.freedesktop.org Sender: "Intel-xe" On Thu, Sep 25, 2025 at 11:08:50AM +0100, Tvrtko Ursulin wrote: > > On 25/09/2025 08:24, Tvrtko Ursulin wrote: > > > > On 24/09/2025 23:35, Ville Syrjälä wrote: > >> On Wed, Sep 24, 2025 at 02:09:42PM +0100, Tvrtko Ursulin wrote: > >>> > >>> On 23/09/2025 15:52, Ville Syrjälä wrote: > >>>> On Tue, Sep 23, 2025 at 03:40:55PM +0100, Tvrtko Ursulin wrote: > >>>>> > >>>>> On 23/09/2025 14:20, Ville Syrjälä wrote: > >>>>>> On Tue, Sep 23, 2025 at 01:25:58PM +0100, Tvrtko Ursulin wrote: > >>>>>>> > >>>>>>> On 23/09/2025 13:01, Ville Syrjälä wrote: > >>>>>>>> On Tue, Sep 23, 2025 at 11:48:59AM +0100, Tvrtko Ursulin wrote: > >>>>>>>>> > >>>>>>>>> On 23/09/2025 11:19, Ville Syrjälä wrote: > >>>>>>>>>> On Tue, Sep 23, 2025 at 11:08:04AM +0100, Tvrtko Ursulin wrote: > >>>>>>>>>>> Even though frame buffer objects are created as write- > >>>>>>>>>>> combined, in > >>>>>>>>>>> practice, on top of all the ring buffer flushing, an > >>>>>>>>>>> additional clflush > >>>>>>>>>>> seems to be needed before display engine can coherently scan > >>>>>>>>>>> out the > >>>>>>>>>>> AuxCCS compressed data without transient artifacts. > >>>>>>>>>>> > >>>>>>>>>>> If for comparison we look at how i915 handles things (where > >>>>>>>>>>> AuxCCS works > >>>>>>>>>>> fine), as it happens it has this same clflush before a frame > >>>>>>>>>>> buffer is > >>>>>>>>>>> pinned for display for the first time, courtesy the dynamic > >>>>>>>>>>> tracking of > >>>>>>>>>>> the buffer cache mode and setting the latter to uncached > >>>>>>>>>>> before handing > >>>>>>>>>>> to display. > >>>>>>>>>>> > >>>>>>>>>>> Since xe considers the buffer object caching mode as static > >>>>>>>>>>> we can > >>>>>>>>>>> implement the same approach by adding a flag telling us if > >>>>>>>>>>> the buffer > >>>>>>>>>>> was ever pinned for display and flush on the first pin. > >>>>>>>>>>> Subsequent re-pins > >>>>>>>>>>> will not repeat the clflush but so far I have not observed > >>>>>>>>>>> any glitching > >>>>>>>>>>> after the first pin. > >>>>>>>>>>> > >>>>>>>>>>> Signed-off-by: Tvrtko Ursulin > >>>>>>>>>>> --- > >>>>>>>>>>>       drivers/gpu/drm/xe/display/xe_fb_pin.c | 14 +++++++++++++- > >>>>>>>>>>>       drivers/gpu/drm/xe/xe_bo_types.h       | 14 +++++++++----- > >>>>>>>>>>>       2 files changed, 22 insertions(+), 6 deletions(-) > >>>>>>>>>>> > >>>>>>>>>>> diff --git a/drivers/gpu/drm/xe/display/xe_fb_pin.c b/ > >>>>>>>>>>> drivers/gpu/drm/xe/display/xe_fb_pin.c > >>>>>>>>>>> index d8aa23b8cf14..f247c0da6b9e 100644 > >>>>>>>>>>> --- a/drivers/gpu/drm/xe/display/xe_fb_pin.c > >>>>>>>>>>> +++ b/drivers/gpu/drm/xe/display/xe_fb_pin.c > >>>>>>>>>>> @@ -382,6 +382,7 @@ static struct i915_vma > >>>>>>>>>>> *__xe_pin_fb_vma(const struct intel_framebuffer *fb, > >>>>>>>>>>>           struct xe_bo *bo = gem_to_xe_bo(obj); > >>>>>>>>>>>           struct xe_validation_ctx ctx; > >>>>>>>>>>>           struct drm_exec exec; > >>>>>>>>>>> +    bool first_pin; > >>>>>>>>>>>           int ret = 0; > >>>>>>>>>>>           if (!vma) > >>>>>>>>>>> @@ -422,8 +423,11 @@ static struct i915_vma > >>>>>>>>>>> *__xe_pin_fb_vma(const struct intel_framebuffer *fb, > >>>>>>>>>>>                   ret = xe_bo_validate(bo, NULL, true, &exec); > >>>>>>>>>>>               drm_exec_retry_on_contention(&exec); > >>>>>>>>>>>               xe_validation_retry_on_oom(&ctx, &ret); > >>>>>>>>>>> -        if (!ret) > >>>>>>>>>>> +        if (!ret) { > >>>>>>>>>>>                   ttm_bo_pin(&bo->ttm); > >>>>>>>>>>> +            first_pin = !bo->display_pin; > >>>>>>>>>>> +            bo->display_pin = true; > >>>>>>>>>>> +        } > >>>>>>>>>>>           } > >>>>>>>>>>>           if (ret) > >>>>>>>>>>>               goto err; > >>>>>>>>>>> @@ -436,6 +440,14 @@ static struct i915_vma > >>>>>>>>>>> *__xe_pin_fb_vma(const struct intel_framebuffer *fb, > >>>>>>>>>>>           if (ret) > >>>>>>>>>>>               goto err_unpin; > >>>>>>>>>>> +    /* > >>>>>>>>>>> +     * Force flush frame buffer data for non-coherent > >>>>>>>>>>> display access when > >>>>>>>>>>> +     * AuxCCS formats are used. > >>>>>>>>>>> +     */ > >>>>>>>>>>> +    if (first_pin && !xe_bo_is_vram(bo) && ! > >>>>>>>>>>> xe_bo_is_stolen(bo) && > >>>>>>>>>>> +        intel_fb_is_ccs_modifier(fb->base.modifier)) > >>>>>>>>>>> +        drm_clflush_sg(xe_bo_sg(bo)); > >>>>>>>>>> > >>>>>>>>>> You still haven't found the actual bug that causes the dirty > >>>>>>>>>> cache? > >>>>>>>>> > >>>>>>>>> Sadly no. I cross referenced everything numerous times, including > >>>>>>>>> workarounds, tried pretty much 1:1 i915 vs xe ring buffer > >>>>>>>>> programming, > >>>>>>>>> but this extra flush always remains required. I also heard that > >>>>>>>>> how > >>>>>>>>> flushing of the aux metadata works isn't documented anywhere > >>>>>>>>> and i915 > >>>>>>>>> does have this flush, by design or accident I don't know. > >>>>>>>> > >>>>>>>> i915 has the flush for the *whole* bo because it started out as > >>>>>>>> cached. As soon as it undergoes a cached->uncached change (either > >>>>>>>> due to set_caching ioctl or becoming a display scanout buffer) we > >>>>>>>> clflush it and switch the GPU page tables from WB to UC. After > >>>>>>>> that the bo stays uncached, and will need no further clflushes, > >>>>>>>> assuming that everyone follows the rules: > >>>>>>>> > >>>>>>>> - CPU accesses via a WB mapping to an uncached bo will explicitly > >>>>>>>>       clflush before (invalidate) and after (write-back) the access > >>>>>>>> - userspace always sets the GPU to use use "consult the PTE" MOCS > >>>>>>>>       setting for any potential scanout buffer. That way the GPU > >>>>>>>> will > >>>>>>>>       use WB as long as the bo is cached, and once it becomes > >>>>>>>>       uncached the GPU also switches to UC accesses > >>>>>>>> > >>>>>>>> With xe I presume the BO should already start out as UC/WC with > >>>>>>>> clean caches, and nothing should be dirtying the caches unless > >>>>>>>> there is a real bug somewhere. > >>>>>>> > >>>>>>> Correct, it is all WC and MOCS are correctly set as WC. > >>>>>>> > >>>>>>> For example for the media compression flavour that was broken and > >>>>>>> needed > >>>>>>> fixing in b412b144685f ("lib/intel/veboxcopy: Respect buffer MOCS > >>>>>>> index"). (For the render compression flavour MOCS was already > >>>>>>> correct.) > >>>>>>> But MOCS is not enough on its own for some reason. > >>>>>>> > >>>>>>> I also looked into the rendercopy surface state programming and > >>>>>>> that too > >>>>>>> looks okay. In fact it is possible to see how using the wrong > >>>>>>> MOCS makes > >>>>>>> things worse. But it also appears the surface state MOCS only > >>>>>>> applies to > >>>>>>> the main surface, while the aux state is the one which appears to > >>>>>>> contain cached/unflushed data. > >>>>>> > >>>>>> Are you saying that if you start scanning out a compressed buffer, > >>>>>> then clean the caches, and then do frontbuffer rendering you get > >>>>>> more cache dirt? > >>>>> > >>>>> Render with the GPU or CPU? Do you know of any tests or userspaces > >>>>> which > >>>>> do that? > >>>> > >>>> On the GPU. > >>>> > >>>> Can't think of anything nice off the top of my head. > >>>> > >>>> I'd probably just write a quick igt: > >>>> 1. create a compressed fb > >>>> 2. flip to it > >>>> 3. manually clflush the whole thing to be sure it's clean > >>>> 4. rendercopy some junk around, and keep an eye out for cache dirt > >>> > >>> Looks the same dirt on i915 and xe, which would support the theory that > >>> MOCS settings do not apply to the aux surface. > >>> > >>> If you look at these videos frame by frame you can see it: > >>> > >>> https://people.igalia.com/tursulin/i915-ccs-cache-dirt.MOV > >>> https://people.igalia.com/tursulin/xe-ccs-cache-dirt.MOV > >>> > >> > >> That doesn't really look like cache dirt since it doesn't linger. > >> So I'd say that is more tearing than anything. Tearing can look > >> rather weird with tiled/compressed buffers. > > > > Some occurrences have two frames of artefacts before the grey square > > settles. With a 90Hz display and camera taking 60fps video can it be > > tearing? > > > >> To raelly see cache dirt you should tweak the test to > >> select a wb mocs. > >> > >> Oh, and disable CPU c-states. Turns out deep c-states cause extra > >> cache flushes all the time which will hide the dirt in short order. > >> Without those extra flushes cache dirt will linger for a long time > >> on screen. > > > > This I didn't do, will try. > > With no C-states results are indeed much more "fun". Although still the > same between i915 and xe: > > https://people.igalia.com/tursulin/i915-ccs-cache-dirt-nocstates.MOV > https://people.igalia.com/tursulin/xe-ccs-cache-dirt-nocstates.MOV > > Still MOCS 3 so uncached in render state verified for both. I reverse engineered this a bit and there's definitely a MOCS issue at play. First I noticed that if filled the entire MOCS table with UC the problem went away. I then filled the entire table with WB and essentially bisected what I need to make UC to fix it. And I had to repeat that same process starting from the other end of table. Looks like there is some undocumented magic in the hardware. MOCS 61 really is special: - MOCS 61 UC, others WB, select MOCS 61 -> no corruption MOCS 0 and 63 are special in other ways: - MOCS X UC, others WB, select MOCS X -> corruption - MOCS X+0 UC, others WB, select MOCS X -> corruption - MOCS X+63 UC, others WB, select MOCS X -> corruption - MOCS X+0+63 UC, others WB, select MOCS X -> no corruption where X != 61 I didn't actually test all values of X there, but I did spot check a handful of them. Also, ADL is affected, but TGL doesn't seem to be. Though I still need to check the situation on TGL a bit more thoroughly. -- Ville Syrjälä Intel