public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Chuck Zmudzinski <brchuckz@netscape.net>
To: Jan Beulich <jbeulich@suse.com>,
	regressions@lists.linux.dev, stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Andy Lutomirski <luto@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Jani Nikula <jani.nikula@linux.intel.com>,
	Joonas Lahtinen <joonas.lahtinen@linux.intel.com>,
	Rodrigo Vivi <rodrigo.vivi@intel.com>,
	Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	David Airlie <airlied@linux.ie>, Daniel Vetter <daniel@ffwll.ch>,
	xen-devel@lists.xenproject.org, x86@kernel.org,
	linux-kernel@vger.kernel.org, intel-gfx@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org, Juergen Gross <jgross@suse.com>
Subject: Re: [REGRESSION} Re: [PATCH 2/2] x86/pat: add functions to query specific cache mode availability
Date: Fri, 20 May 2022 13:13:25 -0400	[thread overview]
Message-ID: <0a2e61ea-a73c-bbdc-e7c7-5110162b39bb@netscape.net> (raw)
In-Reply-To: <3efb9e54-b0d6-36db-c1c4-68d4f8f9a5ed@netscape.net>

I think this summary of the regression is appropriate for a top-post. 
Details follow below.

commit bdd8b6c98239: introduced what I call a real regression which 
persists in 5.17.x

Jan's proposed patch: 
https://lore.kernel.org/lkml/9385fa60-fa5d-f559-a137-6608408f88b0@suse.com/

Jan's patch would fix the real regression introduced by bdd8b6c98239 when
the nopat option is not enabled, but when the nopat option is enabled, this
patch would introduce what Jan calls a "perceived regression" that is really
caused by the failure of the i915 driver to handle the case of the nopat 
option
being provided on the command line properly.

What I request: commit Jan's proposed patch, and backport it to 5.17. That
would fix the real regression and only cause a perceived regression for the
case when nopat is enabled. In that case, patches to the i915 driver
would be helpful but necessary to fix a regression.

Regard,

Chuck Zmudzinski

On 5/20/2022 11:46 AM, Chuck Zmudzinski wrote:
> On 5/20/2022 10:06 AM, Jan Beulich wrote:
>> On 20.05.2022 15:33, Chuck Zmudzinski wrote:
>>> On 5/20/2022 5:41 AM, Jan Beulich wrote:
>>>> On 20.05.2022 10:30, Chuck Zmudzinski wrote:
>>>>> On 5/20/2022 2:59 AM, Chuck Zmudzinski wrote:
>>>>>> On 5/20/2022 2:05 AM, Jan Beulich wrote:
>>>>>>> On 20.05.2022 06:43, Chuck Zmudzinski wrote:
>>>>>>>> On 5/4/22 5:14 AM, Juergen Gross wrote:
>>>>>>>>> On 04.05.22 10:31, Jan Beulich wrote:
>>>>>>>>>> On 03.05.2022 15:22, Juergen Gross wrote:
>>>>>>>>>>
>>>>>>>>>> ... these uses there are several more. You say nothing on why
>>>>>>>>>> those want
>>>>>>>>>> leaving unaltered. When preparing my earlier patch I did 
>>>>>>>>>> inspect them
>>>>>>>>>> and came to the conclusion that these all would also better
>>>>>>>>>> observe the
>>>>>>>>>> adjusted behavior (or else I couldn't have left pat_enabled() 
>>>>>>>>>> as the
>>>>>>>>>> only predicate). In fact, as said in the description of my 
>>>>>>>>>> earlier
>>>>>>>>>> patch, in
>>>>>>>>>> my debugging I did find the use in i915_gem_object_pin_map() 
>>>>>>>>>> to be
>>>>>>>>>> the
>>>>>>>>>> problematic one, which you leave alone.
>>>>>>>>> Oh, I missed that one, sorry.
>>>>>>>> That is why your patch would not fix my Haswell unless
>>>>>>>> it also touches i915_gem_object_pin_map() in
>>>>>>>> drivers/gpu/drm/i915/gem/i915_gem_pages.c
>>>>>>>>
>>>>>>>>> I wanted to be rather defensive in my changes, but I agree at 
>>>>>>>>> least
>>>>>>>>> the
>>>>>>>>> case in arch_phys_wc_add() might want to be changed, too.
>>>>>>>> I think your approach needs to be more aggressive so it will fix
>>>>>>>> all the known false negatives introduced by bdd8b6c98239
>>>>>>>> such as the one in i915_gem_object_pin_map().
>>>>>>>>
>>>>>>>> I looked at Jan's approach and I think it would fix the issue
>>>>>>>> with my Haswell as long as I don't use the nopat option. I
>>>>>>>> really don't have a strong opinion on that question, but I
>>>>>>>> think the nopat option as a Linux kernel option, as opposed
>>>>>>>> to a hypervisor option, should only affect the kernel, and
>>>>>>>> if the hypervisor provides the pat feature, then the kernel
>>>>>>>> should not override that,
>>>>>>> Hmm, why would the kernel not be allowed to override that? Such
>>>>>>> an override would affect only the single domain where the
>>>>>>> kernel runs; other domains could take their own decisions.
>>>>>>>
>>>>>>> Also, for the sake of completeness: "nopat" used when running on
>>>>>>> bare metal has the same bad effect on system boot, so there
>>>>>>> pretty clearly is an error cleanup issue in the i915 driver. But
>>>>>>> that's orthogonal, and I expect the maintainers may not even care
>>>>>>> (but tell us "don't do that then").
>>>>> Actually I just did a test with the last official Debian kernel
>>>>> build of Linux 5.16, that is, a kernel before bdd8b6c98239 was
>>>>> applied. In fact, the nopat option does *not* break the i915 driver
>>>>> in 5.16. That is, with the nopat option, the i915 driver loads
>>>>> normally on both the bare metal and on the Xen hypervisor.
>>>>> That means your presumption (and the presumption of
>>>>> the author of bdd8b6c98239) that the "nopat" option was
>>>>> being observed by the i915 driver is incorrect. Setting "nopat"
>>>>> had no effect on my system with Linux 5.16. So after doing these
>>>>> tests, I am against the aggressive approach of breaking the i915
>>>>> driver with the "nopat" option because prior to bdd8b6c98239,
>>>>> nopat did not break the i915 driver. Why break it now?
>>>> Because that's, in my understanding, is the purpose of "nopat"
>>>> (not breaking the driver of course - that's a driver bug -, but
>>>> having an effect on the driver).
>>> I wouldn't call it a driver bug, but an incorrect configuration of the
>>> kernel by the user.  I presume X86_FEATURE_PAT is required by the
>>> i915 driver
>> The driver ought to work fine without PAT (and hence without being
>> able to make WC mappings). It would use UC instead and be slow, but
>> it ought to work.
>
> I am not an expert, but I think the reason it failed on my box was
> because of the requirements of CI. Maybe the driver would fall back
> to UC if the add_taint_for_CI function did not halt the entire system
> in response to the failed test for PAT when trying to use WC mappings.
>
>>> and therefore the driver should refuse to disable
>>> it if the user requests to disable it and instead warn the user that
>>> the driver did not disable the feature, contrary to what the user
>>> requested with the nopat option.
>>>
>>> In any case, my test did not verify that when nopat is set in Linux 
>>> 5.16,
>>> the thread takes the same code path as when nopat is not set,
>>> so I am not totally sure that the reason nopat does not break the
>>> i915 driver in 5.16 is that static_cpu_has(X86_FEATURE_PAT)
>>> returns true even when nopat is set. I could test it with a custom
>>> log message in 5.16 if that is necessary.
>>>
>>> Are you saying it was wrong for
>>> to return true in 5.16 when the user requests nopat?
>> No, I'm not saying that. It was wrong for this construct to be used
>> in the driver, which was fixed for 5.17 (and which had caused the
>> regression I did observe, leading to the patch as a hopefully least
>> bad option).
>
> Hmm, the patch I used to fix my box with 5.17.6 used
> static_cpu_has(X86_FEATURE_PAT) so the driver could
> continue to configure the hardware using WC. This is the
> relevant part of the patch I used to fix my box, which includes
> extra error logs, (against Debian's official build of 5.17.6):
>
> --- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c    2022-05-09 
> 03:16:33.000000000 -0400
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c    2022-05-19 
> 15:55:40.339778818 -0400
> ...
> @@ -430,17 +434,23 @@
>          err = i915_gem_object_wait_moving_fence(obj, true);
>          if (err) {
>              ptr = ERR_PTR(err);
> +            DRM_ERROR("i915_gem_object_wait_moving_fence error, err = 
> %d\n", err);
>              goto err_unpin;
>          }
>
> -        if (GEM_WARN_ON(type == I915_MAP_WC && !pat_enabled()))
> +        if (GEM_WARN_ON(type == I915_MAP_WC &&
> +                !pat_enabled() && !static_cpu_has(X86_FEATURE_PAT))) {
> +            DRM_ERROR("type == I915_MAP_WC && !pat_enabled(), err = 
> %d\n", -ENODEV);
>              ptr = ERR_PTR(-ENODEV);
> +        }
>          else if (i915_gem_object_has_struct_page(obj))
>              ptr = i915_gem_object_map_page(obj, type);
>          else
>              ptr = i915_gem_object_map_pfn(obj, type);
> -        if (IS_ERR(ptr))
> +        if (IS_ERR(ptr)) {
> +            DRM_ERROR("IS_ERR(PTR) is true, returning a (ptr) error\n");
>              goto err_unpin;
> +        }
>
>          obj->mm.mapping = page_pack_bits(ptr, type);
>      }
>
> As you can see, adding the static_cpu_has(X86_FEATURE_PAT)
> function to the test for PAT restored the behavior of 5.16 on the
> Xen hypervisor to 5.17, and that is how I discovered the solution
> to this problem on 5.17 on my box.
>
>>> I think that is
>>> just permitting a bad configuration to break the driver that a
>>> well-written operating system should not allow. The i915 driver
>>> was, in my opinion, correctly ignoring the nopat option in 5.16
>>> because that option is not compatible with the hardware the
>>> i915 driver is trying to initialize and setup at boot time. At least
>>> that is my understanding now, but I will need to test it on 5.16
>>> to be sure I understand it correctly.
>>>
>>> Also, AFAICT, your patch would break the driver when the nopat
>>> option is set and only fix the regression introduced by bdd8b6c98239
>>> when nopat is not set on my box, so your patch would
>>> introduce a regression relative to Linux 5.16 and earlier for the
>>> case when nopat is set on my box. I think your point would
>>> be that it is not a regression if it is an incorrect user 
>>> configuration.
>> Again no - my view is that there's a separate, pre-existing issue
>> in the driver which was uncovered by the change. This may be a
>> perceived regression, but is imo different from a real one.
>
> Maybe it is only a perceived regression if nopat is set, but
> imo bdd8b6c98239 introduced a real regression in 5.17
> relative to 5.16 for the correctly and identically configured
> case when the nopat option is not set. That is why I still think
> it should be reverted and the fix backported to 5.17 until the
> regression for the case when nopat is not set is fixed. As I
> said before, the i915 driver relies on the memory subsyste
> to provide it with an accurate test for the x86 pat feature.
> The test the driver used in bdd8b6c98239 gives the i915 driver
> a false negative, and that caused a real regression when nopat
> is not set. bdd8b6c98239 can be re-applied if we apply your
> patch which corrects the false negative that pat_enabled() is
> currently providing the i915 driver with. That false negative
> from pat_enabled() is not an i915 bug, it is a bug in x86/pat.
>
> Chuck


  reply	other threads:[~2022-05-20 17:13 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-03 13:22 [PATCH 0/2] x86/pat: fix querying available caching modes Juergen Gross
2022-05-03 13:22 ` [PATCH 1/2] x86/pat: fix x86_has_pat_wp() Juergen Gross
2022-05-27 10:21   ` Juergen Gross
2022-06-14 15:09   ` Juergen Gross
2022-06-20  5:22     ` Thorsten Leemhuis
2022-06-20  5:30       ` Juergen Gross
2022-06-20  6:15         ` Thorsten Leemhuis
2022-06-20 10:26   ` Borislav Petkov
2022-06-20 10:41     ` Juergen Gross
2022-06-20 15:27       ` Dave Hansen
2022-06-20 15:34         ` Juergen Gross
2022-05-03 13:22 ` [PATCH 2/2] x86/pat: add functions to query specific cache mode availability Juergen Gross
2022-05-04  8:31   ` Jan Beulich
2022-05-04  9:14     ` Juergen Gross
2022-05-04  9:51       ` Jan Beulich
2022-05-20  4:43       ` Chuck Zmudzinski
2022-05-20  5:56         ` Chuck Zmudzinski
2022-05-20  6:05         ` Jan Beulich
2022-05-20  6:59           ` Chuck Zmudzinski
2022-05-20  8:30             ` Chuck Zmudzinski
2022-05-20  9:41               ` Jan Beulich
2022-05-20 13:33                 ` Chuck Zmudzinski
2022-05-20 14:06                   ` Jan Beulich
2022-05-20 14:48                     ` Chuck Zmudzinski
2022-05-21 10:47                       ` Thorsten Leemhuis
2022-05-24 18:32                         ` Chuck Zmudzinski
2022-05-25  7:45                           ` Thorsten Leemhuis
2022-05-25  8:04                             ` Juergen Gross
2022-05-25  8:37                             ` Jan Beulich
2022-05-25  8:51                               ` Thorsten Leemhuis
2022-05-25 19:25                             ` Chuck Zmudzinski
2022-05-20 15:46                     ` [REGRESSION} " Chuck Zmudzinski
2022-05-20 17:13                       ` Chuck Zmudzinski [this message]
2022-05-20 17:17                         ` Chuck Zmudzinski
2022-05-18 13:45   ` Christoph Hellwig
2022-05-20  2:15   ` Chuck Zmudzinski
2022-05-20  2:21     ` Chuck Zmudzinski
2022-05-21 13:24   ` Chuck Zmudzinski
2022-07-11  9:46 ` [tip: x86/urgent] x86/pat: Fix x86_has_pat_wp() tip-bot2 for Juergen Gross
2022-07-13 10:45 ` tip-bot2 for Juergen Gross
2022-07-13 10:52 ` tip-bot2 for Juergen Gross

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0a2e61ea-a73c-bbdc-e7c7-5110162b39bb@netscape.net \
    --to=brchuckz@netscape.net \
    --cc=airlied@linux.ie \
    --cc=bp@alien8.de \
    --cc=daniel@ffwll.ch \
    --cc=dave.hansen@linux.intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=hpa@zytor.com \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=jani.nikula@linux.intel.com \
    --cc=jbeulich@suse.com \
    --cc=jgross@suse.com \
    --cc=joonas.lahtinen@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=regressions@lists.linux.dev \
    --cc=rodrigo.vivi@intel.com \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=tvrtko.ursulin@linux.intel.com \
    --cc=x86@kernel.org \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox