[Intel-gfx] Regression in linux-next

* [Intel-gfx] Regression in linux-next
       [not found]   ` <SJ1PR11MB612980562220A376CA90E105B97E9@SJ1PR11MB6129.namprd11.prod.outlook.com>
@ 2023-07-25  6:42     ` Borah, Chaitanya Kumar
  2023-07-25 10:53       ` Tvrtko Ursulin
  2023-07-25 13:15       ` Alistair Popple
  0 siblings, 2 replies; 14+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-07-25  6:42 UTC (permalink / raw)
  To: apopple@nvidia.com
  Cc: Nikula, Jani, intel-gfx@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Kurmi, Suresh Kumar, Yedireswarapu, SaiX Nandan

Hello Alistair,

Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on the
linux-next repository.

On next-20230720 [2], we are seeing the following error:

<4>[   76.189375] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3271.D81.2307101805 07/10/2023
<4>[   76.202534] RIP: 0010:__mmu_notifier_register+0x40/0x210
<4>[   76.207804] Code: 1a 71 5a 01 85 c0 0f 85 ec 00 00 00 48 8b 85 30 01 00 00 48 85 c0 0f 84 04 01 00 00 8b 85 cc 00 00 00 85 c0 0f 8e bb 01 00 00 <49> 8b 44 24 10 48 83 78 38 00 74 1a 48 83 78 28 00 74 0c 0f 0b b8
<4>[   76.226368] RSP: 0018:ffffc900019d7ca8 EFLAGS: 00010202
<4>[   76.231549] RAX: 0000000000000001 RBX: 0000000000001000 RCX: 0000000000000001
<4>[   76.238613] RDX: 0000000000000000 RSI: ffffffff823ceb7b RDI: ffffffff823ee12d
<4>[   76.245680] RBP: ffff888102ec9b40 R08: 00000000ffffffff R09: 0000000000000001
<4>[   76.252747] R10: 0000000000000001 R11: ffff8881157cd2c0 R12: 0000000000000000
<4>[   76.259811] R13: ffff888102ec9c70 R14: ffffffffa07de500 R15: ffff888102ec9ce0
<4>[   76.266875] FS:  00007fbcabe11c00(0000) GS:ffff88846ec00000(0000) knlGS:0000000000000000
<4>[   76.274884] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   76.280578] CR2: 0000000000000010 CR3: 000000010d4c2005 CR4: 0000000000f70ee0
<4>[   76.287643] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[   76.294711] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
<4>[   76.301775] PKRU: 55555554
<4>[   76.304463] Call Trace:
<4>[   76.306893]  <TASK>
<4>[   76.308983]  ? __die_body+0x1a/0x60
<4>[   76.312444]  ? page_fault_oops+0x156/0x450
<4>[   76.316510]  ? do_user_addr_fault+0x65/0x980
<4>[   76.320747]  ? exc_page_fault+0x68/0x1a0
<4>[   76.324643]  ? asm_exc_page_fault+0x26/0x30
<4>[   76.328796]  ? __mmu_notifier_register+0x40/0x210
<4>[   76.333460]  ? __mmu_notifier_register+0x11c/0x210
<4>[   76.338206]  ? preempt_count_add+0x4c/0xa0
<4>[   76.342273]  mmu_notifier_register+0x30/0xe0
<4>[   76.346509]  mmu_interval_notifier_insert+0x74/0xb0
<4>[   76.351344]  i915_gem_userptr_ioctl+0x21a/0x320 [i915]
<4>[   76.356565]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
<4>[   76.362271]  drm_ioctl_kernel+0xb4/0x150
<4>[   76.366159]  drm_ioctl+0x21d/0x420
<4>[   76.369537]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
<4>[   76.375242]  ? find_held_lock+0x2b/0x80
<4>[   76.379046]  __x64_sys_ioctl+0x79/0xb0
<4>[   76.382766]  do_syscall_64+0x3c/0x90
<4>[   76.386312]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<4>[   76.391317] RIP: 0033:0x7fbcae63f3ab

A detailed log can be found in [3].

After bisecting the tree, the following patch seems to be causing the
regression.

commit 828fe4085cae77acb3abf7dd3d25b3ed6c560edf
Author: Alistair Popple <apopple@nvidia.com>
Date:   Wed Jul 19 22:18:46 2023 +1000

    mmu_notifiers: rename invalidate_range notifier

    There are two main use cases for mmu notifiers.  One is by KVM which uses
    mmu_notifier_invalidate_range_start()/end() to manage a software TLB.

    The other is to manage hardware TLBs which need to use the
    invalidate_range() callback because HW can establish new TLB entries at
    any time.  Hence using start/end() can lead to memory corruption as these
    callbacks happen too soon/late during page unmap.

    mmu notifier users should therefore either use the start()/end() callbacks
    or the invalidate_range() callbacks.  To make this usage clearer rename
    the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
    update documention.

    Link: https://lkml.kernel.org/r/9a02dde2f8ddaad2db31e54706a80c12d1817aaf.1689768831.git-series.apopple@nvidia.com


We also verified this by reverting the patch in the tree.

Could you please check why this patch causes the regression and if we can find
a solution for it soon?

[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20230720 
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20230720/bat-mtlp-6/dmesg0.txt


* Re: [Intel-gfx] Regression in linux-next
  2023-07-25  6:42     ` [Intel-gfx] Regression in linux-next Borah, Chaitanya Kumar
@ 2023-07-25 10:53       ` Tvrtko Ursulin
  2023-07-26  3:55         ` Borah, Chaitanya Kumar
  2023-07-25 13:15       ` Alistair Popple
  1 sibling, 1 reply; 14+ messages in thread
From: Tvrtko Ursulin @ 2023-07-25 10:53 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar, apopple@nvidia.com
  Cc: Nikula, Jani, intel-gfx@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Kurmi, Suresh Kumar, Yedireswarapu, SaiX Nandan


On 25/07/2023 07:42, Borah, Chaitanya Kumar wrote:
> Hello Alistair,
> 
> Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
>   
> This mail is regarding a regression we are seeing in our CI runs[1] on linux-next
> repository.
>   
> On next-20230720 [2], we are seeing the following error
> 
> <4>[   76.189375] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3271.D81.2307101805 07/10/2023
> <4>[   76.202534] RIP: 0010:__mmu_notifier_register+0x40/0x210
> <4>[   76.207804] Code: 1a 71 5a 01 85 c0 0f 85 ec 00 00 00 48 8b 85 30 01 00 00 48 85 c0 0f 84 04 01 00 00 8b 85 cc 00 00 00 85 c0 0f 8e bb 01 00 00 <49> 8b 44 24 10 48 83 78 38 00 74 1a 48 83 78 28 00 74 0c 0f 0b b8
> <4>[   76.226368] RSP: 0018:ffffc900019d7ca8 EFLAGS: 00010202
> <4>[   76.231549] RAX: 0000000000000001 RBX: 0000000000001000 RCX: 0000000000000001
> <4>[   76.238613] RDX: 0000000000000000 RSI: ffffffff823ceb7b RDI: ffffffff823ee12d
> <4>[   76.245680] RBP: ffff888102ec9b40 R08: 00000000ffffffff R09: 0000000000000001
> <4>[   76.252747] R10: 0000000000000001 R11: ffff8881157cd2c0 R12: 0000000000000000
> <4>[   76.259811] R13: ffff888102ec9c70 R14: ffffffffa07de500 R15: ffff888102ec9ce0
> <4>[   76.266875] FS:  00007fbcabe11c00(0000) GS:ffff88846ec00000(0000) knlGS:0000000000000000
> <4>[   76.274884] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[   76.280578] CR2: 0000000000000010 CR3: 000000010d4c2005 CR4: 0000000000f70ee0
> <4>[   76.287643] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <4>[   76.294711] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> <4>[   76.301775] PKRU: 55555554
> <4>[   76.304463] Call Trace:
> <4>[   76.306893]  <TASK>
> <4>[   76.308983]  ? __die_body+0x1a/0x60
> <4>[   76.312444]  ? page_fault_oops+0x156/0x450
> <4>[   76.316510]  ? do_user_addr_fault+0x65/0x980
> <4>[   76.320747]  ? exc_page_fault+0x68/0x1a0
> <4>[   76.324643]  ? asm_exc_page_fault+0x26/0x30
> <4>[   76.328796]  ? __mmu_notifier_register+0x40/0x210
> <4>[   76.333460]  ? __mmu_notifier_register+0x11c/0x210
> <4>[   76.338206]  ? preempt_count_add+0x4c/0xa0
> <4>[   76.342273]  mmu_notifier_register+0x30/0xe0
> <4>[   76.346509]  mmu_interval_notifier_insert+0x74/0xb0
> <4>[   76.351344]  i915_gem_userptr_ioctl+0x21a/0x320 [i915]
> <4>[   76.356565]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> <4>[   76.362271]  drm_ioctl_kernel+0xb4/0x150
> <4>[   76.366159]  drm_ioctl+0x21d/0x420
> <4>[   76.369537]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> <4>[   76.375242]  ? find_held_lock+0x2b/0x80
> <4>[   76.379046]  __x64_sys_ioctl+0x79/0xb0
> <4>[   76.382766]  do_syscall_64+0x3c/0x90
> <4>[   76.386312]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> <4>[   76.391317] RIP: 0033:0x7fbcae63f3ab
> 
> Details log can be found in [3].
> 
> After bisecting the tree, the following patch seems to be causing the
> regression.
> 
> commit 828fe4085cae77acb3abf7dd3d25b3ed6c560edf
> Author: Alistair Popple apopple@nvidia.com
> Date:   Wed Jul 19 22:18:46 2023 +1000
> 
>      mmu_notifiers: rename invalidate_range notifier
> 
>      There are two main use cases for mmu notifiers.  One is by KVM which uses
>      mmu_notifier_invalidate_range_start()/end() to manage a software TLB.
> 
>      The other is to manage hardware TLBs which need to use the
>      invalidate_range() callback because HW can establish new TLB entries at
>      any time.  Hence using start/end() can lead to memory corruption as these
>      callbacks happen too soon/late during page unmap.
> 
>      mmu notifier users should therefore either use the start()/end() callbacks
>      or the invalidate_range() callbacks.  To make this usage clearer rename
>      the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
>      update documention.
> 
>      Link: https://lkml.kernel.org/r/9a02dde2f8ddaad2db31e54706a80c12d1817aaf.1689768831.git-series.apopple@nvidia.com
> 
> 
> We also verified by reverting the patch in the tree.
> 
> Could you please check why this patch causes the regression and if we can find
> a solution for it soon?

Without checking out the whole tree, and looking only at this patch in isolation, it could be that it does not consider that a NULL subscription can be passed to mmu_notifier_register(), for instance from mmu_interval_notifier_insert(), which i915 calls. So the check the patch added to __mmu_notifier_register() causes a NULL pointer dereference:

@@ -616,6 +617,15 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
         mmap_assert_write_locked(mm);
         BUG_ON(atomic_read(&mm->mm_users) <= 0);
  
+       /*
+        * Subsystems should only register for invalidate_secondary_tlbs() or
+        * invalidate_range_start()/end() callbacks, not both.
+        */
+       if (WARN_ON_ONCE(subscription->ops->arch_invalidate_secondary_tlbs &&

---> subscription is NULL here <---

+                               (subscription->ops->invalidate_range_start ||
+                               subscription->ops->invalidate_range_end)))
+               return -EINVAL;
+

Regards,

Tvrtko

> 
> [1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20230720
> [3] https://intel-gfx-ci.01.org/tree/linux-next/next-20230720/bat-mtlp-6/dmesg0.txt


* Re: [Intel-gfx] Regression in linux-next
  2023-07-25 10:53       ` Tvrtko Ursulin
@ 2023-07-26  3:55         ` Borah, Chaitanya Kumar
  0 siblings, 0 replies; 14+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-07-26  3:55 UTC (permalink / raw)
  To: Tvrtko Ursulin, apopple@nvidia.com
  Cc: Nikula, Jani, intel-gfx@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Kurmi, Suresh Kumar, Yedireswarapu, SaiX Nandan

Hello Tvrtko,

Your analysis is correct. Alistair has sent a new patch set with a fix.

Thank you.

Regards

Chaitanya



* Re: [Intel-gfx] Regression in linux-next
  2023-07-25  6:42     ` [Intel-gfx] Regression in linux-next Borah, Chaitanya Kumar
  2023-07-25 10:53       ` Tvrtko Ursulin
@ 2023-07-25 13:15       ` Alistair Popple
  2023-07-26  3:53         ` Borah, Chaitanya Kumar
  1 sibling, 1 reply; 14+ messages in thread
From: Alistair Popple @ 2023-07-25 13:15 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar
  Cc: Nikula, Jani, intel-gfx@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Kurmi, Suresh Kumar, Yedireswarapu, SaiX Nandan, dan.carpenter


Thanks Chaitanya for the detailed report. Dan Carpenter also reported a
Smatch warning for this:

https://lore.kernel.org/linux-mm/38ed0627-1283-4da2-827a-e90484d7bd7d@moroto.mountain/

The patch below should fix the problem; I will respin the series to include
the fix.

---

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 63c8eb740af7..ec3b068cbbe6 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -621,9 +621,10 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	 * Subsystems should only register for invalidate_secondary_tlbs() or
 	 * invalidate_range_start()/end() callbacks, not both.
 	 */
-	if (WARN_ON_ONCE(subscription->ops->arch_invalidate_secondary_tlbs &&
-				(subscription->ops->invalidate_range_start ||
-				subscription->ops->invalidate_range_end)))
+	if (WARN_ON_ONCE(subscription &&
+			 (subscription->ops->arch_invalidate_secondary_tlbs &&
+			 (subscription->ops->invalidate_range_start ||
+			  subscription->ops->invalidate_range_end))))
 		return -EINVAL;
 
 	if (!mm->notifier_subscriptions) {


"Borah, Chaitanya Kumar" <chaitanya.kumar.borah@intel.com> writes:

> Hello Alistair,
>
> Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
>  
> This mail is regarding a regression we are seeing in our CI runs[1] on linux-next
> repository.
>  
> On next-20230720 [2], we are seeing the following error
>
> <4>[   76.189375] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3271.D81.2307101805 07/10/2023
> <4>[   76.202534] RIP: 0010:__mmu_notifier_register+0x40/0x210
> <4>[ 76.207804] Code: 1a 71 5a 01 85 c0 0f 85 ec 00 00 00 48 8b 85 30
> 01 00 00 48 85 c0 0f 84 04 01 00 00 8b 85 cc 00 00 00 85 c0 0f 8e bb
> 01 00 00 <49> 8b 44 24 10 48 83 78 38 00 74 1a 48 83 78 28 00 74 0c 0f
> 0b b8
> <4>[   76.226368] RSP: 0018:ffffc900019d7ca8 EFLAGS: 00010202
> <4>[   76.231549] RAX: 0000000000000001 RBX: 0000000000001000 RCX: 0000000000000001
> <4>[   76.238613] RDX: 0000000000000000 RSI: ffffffff823ceb7b RDI: ffffffff823ee12d
> <4>[   76.245680] RBP: ffff888102ec9b40 R08: 00000000ffffffff R09: 0000000000000001
> <4>[   76.252747] R10: 0000000000000001 R11: ffff8881157cd2c0 R12: 0000000000000000
> <4>[   76.259811] R13: ffff888102ec9c70 R14: ffffffffa07de500 R15: ffff888102ec9ce0
> <4>[   76.266875] FS:  00007fbcabe11c00(0000) GS:ffff88846ec00000(0000) knlGS:0000000000000000
> <4>[   76.274884] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[   76.280578] CR2: 0000000000000010 CR3: 000000010d4c2005 CR4: 0000000000f70ee0
> <4>[   76.287643] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <4>[   76.294711] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> <4>[   76.301775] PKRU: 55555554
> <4>[   76.304463] Call Trace:
> <4>[   76.306893]  <TASK>
> <4>[   76.308983]  ? __die_body+0x1a/0x60
> <4>[   76.312444]  ? page_fault_oops+0x156/0x450
> <4>[   76.316510]  ? do_user_addr_fault+0x65/0x980
> <4>[   76.320747]  ? exc_page_fault+0x68/0x1a0
> <4>[   76.324643]  ? asm_exc_page_fault+0x26/0x30
> <4>[   76.328796]  ? __mmu_notifier_register+0x40/0x210
> <4>[   76.333460]  ? __mmu_notifier_register+0x11c/0x210
> <4>[   76.338206]  ? preempt_count_add+0x4c/0xa0
> <4>[   76.342273]  mmu_notifier_register+0x30/0xe0
> <4>[   76.346509]  mmu_interval_notifier_insert+0x74/0xb0
> <4>[   76.351344]  i915_gem_userptr_ioctl+0x21a/0x320 [i915]
> <4>[   76.356565]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> <4>[   76.362271]  drm_ioctl_kernel+0xb4/0x150
> <4>[   76.366159]  drm_ioctl+0x21d/0x420
> <4>[   76.369537]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> <4>[   76.375242]  ? find_held_lock+0x2b/0x80
> <4>[   76.379046]  __x64_sys_ioctl+0x79/0xb0
> <4>[   76.382766]  do_syscall_64+0x3c/0x90
> <4>[   76.386312]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> <4>[   76.391317] RIP: 0033:0x7fbcae63f3ab
>
> Details log can be found in [3].
>
> After bisecting the tree, the following patch seems to be causing the
> regression.
>
> commit 828fe4085cae77acb3abf7dd3d25b3ed6c560edf
> Author: Alistair Popple apopple@nvidia.com
> Date:   Wed Jul 19 22:18:46 2023 +1000
>
>     mmu_notifiers: rename invalidate_range notifier
>
>     There are two main use cases for mmu notifiers.  One is by KVM which uses
>     mmu_notifier_invalidate_range_start()/end() to manage a software TLB.
>
>     The other is to manage hardware TLBs which need to use the
>     invalidate_range() callback because HW can establish new TLB entries at
>     any time.  Hence using start/end() can lead to memory corruption as these
>     callbacks happen too soon/late during page unmap.
>
>     mmu notifier users should therefore either use the start()/end() callbacks
>     or the invalidate_range() callbacks.  To make this usage clearer rename
>     the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
>     update documention.
>
>     Link: https://lkml.kernel.org/r/9a02dde2f8ddaad2db31e54706a80c12d1817aaf.1689768831.git-series.apopple@nvidia.com
>
>
> We also verified by reverting the patch in the tree.
>
> Could you please check why this patch causes the regression and if we can find
> a solution for it soon?
>
> [1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20230720 
> [3] https://intel-gfx-ci.01.org/tree/linux-next/next-20230720/bat-mtlp-6/dmesg0.txt



* Re: [Intel-gfx] Regression in linux-next
  2023-07-25 13:15       ` Alistair Popple
@ 2023-07-26  3:53         ` Borah, Chaitanya Kumar
  0 siblings, 0 replies; 14+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-07-26  3:53 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Nikula, Jani, intel-gfx@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Kurmi, Suresh Kumar, Yedireswarapu, SaiX Nandan,
	dan.carpenter@linaro.org

Hello Alistair,

Thank you for the quick fix.

Regards

Chaitanya




* [Intel-gfx] ✗ Fi.CI.BUILD: failure for Regression in linux-next
       [not found] <SJ1PR11MB6129592BDF5D06949F99816CB95B9@SJ1PR11MB6129.namprd11.prod.outlook.com>
       [not found] ` <SJ1PR11MB6129A7F5C08E2C47748F2BA5B97E9@SJ1PR11MB6129.namprd11.prod.outlook.com>
@ 2023-07-26 16:09 ` Patchwork
  1 sibling, 0 replies; 14+ messages in thread
From: Patchwork @ 2023-07-26 16:09 UTC (permalink / raw)
  To: Alistair Popple; +Cc: intel-gfx

== Series Details ==

Series: Regression in linux-next
URL   : https://patchwork.freedesktop.org/series/121356/
State : failure

== Summary ==

Error: patch https://patchwork.freedesktop.org/api/1.0/series/121356/revisions/1/mbox/ not applied
Applying: Regression in linux-next
Using index info to reconstruct a base tree...
M	mm/mmu_notifier.c
Falling back to patching base and 3-way merge...
Auto-merging mm/mmu_notifier.c
CONFLICT (content): Merge conflict in mm/mmu_notifier.c
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Regression in linux-next
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Build failed, no error log produced

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Intel-gfx] Regression in linux-next
@ 2023-10-05 15:58 Borah, Chaitanya Kumar
  2023-10-06 20:30 ` Wysocki, Rafael J
  0 siblings, 1 reply; 14+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-05 15:58 UTC (permalink / raw)
  To: Wysocki, Rafael J; +Cc: intel-gfx@lists.freedesktop.org, Kurmi, Suresh Kumar


Hello Rafael,


Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on the linux-next repository.



On next-20231003 [2], we are seeing the following error


```````````````````````````````````````````````````````````````````````````````
<4>[   14.093075] ------------[ cut here ]------------
<4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
<4>[   14.106977] Modules linked in:
<4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W          6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
<4>[   14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
<4>[   14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
<4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90



A detailed log can be found in [3].



After bisecting the tree, the following patch [4] seems to be causing the regression.


commit d5ea889246b112e228433a5f27f57af90ca0c1fb
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Thu Sep 21 20:02:59 2023 +0200

    ACPI: thermal: Do not use trip indices for cooling device binding

    Rearrange the ACPI thermal driver's callback functions used for cooling
    device binding and unbinding, acpi_thermal_bind_cooling_device() and
    acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
    pointers instead of trip indices which is more straightforward and allows
    the driver to become independent of the ordering of trips in the thermal
    zone structure.

    The general functionality is not expected to be changed.

    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>



We also verified by moving the head of the tree to the previous commit.



Could you please check why this patch causes the regression and if we can find a solution for it soon?


[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Intel-gfx] Regression in linux-next
  2023-10-05 15:58 [Intel-gfx] " Borah, Chaitanya Kumar
@ 2023-10-06 20:30 ` Wysocki, Rafael J
  2023-10-09  5:10   ` Borah, Chaitanya Kumar
  0 siblings, 1 reply; 14+ messages in thread
From: Wysocki, Rafael J @ 2023-10-06 20:30 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar
  Cc: intel-gfx@lists.freedesktop.org, Kurmi, Suresh Kumar


Hi,

On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
>
> Hello Rafael,
>
> Hope you are doing well. I am Chaitanya from the linux graphics team 
> in Intel.
>
> This mail is regarding a regression we are seeing in our CI runs[1] on 
> linux-next repository.
>
Thanks for the report, I think that this is a lockdep assertion failing.

If that is correct, it should be straightforward to fix.

I'll take care of this early next week.

Thanks!
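
For context, a lockdep assertion is a debug check that a given lock is held at the point of the call; when it is not, the kernel prints a WARNING with a backtrace and keeps running, which would match the trace quoted below. A minimal userspace sketch in C of that idea follows. It assumes nothing about the actual thermal core code: `struct zone`, `zone_lock()`, `zone_unlock()`, `zone_assert_held()` and `zone_count_trips()` are hypothetical names, and a plain flag stands in for the real lock.

```c
/* Userspace analogy only; this is not the kernel's lockdep implementation. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an object whose fields are protected by a lock. */
struct zone {
	bool locked;    /* a plain flag plays the role of the lock state */
	int num_trips;  /* data that should only be read with the lock held */
	int warnings;   /* number of times the assertion has fired */
};

static void zone_lock(struct zone *z)
{
	z->locked = true;
}

static void zone_unlock(struct zone *z)
{
	z->locked = false;
}

/* Warn, but do not abort, when the caller does not hold the lock. */
static void zone_assert_held(struct zone *z, const char *caller)
{
	if (!z->locked) {
		z->warnings++;
		fprintf(stderr, "WARNING: %s called without the zone lock\n", caller);
	}
}

/* Accessor that documents and checks its locking requirement. */
static int zone_count_trips(struct zone *z)
{
	zone_assert_held(z, __func__);
	return z->num_trips;
}
```

In this sketch, calling `zone_count_trips()` without a prior `zone_lock()` still returns the value but increments `warnings`. The usual remedies for such a warning are to take the lock in the caller or to drop the assertion where it is not actually required.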


> On next-20231003 [2], we are seeing the following error
>
> ```````````````````````````````````````````````````````````````````````````````
> <4>[   14.093075] ------------[ cut here ]------------
> <4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
> <4>[   14.106977] Modules linked in:
> <4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W          6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
> <4>[   14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
> <4>[   14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
> <4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
>
> Details log can be found in [3].
>
> After bisecting the tree, the following patch [4] seems to be causing the regression.
>
> commit d5ea889246b112e228433a5f27f57af90ca0c1fb
> Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Date:   Thu Sep 21 20:02:59 2023 +0200
>
>     ACPI: thermal: Do not use trip indices for cooling device binding
>
>     Rearrange the ACPI thermal driver's callback functions used for cooling
>     device binding and unbinding, acpi_thermal_bind_cooling_device() and
>     acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
>     pointers instead of trip indices which is more straightforward and allows
>     the driver to become independent of the ordering of trips in the thermal
>     zone structure.
>
>     The general functionality is not expected to be changed.
>
>     Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>     Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
>
> We also verified by moving the head of the tree to the previous commit.
>
> Could you please check why this patch causes the regression and if we can find a solution for it soon?
>
> [1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
> [3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
> [4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb
>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Intel-gfx] Regression in linux-next
  2023-10-06 20:30 ` Wysocki, Rafael J
@ 2023-10-09  5:10   ` Borah, Chaitanya Kumar
  2023-10-09 19:23     ` Wysocki, Rafael J
  0 siblings, 1 reply; 14+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-09  5:10 UTC (permalink / raw)
  To: Wysocki, Rafael J; +Cc: intel-gfx@lists.freedesktop.org, Kurmi, Suresh Kumar

Hello Rafael

>Thanks for the report, I think that this is a lockdep assertion failing.
>If that is correct, it should be straightforward to fix.
>I'll take care of this early next week.
>Thanks!

Thank you for your response.  Please let us know when a fix is available.

Regards

Chaitanya

From: Wysocki, Rafael J <rafael.j.wysocki@intel.com> 
Sent: Saturday, October 7, 2023 2:01 AM
To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
Subject: Re: Regression in linux-next

Hi,
On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
Hello Rafael,

Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.

Thanks for the report, I think that this is a lockdep assertion failing.
If that is correct, it should be straightforward to fix.
I'll take care of this early next week.
Thanks!

On next-20231003 [2], we are seeing the following error

```````````````````````````````````````````````````````````````````````````````
<4>[   14.093075] ------------[ cut here ]------------
<4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
<4>[   14.106977] Modules linked in:
<4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W          6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
<4>[   14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
<4>[   14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
<4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90

Details log can be found in [3].

After bisecting the tree, the following patch [4] seems to be causing the regression.

commit d5ea889246b112e228433a5f27f57af90ca0c1fb
Author: Rafael J. Wysocki mailto:rafael.j.wysocki@intel.com
Date:   Thu Sep 21 20:02:59 2023 +0200

    ACPI: thermal: Do not use trip indices for cooling device binding

    Rearrange the ACPI thermal driver's callback functions used for cooling
    device binding and unbinding, acpi_thermal_bind_cooling_device() and
    acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
    pointers instead of trip indices which is more straightforward and allows
    the driver to become independent of the ordering of trips in the thermal
    zone structure.

    The general functionality is not expected to be changed.

    Signed-off-by: Rafael J. Wysocki mailto:rafael.j.wysocki@intel.com
    Reviewed-by: Daniel Lezcano mailto:daniel.lezcano@linaro.org

We also verified by moving the head of the tree to the previous commit.

Could you please check why this patch causes the regression and if we can find a solution for it soon?

[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Intel-gfx] Regression in linux-next
  2023-10-09  5:10   ` Borah, Chaitanya Kumar
@ 2023-10-09 19:23     ` Wysocki, Rafael J
  2023-10-11  4:00       ` Borah, Chaitanya Kumar
  0 siblings, 1 reply; 14+ messages in thread
From: Wysocki, Rafael J @ 2023-10-09 19:23 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar
  Cc: intel-gfx@lists.freedesktop.org, Kurmi, Suresh Kumar

Hi,

On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
> Hello Rafael
>
>> Thanks for the report, I think that this is a lockdep assertion failing.
>> If that is correct, it should be straightforward to fix.
>> I'll take care of this early next week.
>> Thanks!
> Thank you for your response.  Please let us know when a fix is available.

It should be fixed in linux-next from today, by this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b44444027ce7714f309e96b804b7fb088a40d708

Thanks!


> From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> Sent: Saturday, October 7, 2023 2:01 AM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
> Subject: Re: Regression in linux-next
>
> Hi,
> On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
> Hello Rafael,
>   
> Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
> This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
>   
> Thanks for the report, I think that this is a lockdep assertion failing.
> If that is correct, it should be straightforward to fix.
> I'll take care of this early next week.
> Thanks!
>
> On next-20231003 [2], we are seeing the following error
>   
> ```````````````````````````````````````````````````````````````````````````````
> <4>[   14.093075] ------------[ cut here ]------------
> <4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
> <4>[   14.106977] Modules linked in:
> <4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W          6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
> <4>[   14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
> <4>[   14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
> <4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
>   
> Details log can be found in [3].
>   
> After bisecting the tree, the following patch [4] seems to be causing the regression.
>   
> commit d5ea889246b112e228433a5f27f57af90ca0c1fb
> Author: Rafael J. Wysocki mailto:rafael.j.wysocki@intel.com
> Date:   Thu Sep 21 20:02:59 2023 +0200
>   
>      ACPI: thermal: Do not use trip indices for cooling device binding
>   
>      Rearrange the ACPI thermal driver's callback functions used for cooling
>      device binding and unbinding, acpi_thermal_bind_cooling_device() and
>      acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
>      pointers instead of trip indices which is more straightforward and allows
>      the driver to become independent of the ordering of trips in the thermal
>      zone structure.
>   
>      The general functionality is not expected to be changed.
>   
>      Signed-off-by: Rafael J. Wysocki mailto:rafael.j.wysocki@intel.com
>      Reviewed-by: Daniel Lezcano mailto:daniel.lezcano@linaro.org
>   
> We also verified by moving the head of the tree to the previous commit.
>   
> Could you please check why this patch causes the regression and if we can find a solution for it soon?
>   
> [1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
> [3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
> [4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Intel-gfx] Regression in linux-next
  2023-10-09 19:23     ` Wysocki, Rafael J
@ 2023-10-11  4:00       ` Borah, Chaitanya Kumar
  2023-10-11 16:14         ` Wysocki, Rafael J
  0 siblings, 1 reply; 14+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-11  4:00 UTC (permalink / raw)
  To: Wysocki, Rafael J; +Cc: intel-gfx@lists.freedesktop.org, Kurmi, Suresh Kumar

Hello Rafael,

> -----Original Message-----
> From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> Sent: Tuesday, October 10, 2023 12:54 AM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
> Subject: Re: Regression in linux-next
> 
> Hi,
> 
> On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
> > Hello Rafael
> >
> >> Thanks for the report, I think that this is a lockdep assertion failing.
> >> If that is correct, it should be straightforward to fix.
> >> I'll take care of this early next week.
> >> Thanks!
> > Thank you for your response.  Please let us know when a fix is available.
> 
> It should be fixed in linux-next from today, by this commit:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b44444027ce7714f309e96b804b7fb088a40d708
> 
> Thanks!

Thanks a lot for the fix. It seems to have fixed the issue on most of the machines, but we are still seeing a similar problem on a few of them.

This one has a different call stack but seems to come from the same thermal subsystem. Full logs are in [1].

<4>[    4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
<4>[    4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
<4>[    4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
<4>[    4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
<4>[    4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
<4>[    4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
<4>[    4.392069] RSP: 0018:ffffc9000156bda8 EFLAGS: 00010246
<4>[    4.392073] RAX: 0000000000000000 RBX: ffff888103828ae8 RCX: 0000000000000001
<4>[    4.392075] RDX: 0000000080000000 RSI: ffffffff823de5ab RDI: ffffffff823fdfba
<4>[    4.392078] RBP: ffff888103a88800 R08: ffff888103828ae8 R09: 0000000000000001
<4>[    4.392080] R10: 0000000000000001 R11: ffff88811494d3c0 R12: ffff888103a88818
<4>[    4.392082] R13: ffff8881108bfa00 R14: ffff888103794408 R15: 0000000000000001
<4>[    4.392084] FS:  00007f1f0d6d28c0(0000) GS:ffff88822e680000(0000) knlGS:0000000000000000
<4>[    4.392087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[    4.392089] CR2: 000055857c50b750 CR3: 0000000111efa005 CR4: 00000000003706f0
<4>[    4.392091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[    4.392093] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[    4.392095] Call Trace:
<4>[    4.392097]  <TASK>
<4>[    4.392100]  ? __warn+0x7f/0x170
<4>[    4.392104]  ? thermal_zone_trip_id+0x61/0x70
<4>[    4.392109]  ? report_bug+0x1f8/0x200
<4>[    4.392116]  ? handle_bug+0x3c/0x70
<4>[    4.392119]  ? exc_invalid_op+0x18/0x70
<4>[    4.392123]  ? asm_exc_invalid_op+0x1a/0x20
<4>[    4.392133]  ? thermal_zone_trip_id+0x61/0x70
<4>[    4.392137]  ? thermal_zone_trip_id+0x5d/0x70
<4>[    4.392141]  trip_point_show+0x18/0x40
<4>[    4.392145]  dev_attr_show+0x15/0x60
<4>[    4.392149]  sysfs_kf_seq_show+0xb5/0x100
<4>[    4.392154]  seq_read_iter+0x111/0x450
<4>[    4.392158]  ? check_object+0x133/0x320
<4>[    4.392164]  vfs_read+0x20d/0x300
<4>[    4.392175]  ksys_read+0x64/0xe0
<4>[    4.392180]  do_syscall_64+0x3c/0x90
<4>[    4.392183]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<4>[    4.392187] RIP: 0033:0x7f1f0e193392

Can you please check what could be the reason for this issue?

[1] https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/boot0.txt

Regards

Chaitanya




> 
> 
> > From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> > Sent: Saturday, October 7, 2023 2:01 AM
> > To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> > Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> > <suresh.kumar.kurmi@intel.com>; Saarinen, Jani
> > <jani.saarinen@intel.com>
> > Subject: Re: Regression in linux-next
> >
> > Hi,
> > On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
> > Hello Rafael,
> >
> > Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
> > This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
> >
> > Thanks for the report, I think that this is a lockdep assertion failing.
> > If that is correct, it should be straightforward to fix.
> > I'll take care of this early next week.
> > Thanks!
> >
> > On next-20231003 [2], we are seeing the following error
> >
> > ```````````````````````````````````````````````````````````````````````````````
> > <4>[   14.093075] ------------[ cut here ]------------
> > <4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
> > <4>[   14.106977] Modules linked in:
> > <4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W          6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
> > <4>[   14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
> > <4>[   14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
> > <4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
> >
> > Details log can be found in [3].
> >
> > After bisecting the tree, the following patch [4] seems to be causing the regression.
> >
> > commit d5ea889246b112e228433a5f27f57af90ca0c1fb
> > Author: Rafael J. Wysocki mailto:rafael.j.wysocki@intel.com
> > Date:   Thu Sep 21 20:02:59 2023 +0200
> >
> >      ACPI: thermal: Do not use trip indices for cooling device binding
> >
> >      Rearrange the ACPI thermal driver's callback functions used for cooling
> >      device binding and unbinding, acpi_thermal_bind_cooling_device() and
> >      acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
> >      pointers instead of trip indices which is more straightforward and allows
> >      the driver to become independent of the ordering of trips in the thermal
> >      zone structure.
> >
> >      The general functionality is not expected to be changed.
> >
> >      Signed-off-by: Rafael J. Wysocki mailto:rafael.j.wysocki@intel.com
> >      Reviewed-by: Daniel Lezcano mailto:daniel.lezcano@linaro.org
> >
> > We also verified by moving the head of the tree to the previous commit.
> >
> > Could you please check why this patch causes the regression and if we can find a solution for it soon?
> >
> > [1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
> > [3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
> > [4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Intel-gfx] Regression in linux-next
  2023-10-11  4:00       ` Borah, Chaitanya Kumar
@ 2023-10-11 16:14         ` Wysocki, Rafael J
  2023-10-11 16:49           ` Borah, Chaitanya Kumar
  0 siblings, 1 reply; 14+ messages in thread
From: Wysocki, Rafael J @ 2023-10-11 16:14 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar
  Cc: intel-gfx@lists.freedesktop.org, Kurmi, Suresh Kumar

Hi,

On 10/11/2023 6:00 AM, Borah, Chaitanya Kumar wrote:
> Hello Rafael,
>
>> -----Original Message-----
>> From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
>> Sent: Tuesday, October 10, 2023 12:54 AM
>> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
>> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
>> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
>> Subject: Re: Regression in linux-next
>>
>> Hi,
>>
>> On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
>>> Hello Rafael
>>>
>>>> Thanks for the report, I think that this is a lockdep assertion failing.
>>>> If that is correct, it should be straightforward to fix.
>>>> I'll take care of this early next week.
>>>> Thanks!
>>> Thank you for your response.  Please let us know when a fix is available.
>> It should be fixed in linux-next from today, by this commit:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b44444027ce7714f309e96b804b7fb088a40d708
>>
>> Thanks!
> Thanks a lot for the fix. This seems to have fixed the issue in most of the machines but we are still seeing a similar problem in few of the machines.

Thanks for reporting this!


> This has a different call stack but seems to be from the same thermal subsystem. Full logs in [1]
>
> <4>[    4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
> <4>[    4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
> <4>[    4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
> <4>[    4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
> <4>[    4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
> <4>[    4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
> <4>[    4.392069] RSP: 0018:ffffc9000156bda8 EFLAGS: 00010246
> <4>[    4.392073] RAX: 0000000000000000 RBX: ffff888103828ae8 RCX: 0000000000000001
> <4>[    4.392075] RDX: 0000000080000000 RSI: ffffffff823de5ab RDI: ffffffff823fdfba
> <4>[    4.392078] RBP: ffff888103a88800 R08: ffff888103828ae8 R09: 0000000000000001
> <4>[    4.392080] R10: 0000000000000001 R11: ffff88811494d3c0 R12: ffff888103a88818
> <4>[    4.392082] R13: ffff8881108bfa00 R14: ffff888103794408 R15: 0000000000000001
> <4>[    4.392084] FS:  00007f1f0d6d28c0(0000) GS:ffff88822e680000(0000) knlGS:0000000000000000
> <4>[    4.392087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[    4.392089] CR2: 000055857c50b750 CR3: 0000000111efa005 CR4: 00000000003706f0
> <4>[    4.392091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <4>[    4.392093] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> <4>[    4.392095] Call Trace:
> <4>[    4.392097]  <TASK>
> <4>[    4.392100]  ? __warn+0x7f/0x170
> <4>[    4.392104]  ? thermal_zone_trip_id+0x61/0x70
> <4>[    4.392109]  ? report_bug+0x1f8/0x200
> <4>[    4.392116]  ? handle_bug+0x3c/0x70
> <4>[    4.392119]  ? exc_invalid_op+0x18/0x70
> <4>[    4.392123]  ? asm_exc_invalid_op+0x1a/0x20
> <4>[    4.392133]  ? thermal_zone_trip_id+0x61/0x70
> <4>[    4.392137]  ? thermal_zone_trip_id+0x5d/0x70
> <4>[    4.392141]  trip_point_show+0x18/0x40
> <4>[    4.392145]  dev_attr_show+0x15/0x60
> <4>[    4.392149]  sysfs_kf_seq_show+0xb5/0x100
> <4>[    4.392154]  seq_read_iter+0x111/0x450
> <4>[    4.392158]  ? check_object+0x133/0x320
> <4>[    4.392164]  vfs_read+0x20d/0x300
> <4>[    4.392175]  ksys_read+0x64/0xe0
> <4>[    4.392180]  do_syscall_64+0x3c/0x90
> <4>[    4.392183]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> <4>[    4.392187] RIP: 0033:0x7f1f0e193392
>
> Can you please check what could be the reason for this issue?

Well, one more unuseful lockdep assertion has been added recently to the 
thermal core, sorry about that.

This commit

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=108ffd12be24ba1d74b3314df8db32a0a6d55ba5

that will be merged into linux-next tomorrow if all goes well, should 
address this.

Thanks!


> [1] https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/boot0.txt
>
> Regards
>
> Chaitanya
>
>
>
>
>>
>>> From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
>>> Sent: Saturday, October 7, 2023 2:01 AM
>>> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
>>> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
>>> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani
>>> <jani.saarinen@intel.com>
>>> Subject: Re: Regression in linux-next
>>>
>>> Hi,
>>> On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
>>> Hello Rafael,
>>>
>>> Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
>>> This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
>>>
>>> Thanks for the report, I think that this is a lockdep assertion failing.
>>> If that is correct, it should be straightforward to fix.
>>> I'll take care of this early next week.
>>> Thanks!
>>>
>>> On next-20231003 [2], we are seeing the following error
>>>
>>> ```````````````````````````````````````````````````````````````````````````````
>>> <4>[   14.093075] ------------[ cut here ]------------
>>> <4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
>>> <4>[   14.106977] Modules linked in:
>>> <4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W          6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
>>> <4>[   14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
>>> <4>[   14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
>>> <4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
>>>
>>> Details log can be found in [3].
>>>
>>> After bisecting the tree, the following patch [4] seems to be causing the regression.
>>>
>>> commit d5ea889246b112e228433a5f27f57af90ca0c1fb
>>> Author: Rafael J. Wysocki mailto:rafael.j.wysocki@intel.com
>>> Date:   Thu Sep 21 20:02:59 2023 +0200
>>>
>>>       ACPI: thermal: Do not use trip indices for cooling device binding
>>>
>>>       Rearrange the ACPI thermal driver's callback functions used for cooling
>>>       device binding and unbinding, acpi_thermal_bind_cooling_device() and
>>>       acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
>>>       pointers instead of trip indices which is more straightforward and allows
>>>       the driver to become independent of the ordering of trips in the thermal
>>>       zone structure.
>>>
>>>       The general functionality is not expected to be changed.
>>>
>>>       Signed-off-by: Rafael J. Wysocki mailto:rafael.j.wysocki@intel.com
>>>       Reviewed-by: Daniel Lezcano mailto:daniel.lezcano@linaro.org
>>>
>>> We also verified by moving the head of the tree to the previous commit.
>>>
>>> Could you please check why this patch causes the regression and if we can find a solution for it soon?
>>>
>>> [1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
>>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
>>> [3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
>>> [4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Intel-gfx] Regression in linux-next
  2023-10-11 16:14         ` Wysocki, Rafael J
@ 2023-10-11 16:49           ` Borah, Chaitanya Kumar
  2023-10-13 14:05             ` Borah, Chaitanya Kumar
  0 siblings, 1 reply; 14+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-11 16:49 UTC (permalink / raw)
  To: Wysocki, Rafael J; +Cc: intel-gfx@lists.freedesktop.org, Kurmi, Suresh Kumar

Hello Rafael,

> -----Original Message-----
> From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> Sent: Wednesday, October 11, 2023 9:44 PM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
> Subject: Re: Regression in linux-next
> 
> Hi,
> 
> On 10/11/2023 6:00 AM, Borah, Chaitanya Kumar wrote:
> > Hello Rafael,
> >
> >> -----Original Message-----
> >> From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> >> Sent: Tuesday, October 10, 2023 12:54 AM
> >> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> >> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> >> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani
> >> <jani.saarinen@intel.com>
> >> Subject: Re: Regression in linux-next
> >>
> >> Hi,
> >>
> >> On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
> >>> Hello Rafael
> >>>
> >>>> Thanks for the report, I think that this is a lockdep assertion failing.
> >>>> If that is correct, it should be straightforward to fix.
> >>>> I'll take care of this early next week.
> >>>> Thanks!
> >>> Thank you for your response.  Please let us know when a fix is available.
> >> It should be fixed in linux-next from today, by this commit:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b44444027ce7714f309e96b804b7fb088a40d708
> >>
> >> Thanks!
> > Thanks a lot for the fix. This seems to have fixed the issue in most of the machines but we are still seeing a similar problem in a few of the machines.
> 
> Thanks for reporting this!
> 
> 
> > This has a different call stack but seems to be from the same thermal
> > subsystem. Full logs in [1]
> >
> > <4>[    4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
> > <4>[    4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
> > <4>[    4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
> > <4>[    4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
> > <4>[    4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
> > <4>[    4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
> > <4>[    4.392069] RSP: 0018:ffffc9000156bda8 EFLAGS: 00010246
> > <4>[    4.392073] RAX: 0000000000000000 RBX: ffff888103828ae8 RCX: 0000000000000001
> > <4>[    4.392075] RDX: 0000000080000000 RSI: ffffffff823de5ab RDI: ffffffff823fdfba
> > <4>[    4.392078] RBP: ffff888103a88800 R08: ffff888103828ae8 R09: 0000000000000001
> > <4>[    4.392080] R10: 0000000000000001 R11: ffff88811494d3c0 R12: ffff888103a88818
> > <4>[    4.392082] R13: ffff8881108bfa00 R14: ffff888103794408 R15: 0000000000000001
> > <4>[    4.392084] FS:  00007f1f0d6d28c0(0000) GS:ffff88822e680000(0000) knlGS:0000000000000000
> > <4>[    4.392087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4>[    4.392089] CR2: 000055857c50b750 CR3: 0000000111efa005 CR4: 00000000003706f0
> > <4>[    4.392091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > <4>[    4.392093] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > <4>[    4.392095] Call Trace:
> > <4>[    4.392097]  <TASK>
> > <4>[    4.392100]  ? __warn+0x7f/0x170
> > <4>[    4.392104]  ? thermal_zone_trip_id+0x61/0x70
> > <4>[    4.392109]  ? report_bug+0x1f8/0x200
> > <4>[    4.392116]  ? handle_bug+0x3c/0x70
> > <4>[    4.392119]  ? exc_invalid_op+0x18/0x70
> > <4>[    4.392123]  ? asm_exc_invalid_op+0x1a/0x20
> > <4>[    4.392133]  ? thermal_zone_trip_id+0x61/0x70
> > <4>[    4.392137]  ? thermal_zone_trip_id+0x5d/0x70
> > <4>[    4.392141]  trip_point_show+0x18/0x40
> > <4>[    4.392145]  dev_attr_show+0x15/0x60
> > <4>[    4.392149]  sysfs_kf_seq_show+0xb5/0x100
> > <4>[    4.392154]  seq_read_iter+0x111/0x450
> > <4>[    4.392158]  ? check_object+0x133/0x320
> > <4>[    4.392164]  vfs_read+0x20d/0x300
> > <4>[    4.392175]  ksys_read+0x64/0xe0
> > <4>[    4.392180]  do_syscall_64+0x3c/0x90
> > <4>[    4.392183]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > <4>[    4.392187] RIP: 0033:0x7f1f0e193392
> >
> > Can you please check what could be the reason for this issue?
> 
> Well, one more unuseful lockdep assertion has been added recently to the
> thermal core, sorry about that.
> 
> This commit
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=108ffd12be24ba1d74b3314df8db32a0a6d55ba5
> 
> that will be merged into linux-next tomorrow if all goes well, should address
> this.

Thank you for the fix. We will wait for it to get merged in linux-next.

Regards

Chaitanya

> 
> Thanks!
> 
> 
> > [1] https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/boot0.txt
> >
> > Regards
> >
> > Chaitanya
> >
> >
> >
> >
> >>
> >>> From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> >>> Sent: Saturday, October 7, 2023 2:01 AM
> >>> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> >>> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> >>> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani
> >>> <jani.saarinen@intel.com>
> >>> Subject: Re: Regression in linux-next
> >>>
> >>> Hi,
> >>> On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
> >>> Hello Rafael,
> >>>
> >>> Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
> >>> This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
> >>> Thanks for the report, I think that this is a lockdep assertion failing.
> >>> If that is correct, it should be straightforward to fix.
> >>> I'll take care of this early next week.
> >>> Thanks!
> >>>
> >>> On next-20231003 [2], we are seeing the following error
> >>>
> >>> <4>[   14.093075] ------------[ cut here ]------------
> >>> <4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
> >>> <4>[   14.106977] Modules linked in:
> >>> <4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W 6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
> >>> <4>[   14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
> >>> <4>[   14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
> >>> <4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
> >>>
> >>> Details log can be found in [3].
> >>>
> >>> After bisecting the tree, the following patch [4] seems to be causing the regression.
> >>> commit d5ea889246b112e228433a5f27f57af90ca0c1fb
> >>> Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>> Date:   Thu Sep 21 20:02:59 2023 +0200
> >>>
> >>>       ACPI: thermal: Do not use trip indices for cooling device binding
> >>>
> >>>       Rearrange the ACPI thermal driver's callback functions used for cooling
> >>>       device binding and unbinding, acpi_thermal_bind_cooling_device() and
> >>>       acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
> >>>       pointers instead of trip indices which is more straightforward and allows
> >>>       the driver to become independent of the ordering of trips in the thermal
> >>>       zone structure.
> >>>
> >>>       The general functionality is not expected to be changed.
> >>>
> >>>       Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>       Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> >>>
> >>> We also verified by moving the head of the tree to the previous commit.
> >>>
> >>> Could you please check why this patch causes the regression and if we can find a solution for it soon?
> >>> [1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
> >>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
> >>> [3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
> >>> [4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb


* Re: [Intel-gfx] Regression in linux-next
  2023-10-11 16:49           ` Borah, Chaitanya Kumar
@ 2023-10-13 14:05             ` Borah, Chaitanya Kumar
  0 siblings, 0 replies; 14+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-13 14:05 UTC (permalink / raw)
  To: Wysocki, Rafael J; +Cc: intel-gfx@lists.freedesktop.org, Kurmi, Suresh Kumar

Hello Rafael,

> -----Original Message-----
> From: Borah, Chaitanya Kumar
> Sent: Wednesday, October 11, 2023 10:19 PM
> To: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> <Suresh.Kumar.Kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
> Subject: RE: Regression in linux-next
> 
> Hello Rafael,
> 
> > -----Original Message-----
> > From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> > Sent: Wednesday, October 11, 2023 9:44 PM
> > To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> > Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> > <suresh.kumar.kurmi@intel.com>; Saarinen, Jani
> > <jani.saarinen@intel.com>
> > Subject: Re: Regression in linux-next
> >
> > Hi,
> >
> > On 10/11/2023 6:00 AM, Borah, Chaitanya Kumar wrote:
> > > Hello Rafael,
> > >
> > >> -----Original Message-----
> > >> From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> > >> Sent: Tuesday, October 10, 2023 12:54 AM
> > >> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> > >> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> > >> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani
> > >> <jani.saarinen@intel.com>
> > >> Subject: Re: Regression in linux-next
> > >>
> > >> Hi,
> > >>
> > >> On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
> > >>> Hello Rafael
> > >>>
> > >>>> Thanks for the report, I think that this is a lockdep assertion failing.
> > >>>> If that is correct, it should be straightforward to fix.
> > >>>> I'll take care of this early next week.
> > >>>> Thanks!
> > >>> Thank you for your response.  Please let us know when a fix is available.
> > >> It should be fixed in linux-next from today, by this commit:
> > >>
> > >> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b44444027ce7714f309e96b804b7fb088a40d708
> > >>
> > >> Thanks!
> > > Thanks a lot for the fix. This seems to have fixed the issue in most of the machines but we are still seeing a similar problem in a few of the machines.
> >
> > Thanks for reporting this!
> >
> >
> > > This has a different call stack but seems to be from the same
> > > thermal subsystem. Full logs in [1]
> > >
> > > <4>[    4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
> > > <4>[    4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
> > > <4>[    4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
> > > <4>[    4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
> > > <4>[    4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
> > > <4>[    4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
> > > <4>[    4.392069] RSP: 0018:ffffc9000156bda8 EFLAGS: 00010246
> > > <4>[    4.392073] RAX: 0000000000000000 RBX: ffff888103828ae8 RCX: 0000000000000001
> > > <4>[    4.392075] RDX: 0000000080000000 RSI: ffffffff823de5ab RDI: ffffffff823fdfba
> > > <4>[    4.392078] RBP: ffff888103a88800 R08: ffff888103828ae8 R09: 0000000000000001
> > > <4>[    4.392080] R10: 0000000000000001 R11: ffff88811494d3c0 R12: ffff888103a88818
> > > <4>[    4.392082] R13: ffff8881108bfa00 R14: ffff888103794408 R15: 0000000000000001
> > > <4>[    4.392084] FS:  00007f1f0d6d28c0(0000) GS:ffff88822e680000(0000) knlGS:0000000000000000
> > > <4>[    4.392087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > <4>[    4.392089] CR2: 000055857c50b750 CR3: 0000000111efa005 CR4: 00000000003706f0
> > > <4>[    4.392091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > <4>[    4.392093] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > <4>[    4.392095] Call Trace:
> > > <4>[    4.392097]  <TASK>
> > > <4>[    4.392100]  ? __warn+0x7f/0x170
> > > <4>[    4.392104]  ? thermal_zone_trip_id+0x61/0x70
> > > <4>[    4.392109]  ? report_bug+0x1f8/0x200
> > > <4>[    4.392116]  ? handle_bug+0x3c/0x70
> > > <4>[    4.392119]  ? exc_invalid_op+0x18/0x70
> > > <4>[    4.392123]  ? asm_exc_invalid_op+0x1a/0x20
> > > <4>[    4.392133]  ? thermal_zone_trip_id+0x61/0x70
> > > <4>[    4.392137]  ? thermal_zone_trip_id+0x5d/0x70
> > > <4>[    4.392141]  trip_point_show+0x18/0x40
> > > <4>[    4.392145]  dev_attr_show+0x15/0x60
> > > <4>[    4.392149]  sysfs_kf_seq_show+0xb5/0x100
> > > <4>[    4.392154]  seq_read_iter+0x111/0x450
> > > <4>[    4.392158]  ? check_object+0x133/0x320
> > > <4>[    4.392164]  vfs_read+0x20d/0x300
> > > <4>[    4.392175]  ksys_read+0x64/0xe0
> > > <4>[    4.392180]  do_syscall_64+0x3c/0x90
> > > <4>[    4.392183]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > > <4>[    4.392187] RIP: 0033:0x7f1f0e193392
> > >
> > > Can you please check what could be the reason for this issue?
> >
> > Well, one more unuseful lockdep assertion has been added recently to
> > the thermal core, sorry about that.
> >
> > This commit
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=108ffd12be24ba1d74b3314df8db32a0a6d55ba5
> >
> > that will be merged into linux-next tomorrow if all goes well, should
> > address this.
> 
> Thank you for the fix. We will wait for it to get merged in linux-next.
> 

Happy to let you know that we did not see these issues in the latest linux-next run.

Thanks a lot for your quick resolution.

Regards

Chaitanya

> Regards
> 
> Chaitanya
> 
> >
> > Thanks!
> >
> >
> > > [1] https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/boot0.txt
> > >
> > > Regards
> > >
> > > Chaitanya
> > >
> > >
> > >
> > >
> > >>
> > >>> From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> > >>> Sent: Saturday, October 7, 2023 2:01 AM
> > >>> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> > >>> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> > >>> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani
> > >>> <jani.saarinen@intel.com>
> > >>> Subject: Re: Regression in linux-next
> > >>>
> > >>> Hi,
> > >>> On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
> > >>> Hello Rafael,
> > >>>
> > >>> Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
> > >>> This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
> > >>> Thanks for the report, I think that this is a lockdep assertion failing.
> > >>> If that is correct, it should be straightforward to fix.
> > >>> I'll take care of this early next week.
> > >>> Thanks!
> > >>>
> > >>> On next-20231003 [2], we are seeing the following error
> > >>>
> > >>> <4>[   14.093075] ------------[ cut here ]------------
> > >>> <4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
> > >>> <4>[   14.106977] Modules linked in:
> > >>> <4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W 6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
> > >>> <4>[   14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
> > >>> <4>[   14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
> > >>> <4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
> > >>>
> > >>> Details log can be found in [3].
> > >>>
> > >>> After bisecting the tree, the following patch [4] seems to be causing the regression.
> > >>> commit d5ea889246b112e228433a5f27f57af90ca0c1fb
> > >>> Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >>> Date:   Thu Sep 21 20:02:59 2023 +0200
> > >>>
> > >>>       ACPI: thermal: Do not use trip indices for cooling device binding
> > >>>
> > >>>       Rearrange the ACPI thermal driver's callback functions used for cooling
> > >>>       device binding and unbinding, acpi_thermal_bind_cooling_device() and
> > >>>       acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
> > >>>       pointers instead of trip indices which is more straightforward and allows
> > >>>       the driver to become independent of the ordering of trips in the thermal
> > >>>       zone structure.
> > >>>
> > >>>       The general functionality is not expected to be changed.
> > >>>
> > >>>       Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >>>       Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> > >>>
> > >>> We also verified by moving the head of the tree to the previous commit.
> > >>>
> > >>> Could you please check why this patch causes the regression and if we can find a solution for it soon?
> > >>> [1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
> > >>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
> > >>> [3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
> > >>> [4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb


end of thread, other threads:[~2023-10-13 14:05 UTC | newest]

Thread overview: 14+ messages
     [not found] <SJ1PR11MB6129592BDF5D06949F99816CB95B9@SJ1PR11MB6129.namprd11.prod.outlook.com>
     [not found] ` <SJ1PR11MB6129A7F5C08E2C47748F2BA5B97E9@SJ1PR11MB6129.namprd11.prod.outlook.com>
     [not found]   ` <SJ1PR11MB612980562220A376CA90E105B97E9@SJ1PR11MB6129.namprd11.prod.outlook.com>
2023-07-25  6:42     ` [Intel-gfx] Regression in linux-next Borah, Chaitanya Kumar
2023-07-25 10:53       ` Tvrtko Ursulin
2023-07-26  3:55         ` Borah, Chaitanya Kumar
2023-07-25 13:15       ` Alistair Popple
2023-07-26  3:53         ` Borah, Chaitanya Kumar
2023-07-26 16:09 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for " Patchwork
2023-10-05 15:58 [Intel-gfx] " Borah, Chaitanya Kumar
2023-10-06 20:30 ` Wysocki, Rafael J
2023-10-09  5:10   ` Borah, Chaitanya Kumar
2023-10-09 19:23     ` Wysocki, Rafael J
2023-10-11  4:00       ` Borah, Chaitanya Kumar
2023-10-11 16:14         ` Wysocki, Rafael J
2023-10-11 16:49           ` Borah, Chaitanya Kumar
2023-10-13 14:05             ` Borah, Chaitanya Kumar
