Intel-XE Archive on lore.kernel.org
* Regression on linux-next (next-20250321)
@ 2025-03-25  5:39 Borah, Chaitanya Kumar
  2025-03-25  7:40 ` Nicolin Chen
  0 siblings, 1 reply; 8+ messages in thread
From: Borah, Chaitanya Kumar @ 2025-03-25  5:39 UTC (permalink / raw)
  To: nicolinc@nvidia.com
  Cc: iommu@lists.linux.dev, intel-gfx@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Kurmi, Suresh Kumar,
	Saarinen, Jani

Hello Nicolin,

Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on the linux-next repository.

Since version next-20250321 [2], we have been seeing the following regression:

`````````````````````````````````````````````````````````````````````````````````
<4>[    0.226495] Unpatched return thunk in use. This should not happen!
<4>[    0.226502] WARNING: CPU: 0 PID: 1 at arch/x86/kernel/cpu/bugs.c:3107 __warn_thunk+0x62/0x70
<4>[    0.226513] Modules linked in:
<4>[    0.226521] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc7-next-20250321-next-20250321-g9388ec571cb1+ #1 PREEMPT(voluntary) 
<4>[    0.226532] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 0812 02/24/2023
<4>[    0.226539] RIP: 0010:__warn_thunk+0x62/0x70
<4>[    0.226544] Code: 34 4c 5d 02 01 e8 fe f6 a7 00 84 c0 75 d9 48 c7 c7 f8 bf 0d 83 e8 7e c6 08 00 48 c7 c7 a0 a2 a0 82 e8 e2 f6 a7 00 84 c0 75 bd <0f> 0b eb b9 cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90
<4>[    0.226559] RSP: 0000:ffffc90000067d78 EFLAGS: 00010246
<4>[    0.226565] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
<4>[    0.226571] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
<4>[    0.226577] RBP: ffffc90000067d80 R08: 0000000000000000 R09: 0000000000000000
<4>[    0.226583] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[    0.226589] R13: ffffffff83c9417c R14: ffff88887f344bc0 R15: ffff888102370100
<4>[    0.226595] FS:  0000000000000000(0000) GS:ffff8888dacfd000(0000) knlGS:0000000000000000
<4>[    0.226602] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[    0.226608] CR2: ffff88887f7ff000 CR3: 000000000344a000 CR4: 0000000000f50ef0
<4>[    0.226614] PKRU: 55555554
<4>[    0.226617] Call Trace:
<4>[    0.226620]  <TASK>
<4>[    0.226624]  ? show_regs+0x6c/0x80
<4>[    0.226630]  ? __warn+0x94/0x210
<4>[    0.226635]  ? __warn_thunk+0x62/0x70
<4>[    0.226640]  ? __report_bug+0x110/0x280
<4>[    0.227000]  ? __lock_acquire+0x447/0x2c70
<4>[    0.227011]  ? _prb_read_valid+0x25a/0x310
<4>[    0.227018]  ? __lock_acquire+0x447/0x2c70
<4>[    0.227024]  ? prb_read_valid+0x1c/0x30
<4>[    0.227037]  ? lock_acquire+0xc4/0x330
<4>[    0.227055]  ? _prb_read_valid+0x25a/0x310
<4>[    0.227073]  ? __warn_thunk+0x62/0x70
<4>[    0.227081]  ? report_bug+0x24/0x80
<4>[    0.227089]  ? handle_bug+0x16a/0x2a0
<4>[    0.227098]  ? exc_invalid_op+0x18/0x80
<4>[    0.227106]  ? asm_exc_invalid_op+0x1b/0x20
<4>[    0.227122]  ? __warn_thunk+0x62/0x70
<4>[    0.227130]  ? __warn_thunk+0x5e/0x70
<4>[    0.227135]  ? iommu_dma_ranges_sort+0x40/0x40
<4>[    0.227144]  warn_thunk_thunk+0x16/0x30
<4>[    0.227157]  do_one_initcall+0x5d/0x460
<4>[    0.227171]  kernel_init_freeable+0x3ac/0x530
<4>[    0.227187]  ? __pfx_kernel_init+0x10/0x10
<4>[    0.227196]  kernel_init+0x1b/0x200
<4>[    0.227203]  ret_from_fork+0x44/0x70
<4>[    0.227210]  ? __pfx_kernel_init+0x10/0x10
<4>[    0.227217]  ret_from_fork_asm+0x1a/0x30
<4>[    0.227236]  </TASK>
`````````````````````````````````````````````````````````````````````````````````
The detailed log can be found in [3].

After bisecting the tree, the following patch [4] seems to be the first "bad"
commit:

`````````````````````````````````````````````````````````````````````````````````````````````````````````
commit e009e088d88e8402539f9595b10c0014125a70c1
Author: Nicolin Chen <nicolinc@nvidia.com>
Date:   Thu Mar 6 13:00:49 2025 -0800

    iommu: Drop sw_msi from iommu_domain

    There are only two sw_msi implementations in the entire system, thus it's
    not very necessary to have an sw_msi pointer.

    Instead, check domain->cookie_type to call the two sw_msi implementations
    directly from the core code.
`````````````````````````````````````````````````````````````````````````````````````````````````````````
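
For readers unfamiliar with the pattern the commit message describes, here is a minimal sketch of dispatching on a type tag instead of calling through a per-domain function pointer. All names in it (`cookie_type`, `struct domain`, `dma_sw_msi`, `user_sw_msi`, `prepare_msi`) are hypothetical stand-ins for illustration and are not copied from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type tag; the real kernel enum is not reproduced here. */
enum cookie_type { COOKIE_NONE, COOKIE_DMA, COOKIE_USER };

/* Hypothetical domain: it carries only a tag, no function pointer. */
struct domain {
	enum cookie_type cookie_type;
};

/* Two stand-in implementations, distinguished by their return value. */
static int dma_sw_msi(struct domain *d)  { (void)d; return 1; }
static int user_sw_msi(struct domain *d) { (void)d; return 2; }

/* Core code picks the implementation directly from the tag. */
static int prepare_msi(struct domain *d)
{
	assert(d != NULL);

	switch (d->cookie_type) {
	case COOKIE_DMA:
		return dma_sw_msi(d);
	case COOKIE_USER:
		return user_sw_msi(d);
	default:
		return 0;	/* nothing to do for other domain types */
	}
}
```

With only two implementations, the switch keeps the call targets visible to the compiler and removes one pointer from every domain object. Note that every path through the sketch ends in an explicit return.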

We also verified that the issue is not seen when the patch is reverted.

Could you please check why the patch causes this regression and provide a fix if necessary?

Thank you.

Regards

Chaitanya

[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://web.git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20250321 
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20250321/bat-rpls-4/boot0.txt 
[4] https://web.git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20250321&id=e009e088d88e8402539f9595b10c0014125a70c1



* Re: Regression on linux-next (next-20250321)
  2025-03-25  5:39 Regression on linux-next (next-20250321) Borah, Chaitanya Kumar
@ 2025-03-25  7:40 ` Nicolin Chen
  2025-03-25 13:43   ` Jason Gunthorpe
  2025-03-26  8:31   ` Borah, Chaitanya Kumar
  0 siblings, 2 replies; 8+ messages in thread
From: Nicolin Chen @ 2025-03-25  7:40 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar
  Cc: iommu@lists.linux.dev, intel-gfx@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Kurmi, Suresh Kumar,
	Saarinen, Jani, jgg

(CC += Jason)

Hi Chaitanya,

On Tue, Mar 25, 2025 at 05:39:39AM +0000, Borah, Chaitanya Kumar wrote:
> Hello Nicolin,
> 
> Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
> 
> This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
> 
> Since the version next-20250321 [2], we are seeing the following regression
> 
> `````````````````````````````````````````````````````````````````````````````````
> <4>[    0.226495] Unpatched return thunk in use. This should not happen!
> <4>[    0.226502] WARNING: CPU: 0 PID: 1 at arch/x86/kernel/cpu/bugs.c:3107 __warn_thunk+0x62/0x70

Hmm....I wonder why x86 can be affected...

The only four callers of iommu_dma_prepare_msi() are ARM platforms.

> [...]
> `````````````````````````````````````````````````````````````````````````````````
> Details log can be found in [3].

And I can't see anything obvious in the log either.

Would you please give the git-diff a try (drivers/iommu/iommu.c)?
https://lore.kernel.org/linux-iommu/Z+Itnw4ys6dmDsc+@nvidia.com/

If this doesn't help, would you please give this a try?
https://lore.kernel.org/linux-iommu/20250324170743.GA1339275@ax162/

Thanks!
Nicolin


* Re: Regression on linux-next (next-20250321)
  2025-03-25  7:40 ` Nicolin Chen
@ 2025-03-25 13:43   ` Jason Gunthorpe
  2025-03-27  5:39     ` Josh Poimboeuf
  2025-03-26  8:31   ` Borah, Chaitanya Kumar
  1 sibling, 1 reply; 8+ messages in thread
From: Jason Gunthorpe @ 2025-03-25 13:43 UTC (permalink / raw)
  To: Nicolin Chen, Josh Poimboeuf, Peter Zijlstra
  Cc: Borah, Chaitanya Kumar, iommu@lists.linux.dev,
	intel-gfx@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Kurmi, Suresh Kumar, Saarinen, Jani

On Tue, Mar 25, 2025 at 12:40:16AM -0700, Nicolin Chen wrote:
> 
> On Tue, Mar 25, 2025 at 05:39:39AM +0000, Borah, Chaitanya Kumar wrote:
> > Hello Nicolin,
> > 
> > Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
> > 
> > This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
> > 
> > Since the version next-20250321 [2], we are seeing the following regression
> > 
> > `````````````````````````````````````````````````````````````````````````````````
> > <4>[    0.226495] Unpatched return thunk in use. This should not happen!
> > <4>[    0.226502] WARNING: CPU: 0 PID: 1 at arch/x86/kernel/cpu/bugs.c:3107 __warn_thunk+0x62/0x70
> 
> Hmm....I wonder why x86 can be affected...

I wonder if this is related to the objtool warning Steven reported:

https://lore.kernel.org/linux-next/20250321193600.2bfe03bb@canb.auug.org.au/

vmlinux.o: warning: objtool: iommu_dma_get_msi_page() falls through to next function __iommu_dma_unmap()

I have no idea what either error means or how to fix it. AFAICT there
is nothing special about this patch to trigger this?

+Peter & Josh

Jason


* RE: Regression on linux-next (next-20250321)
  2025-03-25  7:40 ` Nicolin Chen
  2025-03-25 13:43   ` Jason Gunthorpe
@ 2025-03-26  8:31   ` Borah, Chaitanya Kumar
  2025-03-26 20:43     ` Nicolin Chen
  1 sibling, 1 reply; 8+ messages in thread
From: Borah, Chaitanya Kumar @ 2025-03-26  8:31 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: iommu@lists.linux.dev, intel-gfx@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Kurmi, Suresh Kumar,
	Saarinen, Jani, jgg@nvidia.com



> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, March 25, 2025 1:10 PM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> Cc: iommu@lists.linux.dev; intel-gfx@lists.freedesktop.org;
> intel-xe@lists.freedesktop.org; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>;
> jgg@nvidia.com
> Subject: Re: Regression on linux-next (next-20250321)
> 
> (CC += Jason)
> 
> Hi Chaitanya,
> 
> On Tue, Mar 25, 2025 at 05:39:39AM +0000, Borah, Chaitanya Kumar wrote:
> > Hello Nicolin,
> >
> > Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
> >
> > This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
> >
> > Since the version next-20250321 [2], we are seeing the following regression
> >
> > `````````````````````````````````````````````````````````````````````````````````
> > <4>[    0.226495] Unpatched return thunk in use. This should not happen!
> > <4>[    0.226502] WARNING: CPU: 0 PID: 1 at arch/x86/kernel/cpu/bugs.c:3107 __warn_thunk+0x62/0x70
> 
> Hmm....I wonder why x86 can be affected...
> 
> The only four callers of iommu_dma_prepare_msi() are ARM platforms.
> 
> > [...]
> > `````````````````````````````````````````````````````````````````````````````````
> > Details log can be found in [3].
> 
> And I can't see something obvious from the log..
> 
> Would you please give the git-diff a try (drivers/iommu/iommu.c)?
> https://lore.kernel.org/linux-iommu/Z+Itnw4ys6dmDsc+@nvidia.com/
> 
> If this doesn't help, would you please give this a try?
> https://lore.kernel.org/linux-iommu/20250324170743.GA1339275@ax162/
> 

Thank you, Nicolin, for your reply. Unfortunately, these changes do not solve the issue (applied individually and together).

Regards

Chaitanya

> Thanks!
> Nicolin


* Re: Regression on linux-next (next-20250321)
  2025-03-26  8:31   ` Borah, Chaitanya Kumar
@ 2025-03-26 20:43     ` Nicolin Chen
  2025-03-27  8:46       ` Borah, Chaitanya Kumar
  0 siblings, 1 reply; 8+ messages in thread
From: Nicolin Chen @ 2025-03-26 20:43 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar
  Cc: iommu@lists.linux.dev, intel-gfx@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Kurmi, Suresh Kumar,
	Saarinen, Jani, jgg@nvidia.com

Hi Chaitanya,

On Wed, Mar 26, 2025 at 08:31:15AM +0000, Borah, Chaitanya Kumar wrote:
> > > `````````````````````````````````````````````````````````````````````````````````
> > > <4>[    0.226495] Unpatched return thunk in use. This should not happen!
> > > <4>[    0.226502] WARNING: CPU: 0 PID: 1 at
> > arch/x86/kernel/cpu/bugs.c:3107 __warn_thunk+0x62/0x70
> > 
> > Hmm....I wonder why x86 can be affected...
> > 
> > The only four callers of iommu_dma_prepare_msi() are ARM platforms.
[...]
> > > Details log can be found in [3].
> > 
> > And I can't see something obvious from the log..
> > 
> > Would you please give the git-diff a try (drivers/iommu/iommu.c)?
> > https://lore.kernel.org/linux-iommu/Z+Itnw4ys6dmDsc+@nvidia.com/
> > 
> > If this doesn't help, would you please give this a try?
> > https://lore.kernel.org/linux-iommu/20250324170743.GA1339275@ax162/
> > 
> 
> Thank you, Nicolin, for your reply. Unfortunately, these changes
> do not solve the issue (applied individually and together).

Would you please try the latest linux-next (next-20250326) and see
if the issue still occurs?

If it does occur with next-20250326, would you please try reverting
06d54f00f3f5a on top of the tree rather than bisect?
 "06d54f00f3f5 iommu: Drop sw_msi from iommu_domain"

It still feels odd to me and Jason that this change would break x86.
So, we want to confirm that this is really the culprit.

Also, I think your platform doesn't set CONFIG_IRQ_MSI_IOMMU. Would
you please check your .config file and confirm?
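
A minimal sketch of that check, assuming the usual .config conventions: an enabled option appears as `CONFIG_FOO=y` (or `=m`) and a disabled one as `# CONFIG_FOO is not set`. The helper name `config_enabled` is hypothetical, and it works on the file contents passed in as a string.

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if `opt` appears as "opt=y" or "opt=m" at the start of a line
 * in `text` (the contents of a kernel .config file), 0 otherwise. Lines
 * such as "# opt is not set" start with '#' and therefore never match.
 */
static int config_enabled(const char *text, const char *opt)
{
	size_t n = strlen(opt);
	const char *line = text;

	assert(n > 0);
	while (line && *line) {
		if (strncmp(line, opt, n) == 0 && line[n] == '=' &&
		    (line[n + 1] == 'y' || line[n + 1] == 'm'))
			return 1;
		line = strchr(line, '\n');
		if (line)
			line++;		/* step past the newline */
	}
	return 0;
}
```

Requiring `=` immediately after the option name avoids a false match on a longer option that merely shares the prefix.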

Thanks
Nicolin


* Re: Regression on linux-next (next-20250321)
  2025-03-25 13:43   ` Jason Gunthorpe
@ 2025-03-27  5:39     ` Josh Poimboeuf
  2025-03-27  8:09       ` Borah, Chaitanya Kumar
  0 siblings, 1 reply; 8+ messages in thread
From: Josh Poimboeuf @ 2025-03-27  5:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nicolin Chen, Peter Zijlstra, Borah, Chaitanya Kumar,
	iommu@lists.linux.dev, intel-gfx@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Kurmi, Suresh Kumar,
	Saarinen, Jani

On Tue, Mar 25, 2025 at 10:43:17AM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 25, 2025 at 12:40:16AM -0700, Nicolin Chen wrote:
> > 
> > On Tue, Mar 25, 2025 at 05:39:39AM +0000, Borah, Chaitanya Kumar wrote:
> > > Hello Nicolin,
> > > 
> > > Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
> > > 
> > > This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
> > > 
> > > Since the version next-20250321 [2], we are seeing the following regression
> > > 
> > > `````````````````````````````````````````````````````````````````````````````````
> > > <4>[    0.226495] Unpatched return thunk in use. This should not happen!
> > > <4>[    0.226502] WARNING: CPU: 0 PID: 1 at arch/x86/kernel/cpu/bugs.c:3107 __warn_thunk+0x62/0x70
> > 
> > Hmm....I wonder why x86 can be affected...
> 
> I wonder if this is related to the objtool warning Steven reported:
> 
> https://lore.kernel.org/linux-next/20250321193600.2bfe03bb@canb.auug.org.au/
> 
> vmlinux.o: warning: objtool: iommu_dma_get_msi_page() falls through to next function __iommu_dma_unmap()
> 
> I have no idea what either error means or how to fix it. AFAICT there
> is nothing special about this patch to trigger this?

Yeah, I'm fairly sure the boot warning is related to that objtool
warning.  I just posted a patch for that:

  https://lore.kernel.org/0c801ae017ec078cacd39f8f0898fc7780535f85.1743053325.git.jpoimboe@kernel.org

But also, we need to fix objtool to handle that warning more gracefully
so it doesn't trigger the boot failure.
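
For context, here is a minimal sketch of one general way a "falls through to next function" shape can arise: a path the compiler has been told is unreachable may be emitted without a return instruction. This illustrates the class of problem only, under that assumption. It is not the code touched by the patch above, and the names `kind`, `weight_unchecked` and `weight_checked` are hypothetical.

```c
#include <assert.h>

/* Hypothetical type tag for illustration; not taken from the kernel source. */
enum kind { KIND_A, KIND_B };

/*
 * Hazardous shape: if `k` is ever outside the enum, control reaches
 * __builtin_unreachable() (a GCC/Clang builtin), which is undefined
 * behaviour. The compiler may emit no return instruction on that path,
 * so execution could continue into whatever code follows in memory.
 */
static int weight_unchecked(enum kind k)
{
	switch (k) {
	case KIND_A:
		return 10;
	case KIND_B:
		return 20;
	}
	__builtin_unreachable();
}

/* Safer shape: every path ends in an explicit return. */
static int weight_checked(enum kind k)
{
	switch (k) {
	case KIND_A:
		return 10;
	case KIND_B:
		return 20;
	}
	return -1;	/* unexpected value: report an error instead */
}
```

Both functions behave identically for valid inputs. They differ only in what happens for an out-of-range value, which is the case a static checker flags.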

-- 
Josh


* RE: Regression on linux-next (next-20250321)
  2025-03-27  5:39     ` Josh Poimboeuf
@ 2025-03-27  8:09       ` Borah, Chaitanya Kumar
  0 siblings, 0 replies; 8+ messages in thread
From: Borah, Chaitanya Kumar @ 2025-03-27  8:09 UTC (permalink / raw)
  To: Josh Poimboeuf, Jason Gunthorpe
  Cc: Nicolin Chen, Peter Zijlstra, iommu@lists.linux.dev,
	intel-gfx@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	Kurmi, Suresh Kumar, Saarinen, Jani



> -----Original Message-----
> From: Josh Poimboeuf <jpoimboe@kernel.org>
> Sent: Thursday, March 27, 2025 11:09 AM
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Nicolin Chen <nicolinc@nvidia.com>; Peter Zijlstra
> <peterz@infradead.org>; Borah, Chaitanya Kumar
> <chaitanya.kumar.borah@intel.com>; iommu@lists.linux.dev;
> intel-gfx@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Kurmi, Suresh
> Kumar <suresh.kumar.kurmi@intel.com>; Saarinen, Jani
> <jani.saarinen@intel.com>
> Subject: Re: Regression on linux-next (next-20250321)
> 
> On Tue, Mar 25, 2025 at 10:43:17AM -0300, Jason Gunthorpe wrote:
> > On Tue, Mar 25, 2025 at 12:40:16AM -0700, Nicolin Chen wrote:
> > >
> > > On Tue, Mar 25, 2025 at 05:39:39AM +0000, Borah, Chaitanya Kumar wrote:
> > > > Hello Nicolin,
> > > >
> > > > Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
> > > >
> > > > This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
> > > >
> > > > Since the version next-20250321 [2], we are seeing the following
> > > > regression
> > > >
> > > > `````````````````````````````````````````````````````````````````````````````````
> > > > <4>[    0.226495] Unpatched return thunk in use. This should not happen!
> > > > <4>[    0.226502] WARNING: CPU: 0 PID: 1 at arch/x86/kernel/cpu/bugs.c:3107 __warn_thunk+0x62/0x70
> > >
> > > Hmm....I wonder why x86 can be affected...
> >
> > I wonder if this is related to the objtool warning Steven reported:
> >
> > https://lore.kernel.org/linux-next/20250321193600.2bfe03bb@canb.auug.org.au/
> >
> > vmlinux.o: warning: objtool: iommu_dma_get_msi_page() falls through to next function __iommu_dma_unmap()
> >
> > I have no idea what either error means or how to fix it. AFAICT there
> > is nothing special about this patch to trigger this?
> 
> Yeah, I'm fairly sure the boot warning is related to that objtool warning.  I just
> posted a patch for that:
> 
> 
> https://lore.kernel.org/0c801ae017ec078cacd39f8f0898fc7780535f85.1743053325.git.jpoimboe@kernel.org
> 
> But also, we need to fix objtool to handle that warning more gracefully so it
> doesn't trigger the boot failure.

Thank you for the change.
We can confirm that this gets rid of the boot failures. Will it land in linux-next soon?

Regards

Chaitanya


> 
> --
> Josh


* RE: Regression on linux-next (next-20250321)
  2025-03-26 20:43     ` Nicolin Chen
@ 2025-03-27  8:46       ` Borah, Chaitanya Kumar
  0 siblings, 0 replies; 8+ messages in thread
From: Borah, Chaitanya Kumar @ 2025-03-27  8:46 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: iommu@lists.linux.dev, intel-gfx@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, Kurmi, Suresh Kumar,
	Saarinen, Jani, jgg@nvidia.com



> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, March 27, 2025 2:14 AM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> Cc: iommu@lists.linux.dev; intel-gfx@lists.freedesktop.org;
> intel-xe@lists.freedesktop.org; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>;
> jgg@nvidia.com
> Subject: Re: Regression on linux-next (next-20250321)
> 
> Hi Chaitanya,
> 
> On Wed, Mar 26, 2025 at 08:31:15AM +0000, Borah, Chaitanya Kumar wrote:
> > > > `````````````````````````````````````````````````````````````````````````````````
> > > > <4>[    0.226495] Unpatched return thunk in use. This should not happen!
> > > > <4>[    0.226502] WARNING: CPU: 0 PID: 1 at arch/x86/kernel/cpu/bugs.c:3107 __warn_thunk+0x62/0x70
> > >
> > > Hmm....I wonder why x86 can be affected...
> > >
> > > The only four callers of iommu_dma_prepare_msi() are ARM platforms.
> [...]
> > > > Details log can be found in [3].
> > >
> > > And I can't see something obvious from the log..
> > >
> > > Would you please give the git-diff a try (drivers/iommu/iommu.c)?
> > > https://lore.kernel.org/linux-iommu/Z+Itnw4ys6dmDsc+@nvidia.com/
> > >
> > > If this doesn't help, would you please give this a try?
> > > https://lore.kernel.org/linux-iommu/20250324170743.GA1339275@ax162/
> > >
> >
> > Thank you, Nicolin, for your reply. Unfortunately, these changes do
> > not solve the issue (applied individually and together).
> 
> Would you please try the latest linux-next (next-20250326) and see if the
> issue still occurs?
> 
> If it does occur with next-20250326, would you please try reverting
> 06d54f00f3f5a on top of the tree rather than bisect?
>  "06d54f00f3f5 iommu: Drop sw_msi from iommu_domain"
> 

We have tried both of these but the error persists.

> It still feels odd to me and Jason that this change would break x86.
> So, we want to confirm that this is really the culprit.
> 

As discussed in the other thread [1], I think it makes sense that objtool is injecting the error.
That would explain it.

[1] https://lore.kernel.org/intel-gfx/Z+Rm9LweNAtQBrmD@nvidia.com/T/#t

Regards

Chaitanya

> Also, I think your platform doesn't set CONFIG_IRQ_MSI_IOMMU. Would you
> please check your .config file and confirm?
> 
> Thanks
> Nicolin


