kvm.vger.kernel.org archive mirror
* vhost: linux-next: crash at vhost_dev_cleanup()
@ 2025-07-23 15:04 Breno Leitao
  2025-07-23 19:09 ` Michael S. Tsirkin
  2025-07-24  7:47 ` Michael S. Tsirkin
  0 siblings, 2 replies; 12+ messages in thread
From: Breno Leitao @ 2025-07-23 15:04 UTC
  To: mst, jasowang, eperezma; +Cc: linux-arm-kernel, kvm

Hello,

I've been seeing a crash in linux-next for a while on my arm64 server,
and decided to report it.

While running stress-ng on linux-next, I see the crash below.

This happens on a kernel configured with some debug options (KASAN,
LOCKDEP and KMEMLEAK).

Running stress-ng in a loop crashes the host within 15-20 minutes:
	# while (true); do stress-ng -r 10 -t 10; done
	
Judging from the earlier warning "virt_to_phys used for non-linear
address", I suppose the corrupted pointer is vq->nheads.
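
The pointer value printed in that warning has a recognisable shape. As a minimal sketch (Python, not part of the original report), the check below shows that 0xcfcecdcccbcac9c8 is a run of incrementing bytes when laid out in memory order, assuming the usual little-endian arm64 configuration. That would be consistent with the field holding data bytes rather than an address, whether it was overwritten or never initialised; the log alone does not say which. The helper names are made up for illustration, and the 0xffff prefix test is only a crude heuristic based on the kernel addresses visible in this log.

```python
# Value that kfree() was asked to free, as printed in the warning line.
BAD_PTR = 0xCFCECDCCCBCAC9C8


def little_endian_bytes(value: int) -> list[int]:
    """Return the 8 bytes of a 64-bit value in little-endian memory order."""
    return list(value.to_bytes(8, "little"))


def is_incrementing(data: list[int]) -> bool:
    """True if every byte is exactly one greater than the previous byte."""
    return all(b - a == 1 for a, b in zip(data, data[1:]))


def looks_like_kernel_pointer(value: int) -> bool:
    """Crude heuristic (assumption): kernel pointers in this log start with 0xffff."""
    return value >> 48 == 0xFFFF


mem = little_endian_bytes(BAD_PTR)
print([hex(b) for b in mem])
print("incrementing bytes:", is_incrementing(mem))
print("plausible kernel pointer:", looks_like_kernel_pointer(BAD_PTR))
```

The same check can be pointed at any other suspicious register value from the dump by changing BAD_PTR.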

Here is the decoded stack trace against 9798752 ("Add linux-next
specific files for 20250721"):


	[  620.685144] [ T250731] VFIO - User Level meta-driver version: 0.3
	[  622.394448] [ T250254] ------------[ cut here ]------------
	[  622.413492] [ T250254] virt_to_phys used for non-linear address: 000000006e69fe64 (0xcfcecdcccbcac9c8)
	[  622.447771] [     T250254] WARNING: arch/arm64/mm/physaddr.c:15 at __virt_to_phys+0x64/0x90, CPU#57: stress-ng-dev/250254 
	[  622.487227] [ T250254] Modules linked in: vhost_vsock(E) vfio_iommu_type1(E) vfio(E) unix_diag(E) sch_fq(E) ghes_edac(E) tls(E) tcp_diag(E) inet_diag(E) act_gact(E) cls_bpf(E) nvidia_cspmu(E) ipmi_ssif(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) ipmi_devintf(E) coresight_stm(E) coresight_funnel(E) coresight_etm4x(E) coresight_tmc(E) stm_core(E) ipmi_msghandler(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) sm3_ce(E) sha3_ce(E) spi_tegra210_quad(E) vhost_net(E) tap(E) tun(E) vhost(E) vhost_iotlb(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) [last unloaded: test_bpf(E)]
	[  622.734524] [ T250254] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
	[  622.734525] [ T250254] Hardware name: ...
	[  622.734526] [ T250254] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
	[  622.734529] [     T250254] pc : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?) 
	[  622.734531] [     T250254] lr : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?) 
	[  622.734533] [ T250254] sp : ffff800158e8fc60
	[  622.734534] [ T250254] x29: ffff800158e8fc60 x28: ffff0034a7cc7900 x27: 0000000000000000
	[  622.734537] [ T250254] x26: 0000000000000000 x25: ffff0034a7cc7900 x24: 00000000040e001f
	[  622.734539] [ T250254] x23: ffff0010858afb00 x22: cfcecdcccbcac9c8 x21: ffff0033526a01e0
	[  622.734541] [ T250254] x20: 0000000000008000 x19: ffcecdcccbcac9c8 x18: ffff80008149c8e4
	[  622.734543] [ T250254] x17: 0000000000000001 x16: 0000000000000000 x15: 0000000000000003
	[  622.734545] [ T250254] x14: ffff800082962e78 x13: 0000000000000003 x12: ffff003bc6231630
	[  622.734546] [ T250254] x11: 0000000000000000 x10: 0000000000000000 x9 : ed44a220ae716b00
	[  622.734548] [ T250254] x8 : 0001000000000000 x7 : 0720072007200720 x6 : ffff80008018710c
	[  622.734550] [ T250254] x5 : 0000000000000001 x4 : 00000090ecc72ac0 x3 : 0000000000000000
	[  622.734552] [ T250254] x2 : 0000000000000000 x1 : ffff800081a72bc6 x0 : 000000000000004f
	[  622.734554] [ T250254] Call trace:
	[  622.734555] [     T250254] __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?) (P)
	[  622.734557] [     T250254] kfree (/home/user/Devel/linux-next/./include/linux/mm.h:1180 /home/user/Devel/linux-next/mm/slub.c:4871) 
	[  622.734562] [     T250254] vhost_dev_cleanup (/home/user/Devel/linux-next/drivers/vhost/vhost.c:506 /home/user/Devel/linux-next/drivers/vhost/vhost.c:542 /home/user/Devel/linux-next/drivers/vhost/vhost.c:1214) vhost 
	[  622.734571] [     T250254] vhost_vsock_dev_release (/home/user/Devel/linux-next/drivers/vhost/vsock.c:756) vhost_vsock 
	[  622.734575] [     T250254] __fput (/home/user/Devel/linux-next/fs/file_table.c:469) 
	[  622.734578] [     T250254] fput_close_sync (/home/user/Devel/linux-next/fs/file_table.c:?) 
	[  622.734579] [     T250254] __arm64_sys_close (/home/user/Devel/linux-next/fs/open.c:1589 /home/user/Devel/linux-next/fs/open.c:1572 /home/user/Devel/linux-next/fs/open.c:1572) 
	[  622.734584] [     T250254] invoke_syscall (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:50) 
	[  622.734589] [     T250254] el0_svc_common (/home/user/Devel/linux-next/./include/linux/thread_info.h:135 /home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:140) 
	[  622.734591] [     T250254] do_el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:152) 
	[  622.734594] [     T250254] el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:169 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:182 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:880) 
	[  622.734600] [     T250254] el0t_64_sync_handler (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:958) 
	[  622.734603] [     T250254] el0t_64_sync (/home/user/Devel/linux-next/arch/arm64/kernel/entry.S:596) 
	[  622.734605] [ T250254] irq event stamp: 0
	[  622.734606] [     T250254] hardirqs last enabled at (0): 0x0 
	[  622.734610] [     T250254] hardirqs last disabled at (0): copy_process (/home/user/Devel/linux-next/kernel/fork.c:?) 
	[  622.734614] [     T250254] softirqs last enabled at (0): copy_process (/home/user/Devel/linux-next/kernel/fork.c:?) 
	[  622.734616] [     T250254] softirqs last disabled at (0): 0x0 
	[  622.734618] [ T250254] ---[ end trace 0000000000000000 ]---
	[  622.734697] [ T250254] Unable to handle kernel paging request at virtual address 003ff3b33312f288
	[  622.734700] [ T250254] Mem abort info:
	[  622.734701] [ T250254]   ESR = 0x0000000096000004
	[  622.734702] [ T250254]   EC = 0x25: DABT (current EL), IL = 32 bits
	[  622.734704] [ T250254]   SET = 0, FnV = 0
	[  622.734705] [ T250254]   EA = 0, S1PTW = 0
	[  622.734706] [ T250254]   FSC = 0x04: level 0 translation fault
	[  622.734708] [ T250254] Data abort info:
	[  622.734709] [ T250254]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
	[  622.734711] [ T250254]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
	[  622.734712] [ T250254]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
	[  622.734713] [ T250254] [003ff3b33312f288] address between user and kernel address ranges
	[  622.734715] [ T250254] Internal error: Oops: 0000000096000004 [#1]  SMP
	[  622.734718] [ T250254] Modules linked in: vhost_vsock(E) vfio_iommu_type1(E) vfio(E) unix_diag(E) sch_fq(E) ghes_edac(E) tls(E) tcp_diag(E) inet_diag(E) act_gact(E) cls_bpf(E) nvidia_cspmu(E) ipmi_ssif(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) ipmi_devintf(E) coresight_stm(E) coresight_funnel(E) coresight_etm4x(E) coresight_tmc(E) stm_core(E) ipmi_msghandler(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) sm3_ce(E) sha3_ce(E) spi_tegra210_quad(E) vhost_net(E) tap(E) tun(E) vhost(E) vhost_iotlb(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) [last unloaded: test_bpf(E)]
	[  622.734740] [ T250254] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
	[  622.734740] [ T250254] Hardware name: ...
	[  622.734741] [ T250254] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
	[  622.734742] [     T250254] pc : kfree (/home/user/Devel/linux-next/./include/linux/page-flags.h:284 /home/user/Devel/linux-next/./include/linux/mm.h:1182 /home/user/Devel/linux-next/mm/slub.c:4871) 
	[  622.734745] [     T250254] lr : kfree (/home/user/Devel/linux-next/./include/linux/mm.h:1180 /home/user/Devel/linux-next/mm/slub.c:4871) 
	[  622.734747] [ T250254] sp : ffff800158e8fc80
	[  622.734748] [ T250254] x29: ffff800158e8fc90 x28: ffff0034a7cc7900 x27: 0000000000000000
	[  622.734749] [ T250254] x26: 0000000000000000 x25: ffff0034a7cc7900 x24: 00000000040e001f
	[  622.734751] [ T250254] x23: ffff0010858afb00 x22: cfcecdcccbcac9c8 x21: ffff0033526a01e0
	[  622.734752] [ T250254] x20: 003ff3b33312f280 x19: ffff80000acd1a20 x18: ffff80008149c8e4
	[  622.734754] [ T250254] x17: 0000000000000001 x16: 0000000000000000 x15: 0000000000000003
	[  622.734755] [ T250254] x14: ffff800082962e78 x13: 0000000000000003 x12: ffff003bc6231630
	[  622.734757] [ T250254] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffdfc0000000
	[  622.734758] [ T250254] x8 : 003ff3d37312f280 x7 : 0720072007200720 x6 : ffff80008018710c
	[  622.734760] [ T250254] x5 : 0000000000000001 x4 : 00000090ecc72ac0 x3 : 0000000000000000
	[  622.734761] [ T250254] x2 : 0000000000000000 x1 : ffff800081a72bc6 x0 : ffcf4dcccbcac9c8
	[  622.734763] [ T250254] Call trace:
	[  622.734763] [     T250254] kfree (/home/user/Devel/linux-next/./include/linux/page-flags.h:284 /home/user/Devel/linux-next/./include/linux/mm.h:1182 /home/user/Devel/linux-next/mm/slub.c:4871) (P)
	[  622.734766] [     T250254] vhost_dev_cleanup (/home/user/Devel/linux-next/drivers/vhost/vhost.c:506 /home/user/Devel/linux-next/drivers/vhost/vhost.c:542 /home/user/Devel/linux-next/drivers/vhost/vhost.c:1214) vhost 
	[  622.734769] [     T250254] vhost_vsock_dev_release (/home/user/Devel/linux-next/drivers/vhost/vsock.c:756) vhost_vsock 
	[  622.734771] [     T250254] __fput (/home/user/Devel/linux-next/fs/file_table.c:469) 
	[  622.734772] [     T250254] fput_close_sync (/home/user/Devel/linux-next/fs/file_table.c:?) 
	[  622.734773] [     T250254] __arm64_sys_close (/home/user/Devel/linux-next/fs/open.c:1589 /home/user/Devel/linux-next/fs/open.c:1572 /home/user/Devel/linux-next/fs/open.c:1572) 
	[  622.734776] [     T250254] invoke_syscall (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:50) 
	[  622.734778] [     T250254] el0_svc_common (/home/user/Devel/linux-next/./include/linux/thread_info.h:135 /home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:140) 
	[  622.734781] [     T250254] do_el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:152) 
	[  622.734783] [     T250254] el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:169 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:182 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:880) 
	[  622.734787] [     T250254] el0t_64_sync_handler (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:958) 
	[  622.734790] [     T250254] el0t_64_sync (/home/user/Devel/linux-next/arch/arm64/kernel/entry.S:596) 
	[ 622.734792] [ T250254] Code: f2dffbe9 927abd08 cb141908 8b090114 (f9400688)
	All code
	========
	   0:	f2dffbe9 	movk	x9, #0xffdf, lsl #32
	   4:	927abd08 	and	x8, x8, #0x3fffffffffffc0
	   8:	cb141908 	sub	x8, x8, x20, lsl #6
	   c:	8b090114 	add	x20, x8, x9
	  10:*	f9400688 	ldr	x8, [x20, #8]		<-- trapping instruction

	Code starting with the faulting instruction
	===========================================
	   0:	f9400688 	ldr	x8, [x20, #8]
	[  622.734795] [ T250254] SMP: stopping secondary CPUs
	[  622.735089] [ T250254] Starting crashdump kernel...
	[  622.735091] [ T250254] Bye!
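
The Oops can be cross-checked against its own register dump. The sketch below (Python, added for illustration and not part of the original report) decodes the trapping word from the "Code:" line by hand as an A64 "LDR Xt, [Xn, #imm]" and confirms that the base register plus the offset matches the reported faulting address. It also checks that x20 equals x8 + x9 modulo 2^64, which is what an "add x20, x8, x9" just before the load would produce; the word 8b090114 on the "Code:" line appears to be that instruction. The function name is hypothetical, and only this one instruction form is handled.

```python
MASK64 = (1 << 64) - 1

# Register values copied from the Oops register dump.
X8 = 0x003FF3D37312F280
X9 = 0xFFFFFFDFC0000000
X20 = 0x003FF3B33312F280
FAULT_ADDR = 0x003FF3B33312F288
TRAPPING_WORD = 0xF9400688  # the word in parentheses on the "Code:" line


def decode_ldr_x_unsigned_offset(word: int) -> tuple[int, int, int]:
    """Decode an A64 'LDR Xt, [Xn, #imm]' word into (rt, rn, byte_offset)."""
    if word & 0xFFC00000 != 0xF9400000:
        raise ValueError("not a 64-bit LDR (unsigned offset)")
    rt = word & 0x1F
    rn = (word >> 5) & 0x1F
    imm12 = (word >> 10) & 0xFFF
    return rt, rn, imm12 * 8  # imm12 is scaled by the 8-byte access size


rt, rn, offset = decode_ldr_x_unsigned_offset(TRAPPING_WORD)
print(f"ldr x{rt}, [x{rn}, #{offset}]")
print("x8 + x9      =", hex((X8 + X9) & MASK64))
print("x20 + offset =", hex((X20 + offset) & MASK64))
print("fault addr   =", hex(FAULT_ADDR))
```

Both sums line up with the dump, so the load faulted on an address derived from the bogus pointer rather than on an unrelated access.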



* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-23 15:04 vhost: linux-next: crash at vhost_dev_cleanup() Breno Leitao
@ 2025-07-23 19:09 ` Michael S. Tsirkin
  2025-07-24  7:47 ` Michael S. Tsirkin
  1 sibling, 0 replies; 12+ messages in thread
From: Michael S. Tsirkin @ 2025-07-23 19:09 UTC
  To: Breno Leitao; +Cc: jasowang, eperezma, linux-arm-kernel, kvm

On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> Hello,
> 
> I've seen a crash in linux-next for a while on my arm64 server, and
> I decided to report.
> 
> While running stress-ng on linux-next, I see the crash below.
> 
> This is happening in a kernel configure with some debug options (KASAN,
> LOCKDEP and KMEMLEAK).


Thanks for the report!
Any chance of a bisect?
Much appreciated.

> Basically running stress-ng in a loop would crash the host in 15-20
> minutes:
> 	# while (true); do stress-ng -r 10 -t 10; done
> 	
> From the early warning "virt_to_phys used for non-linear address",
> I suppose corrupted data is at vq->nheads.
> 
> Here is the decoded stack against 9798752 ("Add linux-next specific
> files for 20250721")
> 
> 
> 	[  620.685144] [ T250731] VFIO - User Level meta-driver version: 0.3
> 	[  622.394448] [ T250254] ------------[ cut here ]------------
> 	[  622.413492] [ T250254] virt_to_phys used for non-linear address: 000000006e69fe64 (0xcfcecdcccbcac9c8)
> 	[  622.447771] [     T250254] WARNING: arch/arm64/mm/physaddr.c:15 at __virt_to_phys+0x64/0x90, CPU#57: stress-ng-dev/250254 
> 	[  622.487227] [ T250254] Modules linked in: vhost_vsock(E) vfio_iommu_type1(E) vfio(E) unix_diag(E) sch_fq(E) ghes_edac(E) tls(E) tcp_diag(E) inet_diag(E) act_gact(E) cls_bpf(E) nvidia_cspmu(E) ipmi_ssif(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) ipmi_devintf(E) coresight_stm(E) coresight_funnel(E) coresight_etm4x(E) coresight_tmc(E) stm_core(E) ipmi_msghandler(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) sm3_ce(E) sha3_ce(E) spi_tegra210_quad(E) vhost_net(E) tap(E) tun(E) vhost(E) vhost_iotlb(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) [last unloaded: test_bpf(E)]
> 	[  622.734524] [ T250254] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> 	[  622.734525] [ T250254] Hardware name: ...
> 	[  622.734526] [ T250254] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
> 	[  622.734529] [     T250254] pc : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?) 
> 	[  622.734531] [     T250254] lr : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?) 
> 	[  622.734533] [ T250254] sp : ffff800158e8fc60
> 	[  622.734534] [ T250254] x29: ffff800158e8fc60 x28: ffff0034a7cc7900 x27: 0000000000000000
> 	[  622.734537] [ T250254] x26: 0000000000000000 x25: ffff0034a7cc7900 x24: 00000000040e001f
> 	[  622.734539] [ T250254] x23: ffff0010858afb00 x22: cfcecdcccbcac9c8 x21: ffff0033526a01e0
> 	[  622.734541] [ T250254] x20: 0000000000008000 x19: ffcecdcccbcac9c8 x18: ffff80008149c8e4
> 	[  622.734543] [ T250254] x17: 0000000000000001 x16: 0000000000000000 x15: 0000000000000003
> 	[  622.734545] [ T250254] x14: ffff800082962e78 x13: 0000000000000003 x12: ffff003bc6231630
> 	[  622.734546] [ T250254] x11: 0000000000000000 x10: 0000000000000000 x9 : ed44a220ae716b00
> 	[  622.734548] [ T250254] x8 : 0001000000000000 x7 : 0720072007200720 x6 : ffff80008018710c
> 	[  622.734550] [ T250254] x5 : 0000000000000001 x4 : 00000090ecc72ac0 x3 : 0000000000000000
> 	[  622.734552] [ T250254] x2 : 0000000000000000 x1 : ffff800081a72bc6 x0 : 000000000000004f
> 	[  622.734554] [ T250254] Call trace:
> 	[  622.734555] [     T250254] __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?) (P)
> 	[  622.734557] [     T250254] kfree (/home/user/Devel/linux-next/./include/linux/mm.h:1180 /home/user/Devel/linux-next/mm/slub.c:4871) 
> 	[  622.734562] [     T250254] vhost_dev_cleanup (/home/user/Devel/linux-next/drivers/vhost/vhost.c:506 /home/user/Devel/linux-next/drivers/vhost/vhost.c:542 /home/user/Devel/linux-next/drivers/vhost/vhost.c:1214) vhost 
> 	[  622.734571] [     T250254] vhost_vsock_dev_release (/home/user/Devel/linux-next/drivers/vhost/vsock.c:756) vhost_vsock 
> 	[  622.734575] [     T250254] __fput (/home/user/Devel/linux-next/fs/file_table.c:469) 
> 	[  622.734578] [     T250254] fput_close_sync (/home/user/Devel/linux-next/fs/file_table.c:?) 
> 	[  622.734579] [     T250254] __arm64_sys_close (/home/user/Devel/linux-next/fs/open.c:1589 /home/user/Devel/linux-next/fs/open.c:1572 /home/user/Devel/linux-next/fs/open.c:1572) 
> 	[  622.734584] [     T250254] invoke_syscall (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:50) 
> 	[  622.734589] [     T250254] el0_svc_common (/home/user/Devel/linux-next/./include/linux/thread_info.h:135 /home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:140) 
> 	[  622.734591] [     T250254] do_el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:152) 
> 	[  622.734594] [     T250254] el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:169 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:182 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:880) 
> 	[  622.734600] [     T250254] el0t_64_sync_handler (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:958) 
> 	[  622.734603] [     T250254] el0t_64_sync (/home/user/Devel/linux-next/arch/arm64/kernel/entry.S:596) 
> 	[  622.734605] [ T250254] irq event stamp: 0
> 	[  622.734606] [     T250254] hardirqs last enabled at (0): 0x0 
> 	[  622.734610] [     T250254] hardirqs last disabled at (0): copy_process (/home/user/Devel/linux-next/kernel/fork.c:?) 
> 	[  622.734614] [     T250254] softirqs last enabled at (0): copy_process (/home/user/Devel/linux-next/kernel/fork.c:?) 
> 	[  622.734616] [     T250254] softirqs last disabled at (0): 0x0 
> 	[  622.734618] [ T250254] ---[ end trace 0000000000000000 ]---
> 	[  622.734697] [ T250254] Unable to handle kernel paging request at virtual address 003ff3b33312f288
> 	[  622.734700] [ T250254] Mem abort info:
> 	[  622.734701] [ T250254]   ESR = 0x0000000096000004
> 	[  622.734702] [ T250254]   EC = 0x25: DABT (current EL), IL = 32 bits
> 	[  622.734704] [ T250254]   SET = 0, FnV = 0
> 	[  622.734705] [ T250254]   EA = 0, S1PTW = 0
> 	[  622.734706] [ T250254]   FSC = 0x04: level 0 translation fault
> 	[  622.734708] [ T250254] Data abort info:
> 	[  622.734709] [ T250254]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> 	[  622.734711] [ T250254]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> 	[  622.734712] [ T250254]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> 	[  622.734713] [ T250254] [003ff3b33312f288] address between user and kernel address ranges
> 	[  622.734715] [ T250254] Internal error: Oops: 0000000096000004 [#1]  SMP
> 	[  622.734718] [ T250254] Modules linked in: vhost_vsock(E) vfio_iommu_type1(E) vfio(E) unix_diag(E) sch_fq(E) ghes_edac(E) tls(E) tcp_diag(E) inet_diag(E) act_gact(E) cls_bpf(E) nvidia_cspmu(E) ipmi_ssif(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) ipmi_devintf(E) coresight_stm(E) coresight_funnel(E) coresight_etm4x(E) coresight_tmc(E) stm_core(E) ipmi_msghandler(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) sm3_ce(E) sha3_ce(E) spi_tegra210_quad(E) vhost_net(E) tap(E) tun(E) vhost(E) vhost_iotlb(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) [last unloaded: test_bpf(E)]
> 	[  622.734740] [ T250254] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> 	[  622.734740] [ T250254] Hardware name: ...
> 	[  622.734741] [ T250254] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
> 	[  622.734742] [     T250254] pc : kfree (/home/user/Devel/linux-next/./include/linux/page-flags.h:284 /home/user/Devel/linux-next/./include/linux/mm.h:1182 /home/user/Devel/linux-next/mm/slub.c:4871) 
> 	[  622.734745] [     T250254] lr : kfree (/home/user/Devel/linux-next/./include/linux/mm.h:1180 /home/user/Devel/linux-next/mm/slub.c:4871) 
> 	[  622.734747] [ T250254] sp : ffff800158e8fc80
> 	[  622.734748] [ T250254] x29: ffff800158e8fc90 x28: ffff0034a7cc7900 x27: 0000000000000000
> 	[  622.734749] [ T250254] x26: 0000000000000000 x25: ffff0034a7cc7900 x24: 00000000040e001f
> 	[  622.734751] [ T250254] x23: ffff0010858afb00 x22: cfcecdcccbcac9c8 x21: ffff0033526a01e0
> 	[  622.734752] [ T250254] x20: 003ff3b33312f280 x19: ffff80000acd1a20 x18: ffff80008149c8e4
> 	[  622.734754] [ T250254] x17: 0000000000000001 x16: 0000000000000000 x15: 0000000000000003
> 	[  622.734755] [ T250254] x14: ffff800082962e78 x13: 0000000000000003 x12: ffff003bc6231630
> 	[  622.734757] [ T250254] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffdfc0000000
> 	[  622.734758] [ T250254] x8 : 003ff3d37312f280 x7 : 0720072007200720 x6 : ffff80008018710c
> 	[  622.734760] [ T250254] x5 : 0000000000000001 x4 : 00000090ecc72ac0 x3 : 0000000000000000
> 	[  622.734761] [ T250254] x2 : 0000000000000000 x1 : ffff800081a72bc6 x0 : ffcf4dcccbcac9c8
> 	[  622.734763] [ T250254] Call trace:
> 	[  622.734763] [     T250254] kfree (/home/user/Devel/linux-next/./include/linux/page-flags.h:284 /home/user/Devel/linux-next/./include/linux/mm.h:1182 /home/user/Devel/linux-next/mm/slub.c:4871) (P)
> 	[  622.734766] [     T250254] vhost_dev_cleanup (/home/user/Devel/linux-next/drivers/vhost/vhost.c:506 /home/user/Devel/linux-next/drivers/vhost/vhost.c:542 /home/user/Devel/linux-next/drivers/vhost/vhost.c:1214) vhost 
> 	[  622.734769] [     T250254] vhost_vsock_dev_release (/home/user/Devel/linux-next/drivers/vhost/vsock.c:756) vhost_vsock 
> 	[  622.734771] [     T250254] __fput (/home/user/Devel/linux-next/fs/file_table.c:469) 
> 	[  622.734772] [     T250254] fput_close_sync (/home/user/Devel/linux-next/fs/file_table.c:?) 
> 	[  622.734773] [     T250254] __arm64_sys_close (/home/user/Devel/linux-next/fs/open.c:1589 /home/user/Devel/linux-next/fs/open.c:1572 /home/user/Devel/linux-next/fs/open.c:1572) 
> 	[  622.734776] [     T250254] invoke_syscall (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:50) 
> 	[  622.734778] [     T250254] el0_svc_common (/home/user/Devel/linux-next/./include/linux/thread_info.h:135 /home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:140) 
> 	[  622.734781] [     T250254] do_el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:152) 
> 	[  622.734783] [     T250254] el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:169 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:182 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:880) 
> 	[  622.734787] [     T250254] el0t_64_sync_handler (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:958) 
> 	[  622.734790] [     T250254] el0t_64_sync (/home/user/Devel/linux-next/arch/arm64/kernel/entry.S:596) 
> 	[ 622.734792] [ T250254] Code: f2dffbe9 927abd08 cb141908 8b090114 (f9400688)
> 	All code
> 	========
> 	   0:	f2dffbe9 	movk	x9, #0xffdf, lsl #32
> 	   4:	927abd08 	and	x8, x8, #0x3fffffffffffc0
> 	   8:	cb141908 	sub	x8, x8, x20, lsl #6
> 	   c:	8b090114 	add	x20, x8, x9
> 	  10:*	f9400688 	ldr	x8, [x20, #8]		<-- trapping instruction
> 
> 	Code starting with the faulting instruction
> 	===========================================
> 	   0:	f9400688 	ldr	x8, [x20, #8]
> 	[  622.734795] [ T250254] SMP: stopping secondary CPUs
> 	[  622.735089] [ T250254] Starting crashdump kernel...
> 	[  622.735091] [ T250254] Bye!



* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-23 15:04 vhost: linux-next: crash at vhost_dev_cleanup() Breno Leitao
  2025-07-23 19:09 ` Michael S. Tsirkin
@ 2025-07-24  7:47 ` Michael S. Tsirkin
  2025-07-24  8:14   ` Stefano Garzarella
  1 sibling, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2025-07-24  7:47 UTC
  To: Breno Leitao
  Cc: jasowang, eperezma, linux-arm-kernel, kvm, Stefan Hajnoczi,
	Stefano Garzarella, netdev

On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> Hello,
> 
> I've seen a crash in linux-next for a while on my arm64 server, and
> I decided to report.
> 
> While running stress-ng on linux-next, I see the crash below.
> 
> This is happening in a kernel configure with some debug options (KASAN,
> LOCKDEP and KMEMLEAK).
> 
> Basically running stress-ng in a loop would crash the host in 15-20
> minutes:
> 	# while (true); do stress-ng -r 10 -t 10; done
> 	
> From the early warning "virt_to_phys used for non-linear address",
> I suppose corrupted data is at vq->nheads.
> 
> Here is the decoded stack against 9798752 ("Add linux-next specific
> files for 20250721")
> 
> 
> 	[  620.685144] [ T250731] VFIO - User Level meta-driver version: 0.3
> 	[  622.394448] [ T250254] ------------[ cut here ]------------
> 	[  622.413492] [ T250254] virt_to_phys used for non-linear address: 000000006e69fe64 (0xcfcecdcccbcac9c8)
> 	[  622.447771] [     T250254] WARNING: arch/arm64/mm/physaddr.c:15 at __virt_to_phys+0x64/0x90, CPU#57: stress-ng-dev/250254 
> 	[  622.487227] [ T250254] Modules linked in: vhost_vsock(E) vfio_iommu_type1(E) vfio(E) unix_diag(E) sch_fq(E) ghes_edac(E) tls(E) tcp_diag(E) inet_diag(E) act_gact(E) cls_bpf(E) nvidia_cspmu(E) ipmi_ssif(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) ipmi_devintf(E) coresight_stm(E) coresight_funnel(E) coresight_etm4x(E) coresight_tmc(E) stm_core(E) ipmi_msghandler(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) sm3_ce(E) sha3_ce(E) spi_tegra210_quad(E) vhost_net(E) tap(E) tun(E) vhost(E) vhost_iotlb(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) [last unloaded: test_bpf(E)]
> 	[  622.734524] [ T250254] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> 	[  622.734525] [ T250254] Hardware name: ...
> 	[  622.734526] [ T250254] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
> 	[  622.734529] [     T250254] pc : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?) 
> 	[  622.734531] [     T250254] lr : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?) 
> 	[  622.734533] [ T250254] sp : ffff800158e8fc60
> 	[  622.734534] [ T250254] x29: ffff800158e8fc60 x28: ffff0034a7cc7900 x27: 0000000000000000
> 	[  622.734537] [ T250254] x26: 0000000000000000 x25: ffff0034a7cc7900 x24: 00000000040e001f
> 	[  622.734539] [ T250254] x23: ffff0010858afb00 x22: cfcecdcccbcac9c8 x21: ffff0033526a01e0
> 	[  622.734541] [ T250254] x20: 0000000000008000 x19: ffcecdcccbcac9c8 x18: ffff80008149c8e4
> 	[  622.734543] [ T250254] x17: 0000000000000001 x16: 0000000000000000 x15: 0000000000000003
> 	[  622.734545] [ T250254] x14: ffff800082962e78 x13: 0000000000000003 x12: ffff003bc6231630
> 	[  622.734546] [ T250254] x11: 0000000000000000 x10: 0000000000000000 x9 : ed44a220ae716b00
> 	[  622.734548] [ T250254] x8 : 0001000000000000 x7 : 0720072007200720 x6 : ffff80008018710c
> 	[  622.734550] [ T250254] x5 : 0000000000000001 x4 : 00000090ecc72ac0 x3 : 0000000000000000
> 	[  622.734552] [ T250254] x2 : 0000000000000000 x1 : ffff800081a72bc6 x0 : 000000000000004f
> 	[  622.734554] [ T250254] Call trace:
> 	[  622.734555] [     T250254] __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?) (P)
> 	[  622.734557] [     T250254] kfree (/home/user/Devel/linux-next/./include/linux/mm.h:1180 /home/user/Devel/linux-next/mm/slub.c:4871) 
> 	[  622.734562] [     T250254] vhost_dev_cleanup (/home/user/Devel/linux-next/drivers/vhost/vhost.c:506 /home/user/Devel/linux-next/drivers/vhost/vhost.c:542 /home/user/Devel/linux-next/drivers/vhost/vhost.c:1214) vhost 
> 	[  622.734571] [     T250254] vhost_vsock_dev_release (/home/user/Devel/linux-next/drivers/vhost/vsock.c:756) vhost_vsock 


Cc'ing more vsock maintainers.




> 	[  622.734575] [     T250254] __fput (/home/user/Devel/linux-next/fs/file_table.c:469) 
> 	[  622.734578] [     T250254] fput_close_sync (/home/user/Devel/linux-next/fs/file_table.c:?) 
> 	[  622.734579] [     T250254] __arm64_sys_close (/home/user/Devel/linux-next/fs/open.c:1589 /home/user/Devel/linux-next/fs/open.c:1572 /home/user/Devel/linux-next/fs/open.c:1572) 
> 	[  622.734584] [     T250254] invoke_syscall (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:50) 
> 	[  622.734589] [     T250254] el0_svc_common (/home/user/Devel/linux-next/./include/linux/thread_info.h:135 /home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:140) 
> 	[  622.734591] [     T250254] do_el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:152) 
> 	[  622.734594] [     T250254] el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:169 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:182 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:880) 
> 	[  622.734600] [     T250254] el0t_64_sync_handler (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:958) 
> 	[  622.734603] [     T250254] el0t_64_sync (/home/user/Devel/linux-next/arch/arm64/kernel/entry.S:596) 
> 	[  622.734605] [ T250254] irq event stamp: 0
> 	[  622.734606] [     T250254] hardirqs last enabled at (0): 0x0 
> 	[  622.734610] [     T250254] hardirqs last disabled at (0): copy_process (/home/user/Devel/linux-next/kernel/fork.c:?) 
> 	[  622.734614] [     T250254] softirqs last enabled at (0): copy_process (/home/user/Devel/linux-next/kernel/fork.c:?) 
> 	[  622.734616] [     T250254] softirqs last disabled at (0): 0x0 
> 	[  622.734618] [ T250254] ---[ end trace 0000000000000000 ]---
> 	[  622.734697] [ T250254] Unable to handle kernel paging request at virtual address 003ff3b33312f288
> 	[  622.734700] [ T250254] Mem abort info:
> 	[  622.734701] [ T250254]   ESR = 0x0000000096000004
> 	[  622.734702] [ T250254]   EC = 0x25: DABT (current EL), IL = 32 bits
> 	[  622.734704] [ T250254]   SET = 0, FnV = 0
> 	[  622.734705] [ T250254]   EA = 0, S1PTW = 0
> 	[  622.734706] [ T250254]   FSC = 0x04: level 0 translation fault
> 	[  622.734708] [ T250254] Data abort info:
> 	[  622.734709] [ T250254]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> 	[  622.734711] [ T250254]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> 	[  622.734712] [ T250254]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> 	[  622.734713] [ T250254] [003ff3b33312f288] address between user and kernel address ranges
> 	[  622.734715] [ T250254] Internal error: Oops: 0000000096000004 [#1]  SMP
> 	[  622.734718] [ T250254] Modules linked in: vhost_vsock(E) vfio_iommu_type1(E) vfio(E) unix_diag(E) sch_fq(E) ghes_edac(E) tls(E) tcp_diag(E) inet_diag(E) act_gact(E) cls_bpf(E) nvidia_cspmu(E) ipmi_ssif(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) ipmi_devintf(E) coresight_stm(E) coresight_funnel(E) coresight_etm4x(E) coresight_tmc(E) stm_core(E) ipmi_msghandler(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) sm3_ce(E) sha3_ce(E) spi_tegra210_quad(E) vhost_net(E) tap(E) tun(E) vhost(E) vhost_iotlb(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) [last unloaded: test_bpf(E)]
> 	[  622.734740] [ T250254] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> 	[  622.734740] [ T250254] Hardware name: ...
> 	[  622.734741] [ T250254] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
> 	[  622.734742] [     T250254] pc : kfree (/home/user/Devel/linux-next/./include/linux/page-flags.h:284 /home/user/Devel/linux-next/./include/linux/mm.h:1182 /home/user/Devel/linux-next/mm/slub.c:4871) 
> 	[  622.734745] [     T250254] lr : kfree (/home/user/Devel/linux-next/./include/linux/mm.h:1180 /home/user/Devel/linux-next/mm/slub.c:4871) 
> 	[  622.734747] [ T250254] sp : ffff800158e8fc80
> 	[  622.734748] [ T250254] x29: ffff800158e8fc90 x28: ffff0034a7cc7900 x27: 0000000000000000
> 	[  622.734749] [ T250254] x26: 0000000000000000 x25: ffff0034a7cc7900 x24: 00000000040e001f
> 	[  622.734751] [ T250254] x23: ffff0010858afb00 x22: cfcecdcccbcac9c8 x21: ffff0033526a01e0
> 	[  622.734752] [ T250254] x20: 003ff3b33312f280 x19: ffff80000acd1a20 x18: ffff80008149c8e4
> 	[  622.734754] [ T250254] x17: 0000000000000001 x16: 0000000000000000 x15: 0000000000000003
> 	[  622.734755] [ T250254] x14: ffff800082962e78 x13: 0000000000000003 x12: ffff003bc6231630
> 	[  622.734757] [ T250254] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffdfc0000000
> 	[  622.734758] [ T250254] x8 : 003ff3d37312f280 x7 : 0720072007200720 x6 : ffff80008018710c
> 	[  622.734760] [ T250254] x5 : 0000000000000001 x4 : 00000090ecc72ac0 x3 : 0000000000000000
> 	[  622.734761] [ T250254] x2 : 0000000000000000 x1 : ffff800081a72bc6 x0 : ffcf4dcccbcac9c8
> 	[  622.734763] [ T250254] Call trace:
> 	[  622.734763] [     T250254] kfree (/home/user/Devel/linux-next/./include/linux/page-flags.h:284 /home/user/Devel/linux-next/./include/linux/mm.h:1182 /home/user/Devel/linux-next/mm/slub.c:4871) (P)
> 	[  622.734766] [     T250254] vhost_dev_cleanup (/home/user/Devel/linux-next/drivers/vhost/vhost.c:506 /home/user/Devel/linux-next/drivers/vhost/vhost.c:542 /home/user/Devel/linux-next/drivers/vhost/vhost.c:1214) vhost 
> 	[  622.734769] [     T250254] vhost_vsock_dev_release (/home/user/Devel/linux-next/drivers/vhost/vsock.c:756) vhost_vsock 
> 	[  622.734771] [     T250254] __fput (/home/user/Devel/linux-next/fs/file_table.c:469) 
> 	[  622.734772] [     T250254] fput_close_sync (/home/user/Devel/linux-next/fs/file_table.c:?) 
> 	[  622.734773] [     T250254] __arm64_sys_close (/home/user/Devel/linux-next/fs/open.c:1589 /home/user/Devel/linux-next/fs/open.c:1572 /home/user/Devel/linux-next/fs/open.c:1572) 
> 	[  622.734776] [     T250254] invoke_syscall (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:50) 
> 	[  622.734778] [     T250254] el0_svc_common (/home/user/Devel/linux-next/./include/linux/thread_info.h:135 /home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:140) 
> 	[  622.734781] [     T250254] do_el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:152) 
> 	[  622.734783] [     T250254] el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:169 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:182 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:880) 
> 	[  622.734787] [     T250254] el0t_64_sync_handler (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:958) 
> 	[  622.734790] [     T250254] el0t_64_sync (/home/user/Devel/linux-next/arch/arm64/kernel/entry.S:596) 
> 	[ 622.734792] [ T250254] Code: f2dffbe9 927abd08 cb141908 8b090114 (f9400688)
> 	All code
> 	========
> 	0:*	e9 fb df f2 08       	jmp    0x8f2e000		<-- trapping instruction
> 	5:	bd 7a 92 08 19       	mov    $0x1908927a,%ebp
> 	a:	14 cb                	adc    $0xcb,%al
> 	c:	14 01                	adc    $0x1,%al
> 	e:	09 8b 88 06 40 f9    	or     %ecx,-0x6bff978(%rbx)
> 
> 	Code starting with the faulting instruction
> 	===========================================
> 	0:	88 06                	mov    %al,(%rsi)
> 	2:	40 f9                	rex stc 
> 	[  622.734795] [ T250254] SMP: stopping secondary CPUs
> 	[  622.735089] [ T250254] Starting crashdump kernel...
> 	[  622.735091] [ T250254] Bye!



* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-24  7:47 ` Michael S. Tsirkin
@ 2025-07-24  8:14   ` Stefano Garzarella
  2025-07-24  8:22     ` Michael S. Tsirkin
  0 siblings, 1 reply; 12+ messages in thread
From: Stefano Garzarella @ 2025-07-24  8:14 UTC (permalink / raw)
  To: Michael S. Tsirkin, Will Deacon
  Cc: Breno Leitao, jasowang, eperezma, linux-arm-kernel, kvm,
	Stefan Hajnoczi, netdev

CCing Will

On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> > Hello,
> >
> > I've seen a crash in linux-next for a while on my arm64 server, and
> > I decided to report.
> >
> > While running stress-ng on linux-next, I see the crash below.
> >
> > This is happening in a kernel configured with some debug options (KASAN,
> > LOCKDEP and KMEMLEAK).
> >
> > Basically running stress-ng in a loop would crash the host in 15-20
> > minutes:
> >       # while (true); do stress-ng -r 10 -t 10; done
> >
> > From the early warning "virt_to_phys used for non-linear address",

Hmm, we recently added non-linear SKB support to vhost-vsock [1].
@Will, could this issue be related?

I checked the next-20250721 tag and can confirm that it contains those changes.

[1] https://lore.kernel.org/virtualization/20250717090116.11987-1-will@kernel.org/

Thanks,
Stefano

> > I suppose corrupted data is at vq->nheads.
> >
> > Here is the decoded stack against 9798752 ("Add linux-next specific
> > files for 20250721")
> >
> >
> >       [  620.685144] [ T250731] VFIO - User Level meta-driver version: 0.3
> >       [  622.394448] [ T250254] ------------[ cut here ]------------
> >       [  622.413492] [ T250254] virt_to_phys used for non-linear address: 000000006e69fe64 (0xcfcecdcccbcac9c8)
> >       [  622.447771] [     T250254] WARNING: arch/arm64/mm/physaddr.c:15 at __virt_to_phys+0x64/0x90, CPU#57: stress-ng-dev/250254
> >       [  622.487227] [ T250254] Modules linked in: vhost_vsock(E) vfio_iommu_type1(E) vfio(E) unix_diag(E) sch_fq(E) ghes_edac(E) tls(E) tcp_diag(E) inet_diag(E) act_gact(E) cls_bpf(E) nvidia_cspmu(E) ipmi_ssif(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) ipmi_devintf(E) coresight_stm(E) coresight_funnel(E) coresight_etm4x(E) coresight_tmc(E) stm_core(E) ipmi_msghandler(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) sm3_ce(E) sha3_ce(E) spi_tegra210_quad(E) vhost_net(E) tap(E) tun(E) vhost(E) vhost_iotlb(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) [last unloaded: test_bpf(E)]
> >       [  622.734524] [ T250254] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> >       [  622.734525] [ T250254] Hardware name: ...
> >       [  622.734526] [ T250254] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
> >       [  622.734529] [     T250254] pc : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?)
> >       [  622.734531] [     T250254] lr : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?)
> >       [  622.734533] [ T250254] sp : ffff800158e8fc60
> >       [  622.734534] [ T250254] x29: ffff800158e8fc60 x28: ffff0034a7cc7900 x27: 0000000000000000
> >       [  622.734537] [ T250254] x26: 0000000000000000 x25: ffff0034a7cc7900 x24: 00000000040e001f
> >       [  622.734539] [ T250254] x23: ffff0010858afb00 x22: cfcecdcccbcac9c8 x21: ffff0033526a01e0
> >       [  622.734541] [ T250254] x20: 0000000000008000 x19: ffcecdcccbcac9c8 x18: ffff80008149c8e4
> >       [  622.734543] [ T250254] x17: 0000000000000001 x16: 0000000000000000 x15: 0000000000000003
> >       [  622.734545] [ T250254] x14: ffff800082962e78 x13: 0000000000000003 x12: ffff003bc6231630
> >       [  622.734546] [ T250254] x11: 0000000000000000 x10: 0000000000000000 x9 : ed44a220ae716b00
> >       [  622.734548] [ T250254] x8 : 0001000000000000 x7 : 0720072007200720 x6 : ffff80008018710c
> >       [  622.734550] [ T250254] x5 : 0000000000000001 x4 : 00000090ecc72ac0 x3 : 0000000000000000
> >       [  622.734552] [ T250254] x2 : 0000000000000000 x1 : ffff800081a72bc6 x0 : 000000000000004f
> >       [  622.734554] [ T250254] Call trace:
> >       [  622.734555] [     T250254] __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?) (P)
> >       [  622.734557] [     T250254] kfree (/home/user/Devel/linux-next/./include/linux/mm.h:1180 /home/user/Devel/linux-next/mm/slub.c:4871)
> >       [  622.734562] [     T250254] vhost_dev_cleanup (/home/user/Devel/linux-next/drivers/vhost/vhost.c:506 /home/user/Devel/linux-next/drivers/vhost/vhost.c:542 /home/user/Devel/linux-next/drivers/vhost/vhost.c:1214) vhost
> >       [  622.734571] [     T250254] vhost_vsock_dev_release (/home/user/Devel/linux-next/drivers/vhost/vsock.c:756) vhost_vsock
>
>
> Cc more vsock maintainers.
>
>
>
>
> >       [  622.734575] [     T250254] __fput (/home/user/Devel/linux-next/fs/file_table.c:469)
> >       [  622.734578] [     T250254] fput_close_sync (/home/user/Devel/linux-next/fs/file_table.c:?)
> >       [  622.734579] [     T250254] __arm64_sys_close (/home/user/Devel/linux-next/fs/open.c:1589 /home/user/Devel/linux-next/fs/open.c:1572 /home/user/Devel/linux-next/fs/open.c:1572)
> >       [  622.734584] [     T250254] invoke_syscall (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:50)
> >       [  622.734589] [     T250254] el0_svc_common (/home/user/Devel/linux-next/./include/linux/thread_info.h:135 /home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:140)
> >       [  622.734591] [     T250254] do_el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:152)
> >       [  622.734594] [     T250254] el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:169 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:182 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:880)
> >       [  622.734600] [     T250254] el0t_64_sync_handler (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:958)
> >       [  622.734603] [     T250254] el0t_64_sync (/home/user/Devel/linux-next/arch/arm64/kernel/entry.S:596)
> >       [  622.734605] [ T250254] irq event stamp: 0
> >       [  622.734606] [     T250254] hardirqs last enabled at (0): 0x0
> >       [  622.734610] [     T250254] hardirqs last disabled at (0): copy_process (/home/user/Devel/linux-next/kernel/fork.c:?)
> >       [  622.734614] [     T250254] softirqs last enabled at (0): copy_process (/home/user/Devel/linux-next/kernel/fork.c:?)
> >       [  622.734616] [     T250254] softirqs last disabled at (0): 0x0
> >       [  622.734618] [ T250254] ---[ end trace 0000000000000000 ]---
> >       [  622.734697] [ T250254] Unable to handle kernel paging request at virtual address 003ff3b33312f288
> >       [  622.734700] [ T250254] Mem abort info:
> >       [  622.734701] [ T250254]   ESR = 0x0000000096000004
> >       [  622.734702] [ T250254]   EC = 0x25: DABT (current EL), IL = 32 bits
> >       [  622.734704] [ T250254]   SET = 0, FnV = 0
> >       [  622.734705] [ T250254]   EA = 0, S1PTW = 0
> >       [  622.734706] [ T250254]   FSC = 0x04: level 0 translation fault
> >       [  622.734708] [ T250254] Data abort info:
> >       [  622.734709] [ T250254]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> >       [  622.734711] [ T250254]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> >       [  622.734712] [ T250254]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> >       [  622.734713] [ T250254] [003ff3b33312f288] address between user and kernel address ranges
> >       [  622.734715] [ T250254] Internal error: Oops: 0000000096000004 [#1]  SMP
> >       [  622.734718] [ T250254] Modules linked in: vhost_vsock(E) vfio_iommu_type1(E) vfio(E) unix_diag(E) sch_fq(E) ghes_edac(E) tls(E) tcp_diag(E) inet_diag(E) act_gact(E) cls_bpf(E) nvidia_cspmu(E) ipmi_ssif(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) ipmi_devintf(E) coresight_stm(E) coresight_funnel(E) coresight_etm4x(E) coresight_tmc(E) stm_core(E) ipmi_msghandler(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) sm3_ce(E) sha3_ce(E) spi_tegra210_quad(E) vhost_net(E) tap(E) tun(E) vhost(E) vhost_iotlb(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) [last unloaded: test_bpf(E)]
> >       [  622.734740] [ T250254] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> >       [  622.734740] [ T250254] Hardware name: ...
> >       [  622.734741] [ T250254] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
> >       [  622.734742] [     T250254] pc : kfree (/home/user/Devel/linux-next/./include/linux/page-flags.h:284 /home/user/Devel/linux-next/./include/linux/mm.h:1182 /home/user/Devel/linux-next/mm/slub.c:4871)
> >       [  622.734745] [     T250254] lr : kfree (/home/user/Devel/linux-next/./include/linux/mm.h:1180 /home/user/Devel/linux-next/mm/slub.c:4871)
> >       [  622.734747] [ T250254] sp : ffff800158e8fc80
> >       [  622.734748] [ T250254] x29: ffff800158e8fc90 x28: ffff0034a7cc7900 x27: 0000000000000000
> >       [  622.734749] [ T250254] x26: 0000000000000000 x25: ffff0034a7cc7900 x24: 00000000040e001f
> >       [  622.734751] [ T250254] x23: ffff0010858afb00 x22: cfcecdcccbcac9c8 x21: ffff0033526a01e0
> >       [  622.734752] [ T250254] x20: 003ff3b33312f280 x19: ffff80000acd1a20 x18: ffff80008149c8e4
> >       [  622.734754] [ T250254] x17: 0000000000000001 x16: 0000000000000000 x15: 0000000000000003
> >       [  622.734755] [ T250254] x14: ffff800082962e78 x13: 0000000000000003 x12: ffff003bc6231630
> >       [  622.734757] [ T250254] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffdfc0000000
> >       [  622.734758] [ T250254] x8 : 003ff3d37312f280 x7 : 0720072007200720 x6 : ffff80008018710c
> >       [  622.734760] [ T250254] x5 : 0000000000000001 x4 : 00000090ecc72ac0 x3 : 0000000000000000
> >       [  622.734761] [ T250254] x2 : 0000000000000000 x1 : ffff800081a72bc6 x0 : ffcf4dcccbcac9c8
> >       [  622.734763] [ T250254] Call trace:
> >       [  622.734763] [     T250254] kfree (/home/user/Devel/linux-next/./include/linux/page-flags.h:284 /home/user/Devel/linux-next/./include/linux/mm.h:1182 /home/user/Devel/linux-next/mm/slub.c:4871) (P)
> >       [  622.734766] [     T250254] vhost_dev_cleanup (/home/user/Devel/linux-next/drivers/vhost/vhost.c:506 /home/user/Devel/linux-next/drivers/vhost/vhost.c:542 /home/user/Devel/linux-next/drivers/vhost/vhost.c:1214) vhost
> >       [  622.734769] [     T250254] vhost_vsock_dev_release (/home/user/Devel/linux-next/drivers/vhost/vsock.c:756) vhost_vsock
> >       [  622.734771] [     T250254] __fput (/home/user/Devel/linux-next/fs/file_table.c:469)
> >       [  622.734772] [     T250254] fput_close_sync (/home/user/Devel/linux-next/fs/file_table.c:?)
> >       [  622.734773] [     T250254] __arm64_sys_close (/home/user/Devel/linux-next/fs/open.c:1589 /home/user/Devel/linux-next/fs/open.c:1572 /home/user/Devel/linux-next/fs/open.c:1572)
> >       [  622.734776] [     T250254] invoke_syscall (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:50)
> >       [  622.734778] [     T250254] el0_svc_common (/home/user/Devel/linux-next/./include/linux/thread_info.h:135 /home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:140)
> >       [  622.734781] [     T250254] do_el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:152)
> >       [  622.734783] [     T250254] el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:169 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:182 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:880)
> >       [  622.734787] [     T250254] el0t_64_sync_handler (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:958)
> >       [  622.734790] [     T250254] el0t_64_sync (/home/user/Devel/linux-next/arch/arm64/kernel/entry.S:596)
> >       [ 622.734792] [ T250254] Code: f2dffbe9 927abd08 cb141908 8b090114 (f9400688)
> >       All code
> >       ========
> >       0:*     e9 fb df f2 08          jmp    0x8f2e000                <-- trapping instruction
> >       5:      bd 7a 92 08 19          mov    $0x1908927a,%ebp
> >       a:      14 cb                   adc    $0xcb,%al
> >       c:      14 01                   adc    $0x1,%al
> >       e:      09 8b 88 06 40 f9       or     %ecx,-0x6bff978(%rbx)
> >
> >       Code starting with the faulting instruction
> >       ===========================================
> >       0:      88 06                   mov    %al,(%rsi)
> >       2:      40 f9                   rex stc
> >       [  622.734795] [ T250254] SMP: stopping secondary CPUs
> >       [  622.735089] [ T250254] Starting crashdump kernel...
> >       [  622.735091] [ T250254] Bye!
>



* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-24  8:14   ` Stefano Garzarella
@ 2025-07-24  8:22     ` Michael S. Tsirkin
  2025-07-24  8:44       ` Will Deacon
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2025-07-24  8:22 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Will Deacon, Breno Leitao, jasowang, eperezma, linux-arm-kernel,
	kvm, Stefan Hajnoczi, netdev

On Thu, Jul 24, 2025 at 10:14:36AM +0200, Stefano Garzarella wrote:
> CCing Will
> 
> On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> > > Hello,
> > >
> > > I've seen a crash in linux-next for a while on my arm64 server, and
> > > I decided to report.
> > >
> > > While running stress-ng on linux-next, I see the crash below.
> > >
> > > This is happening in a kernel configured with some debug options (KASAN,
> > > LOCKDEP and KMEMLEAK).
> > >
> > > Basically running stress-ng in a loop would crash the host in 15-20
> > > minutes:
> > >       # while (true); do stress-ng -r 10 -t 10; done
> > >
> > > From the early warning "virt_to_phys used for non-linear address",
> 
> Hmm, we recently added non-linear SKB support to vhost-vsock [1].
> @Will, could this issue be related?

Good point.

Breno, if bisecting is too much trouble, would you mind testing the commits
c76f3c4364fe523cd2782269eab92529c86217aa
and
c7991b44d7b44f9270dec63acd0b2965d29aab43
and telling us if this reproduces?


> I checked the next-20250721 tag and can confirm that it contains those changes.
> 
> [1] https://lore.kernel.org/virtualization/20250717090116.11987-1-will@kernel.org/
> 
> Thanks,
> Stefano
> 
> > > I suppose corrupted data is at vq->nheads.
> > >
> > > Here is the decoded stack against 9798752 ("Add linux-next specific
> > > files for 20250721")
> > >
> > >
> > >       [  620.685144] [ T250731] VFIO - User Level meta-driver version: 0.3
> > >       [  622.394448] [ T250254] ------------[ cut here ]------------
> > >       [  622.413492] [ T250254] virt_to_phys used for non-linear address: 000000006e69fe64 (0xcfcecdcccbcac9c8)
> > >       [  622.447771] [     T250254] WARNING: arch/arm64/mm/physaddr.c:15 at __virt_to_phys+0x64/0x90, CPU#57: stress-ng-dev/250254
> > >       [  622.487227] [ T250254] Modules linked in: vhost_vsock(E) vfio_iommu_type1(E) vfio(E) unix_diag(E) sch_fq(E) ghes_edac(E) tls(E) tcp_diag(E) inet_diag(E) act_gact(E) cls_bpf(E) nvidia_cspmu(E) ipmi_ssif(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) ipmi_devintf(E) coresight_stm(E) coresight_funnel(E) coresight_etm4x(E) coresight_tmc(E) stm_core(E) ipmi_msghandler(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) sm3_ce(E) sha3_ce(E) spi_tegra210_quad(E) vhost_net(E) tap(E) tun(E) vhost(E) vhost_iotlb(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) [last unloaded: test_bpf(E)]
> > >       [  622.734524] [ T250254] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> > >       [  622.734525] [ T250254] Hardware name: ...
> > >       [  622.734526] [ T250254] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
> > >       [  622.734529] [     T250254] pc : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?)
> > >       [  622.734531] [     T250254] lr : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?)
> > >       [  622.734533] [ T250254] sp : ffff800158e8fc60
> > >       [  622.734534] [ T250254] x29: ffff800158e8fc60 x28: ffff0034a7cc7900 x27: 0000000000000000
> > >       [  622.734537] [ T250254] x26: 0000000000000000 x25: ffff0034a7cc7900 x24: 00000000040e001f
> > >       [  622.734539] [ T250254] x23: ffff0010858afb00 x22: cfcecdcccbcac9c8 x21: ffff0033526a01e0
> > >       [  622.734541] [ T250254] x20: 0000000000008000 x19: ffcecdcccbcac9c8 x18: ffff80008149c8e4
> > >       [  622.734543] [ T250254] x17: 0000000000000001 x16: 0000000000000000 x15: 0000000000000003
> > >       [  622.734545] [ T250254] x14: ffff800082962e78 x13: 0000000000000003 x12: ffff003bc6231630
> > >       [  622.734546] [ T250254] x11: 0000000000000000 x10: 0000000000000000 x9 : ed44a220ae716b00
> > >       [  622.734548] [ T250254] x8 : 0001000000000000 x7 : 0720072007200720 x6 : ffff80008018710c
> > >       [  622.734550] [ T250254] x5 : 0000000000000001 x4 : 00000090ecc72ac0 x3 : 0000000000000000
> > >       [  622.734552] [ T250254] x2 : 0000000000000000 x1 : ffff800081a72bc6 x0 : 000000000000004f
> > >       [  622.734554] [ T250254] Call trace:
> > >       [  622.734555] [     T250254] __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?) (P)
> > >       [  622.734557] [     T250254] kfree (/home/user/Devel/linux-next/./include/linux/mm.h:1180 /home/user/Devel/linux-next/mm/slub.c:4871)
> > >       [  622.734562] [     T250254] vhost_dev_cleanup (/home/user/Devel/linux-next/drivers/vhost/vhost.c:506 /home/user/Devel/linux-next/drivers/vhost/vhost.c:542 /home/user/Devel/linux-next/drivers/vhost/vhost.c:1214) vhost
> > >       [  622.734571] [     T250254] vhost_vsock_dev_release (/home/user/Devel/linux-next/drivers/vhost/vsock.c:756) vhost_vsock
> >
> >
> > Cc more vsock maintainers.
> >
> >
> >
> >
> > >       [  622.734575] [     T250254] __fput (/home/user/Devel/linux-next/fs/file_table.c:469)
> > >       [  622.734578] [     T250254] fput_close_sync (/home/user/Devel/linux-next/fs/file_table.c:?)
> > >       [  622.734579] [     T250254] __arm64_sys_close (/home/user/Devel/linux-next/fs/open.c:1589 /home/user/Devel/linux-next/fs/open.c:1572 /home/user/Devel/linux-next/fs/open.c:1572)
> > >       [  622.734584] [     T250254] invoke_syscall (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:50)
> > >       [  622.734589] [     T250254] el0_svc_common (/home/user/Devel/linux-next/./include/linux/thread_info.h:135 /home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:140)
> > >       [  622.734591] [     T250254] do_el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:152)
> > >       [  622.734594] [     T250254] el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:169 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:182 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:880)
> > >       [  622.734600] [     T250254] el0t_64_sync_handler (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:958)
> > >       [  622.734603] [     T250254] el0t_64_sync (/home/user/Devel/linux-next/arch/arm64/kernel/entry.S:596)
> > >       [  622.734605] [ T250254] irq event stamp: 0
> > >       [  622.734606] [     T250254] hardirqs last enabled at (0): 0x0
> > >       [  622.734610] [     T250254] hardirqs last disabled at (0): copy_process (/home/user/Devel/linux-next/kernel/fork.c:?)
> > >       [  622.734614] [     T250254] softirqs last enabled at (0): copy_process (/home/user/Devel/linux-next/kernel/fork.c:?)
> > >       [  622.734616] [     T250254] softirqs last disabled at (0): 0x0
> > >       [  622.734618] [ T250254] ---[ end trace 0000000000000000 ]---
> > >       [  622.734697] [ T250254] Unable to handle kernel paging request at virtual address 003ff3b33312f288
> > >       [  622.734700] [ T250254] Mem abort info:
> > >       [  622.734701] [ T250254]   ESR = 0x0000000096000004
> > >       [  622.734702] [ T250254]   EC = 0x25: DABT (current EL), IL = 32 bits
> > >       [  622.734704] [ T250254]   SET = 0, FnV = 0
> > >       [  622.734705] [ T250254]   EA = 0, S1PTW = 0
> > >       [  622.734706] [ T250254]   FSC = 0x04: level 0 translation fault
> > >       [  622.734708] [ T250254] Data abort info:
> > >       [  622.734709] [ T250254]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> > >       [  622.734711] [ T250254]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > >       [  622.734712] [ T250254]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > >       [  622.734713] [ T250254] [003ff3b33312f288] address between user and kernel address ranges
> > >       [  622.734715] [ T250254] Internal error: Oops: 0000000096000004 [#1]  SMP
> > >       [  622.734718] [ T250254] Modules linked in: vhost_vsock(E) vfio_iommu_type1(E) vfio(E) unix_diag(E) sch_fq(E) ghes_edac(E) tls(E) tcp_diag(E) inet_diag(E) act_gact(E) cls_bpf(E) nvidia_cspmu(E) ipmi_ssif(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) ipmi_devintf(E) coresight_stm(E) coresight_funnel(E) coresight_etm4x(E) coresight_tmc(E) stm_core(E) ipmi_msghandler(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) sm3_ce(E) sha3_ce(E) spi_tegra210_quad(E) vhost_net(E) tap(E) tun(E) vhost(E) vhost_iotlb(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) [last unloaded: test_bpf(E)]
> > >       [  622.734740] [ T250254] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> > >       [  622.734740] [ T250254] Hardware name: ...
> > >       [  622.734741] [ T250254] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
> > >       [  622.734742] [     T250254] pc : kfree (/home/user/Devel/linux-next/./include/linux/page-flags.h:284 /home/user/Devel/linux-next/./include/linux/mm.h:1182 /home/user/Devel/linux-next/mm/slub.c:4871)
> > >       [  622.734745] [     T250254] lr : kfree (/home/user/Devel/linux-next/./include/linux/mm.h:1180 /home/user/Devel/linux-next/mm/slub.c:4871)
> > >       [  622.734747] [ T250254] sp : ffff800158e8fc80
> > >       [  622.734748] [ T250254] x29: ffff800158e8fc90 x28: ffff0034a7cc7900 x27: 0000000000000000
> > >       [  622.734749] [ T250254] x26: 0000000000000000 x25: ffff0034a7cc7900 x24: 00000000040e001f
> > >       [  622.734751] [ T250254] x23: ffff0010858afb00 x22: cfcecdcccbcac9c8 x21: ffff0033526a01e0
> > >       [  622.734752] [ T250254] x20: 003ff3b33312f280 x19: ffff80000acd1a20 x18: ffff80008149c8e4
> > >       [  622.734754] [ T250254] x17: 0000000000000001 x16: 0000000000000000 x15: 0000000000000003
> > >       [  622.734755] [ T250254] x14: ffff800082962e78 x13: 0000000000000003 x12: ffff003bc6231630
> > >       [  622.734757] [ T250254] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffdfc0000000
> > >       [  622.734758] [ T250254] x8 : 003ff3d37312f280 x7 : 0720072007200720 x6 : ffff80008018710c
> > >       [  622.734760] [ T250254] x5 : 0000000000000001 x4 : 00000090ecc72ac0 x3 : 0000000000000000
> > >       [  622.734761] [ T250254] x2 : 0000000000000000 x1 : ffff800081a72bc6 x0 : ffcf4dcccbcac9c8
> > >       [  622.734763] [ T250254] Call trace:
> > >       [  622.734763] [     T250254] kfree (/home/user/Devel/linux-next/./include/linux/page-flags.h:284 /home/user/Devel/linux-next/./include/linux/mm.h:1182 /home/user/Devel/linux-next/mm/slub.c:4871) (P)
> > >       [  622.734766] [     T250254] vhost_dev_cleanup (/home/user/Devel/linux-next/drivers/vhost/vhost.c:506 /home/user/Devel/linux-next/drivers/vhost/vhost.c:542 /home/user/Devel/linux-next/drivers/vhost/vhost.c:1214) vhost
> > >       [  622.734769] [     T250254] vhost_vsock_dev_release (/home/user/Devel/linux-next/drivers/vhost/vsock.c:756) vhost_vsock
> > >       [  622.734771] [     T250254] __fput (/home/user/Devel/linux-next/fs/file_table.c:469)
> > >       [  622.734772] [     T250254] fput_close_sync (/home/user/Devel/linux-next/fs/file_table.c:?)
> > >       [  622.734773] [     T250254] __arm64_sys_close (/home/user/Devel/linux-next/fs/open.c:1589 /home/user/Devel/linux-next/fs/open.c:1572 /home/user/Devel/linux-next/fs/open.c:1572)
> > >       [  622.734776] [     T250254] invoke_syscall (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:50)
> > >       [  622.734778] [     T250254] el0_svc_common (/home/user/Devel/linux-next/./include/linux/thread_info.h:135 /home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:140)
> > >       [  622.734781] [     T250254] do_el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:152)
> > >       [  622.734783] [     T250254] el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:169 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:182 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:880)
> > >       [  622.734787] [     T250254] el0t_64_sync_handler (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:958)
> > >       [  622.734790] [     T250254] el0t_64_sync (/home/user/Devel/linux-next/arch/arm64/kernel/entry.S:596)
> > >       [ 622.734792] [ T250254] Code: f2dffbe9 927abd08 cb141908 8b090114 (f9400688)
> > >       All code
> > >       ========
> > >       0:*     e9 fb df f2 08          jmp    0x8f2e000                <-- trapping instruction
> > >       5:      bd 7a 92 08 19          mov    $0x1908927a,%ebp
> > >       a:      14 cb                   adc    $0xcb,%al
> > >       c:      14 01                   adc    $0x1,%al
> > >       e:      09 8b 88 06 40 f9       or     %ecx,-0x6bff978(%rbx)
> > >
> > >       Code starting with the faulting instruction
> > >       ===========================================
> > >       0:      88 06                   mov    %al,(%rsi)
> > >       2:      40 f9                   rex stc
> > >       [  622.734795] [ T250254] SMP: stopping secondary CPUs
> > >       [  622.735089] [ T250254] Starting crashdump kernel...
> > >       [  622.735091] [ T250254] Bye!
> >



* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-24  8:22     ` Michael S. Tsirkin
@ 2025-07-24  8:44       ` Will Deacon
  2025-07-24 12:48         ` Breno Leitao
  0 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2025-07-24  8:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefano Garzarella, Breno Leitao, jasowang, eperezma,
	linux-arm-kernel, kvm, Stefan Hajnoczi, netdev

On Thu, Jul 24, 2025 at 04:22:15AM -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 24, 2025 at 10:14:36AM +0200, Stefano Garzarella wrote:
> > CCing Will

Thanks.

> > On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> > > > Hello,
> > > >
> > > > I've seen a crash in linux-next for a while on my arm64 server, and
> > > > I decided to report.
> > > >
> > > > While running stress-ng on linux-next, I see the crash below.
> > > >
> > > > This is happening in a kernel configured with some debug options (KASAN,
> > > > LOCKDEP and KMEMLEAK).
> > > >
> > > > Basically running stress-ng in a loop would crash the host in 15-20
> > > > minutes:
> > > >       # while (true); do stress-ng -r 10 -t 10; done
> > > >
> > > > From the early warning "virt_to_phys used for non-linear address",
> > 
> > Hmm, we recently added non-linear SKB support to vhost-vsock [1].
> > @Will, could this issue be related?
> 
> Good point.
> 
> Breno, if bisecting is too much trouble, would you mind testing the commits
> c76f3c4364fe523cd2782269eab92529c86217aa
> and
> c7991b44d7b44f9270dec63acd0b2965d29aab43
> and telling us if this reproduces?

That's definitely worth doing, but we should be careful not to confuse
the "non-linear address" from the warning (which refers to virtual
addresses that lie outside of the linear mapping of memory, e.g. in the
vmalloc space) and "non-linear SKBs" which refer to SKBs with fragment
pages.
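
To make the "non-linear address" half of that distinction concrete, here is a
minimal user-space sketch of the kind of range check that sits behind the
warning. It is an illustration only: it assumes a 48-bit kernel VA layout (the
thread does not state the reporter's VA_BITS), and is_linear_map_addr() and
check_log_values() are hypothetical helpers modeled on the arm64 linear-map
test, not the kernel's own code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48-bit kernel virtual addresses; the thread does not state VA_BITS. */
#define VA_BITS_ASSUMED 48

/* Start of the linear map: -(1 << VA_BITS), 0xffff000000000000 for 48 bits. */
#define LM_START (-(UINT64_C(1) << VA_BITS_ASSUMED))
/* End of the linear map: -(1 << (VA_BITS - 1)), 0xffff800000000000 for 48 bits. */
#define LM_END   (-(UINT64_C(1) << (VA_BITS_ASSUMED - 1)))

/* Hypothetical helper: true if addr lies in [LM_START, LM_END). */
static bool is_linear_map_addr(uint64_t addr)
{
	/* Unsigned wrap-around makes addresses below LM_START fail the test too. */
	return (addr - LM_START) < (LM_END - LM_START);
}

/* Values copied from the register dumps quoted in this thread. */
static void check_log_values(void)
{
	assert(!is_linear_map_addr(UINT64_C(0xffcecdcccbcac9c8))); /* x19 in the WARNING */
	assert(is_linear_map_addr(UINT64_C(0xffff0034a7cc7900)));  /* x28 */
	assert(!is_linear_map_addr(UINT64_C(0xffff800158e8fc60))); /* sp */
}
```

Under these assumptions the value passed to kfree() falls outside the linear
map, which is all the warning says. Nothing in this check involves SKBs or
their fragment pages.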

Breno -- when you say you've been seeing this "for a while", what's the
earliest kernel you know you saw it on?

> > > > I suppose corrupted data is at vq->nheads.
> > > >
> > > > Here is the decoded stack against 9798752 ("Add linux-next specific
> > > > files for 20250721")
> > > >
> > > >
> > > >       [  620.685144] [ T250731] VFIO - User Level meta-driver version: 0.3
> > > >       [  622.394448] [ T250254] ------------[ cut here ]------------
> > > >       [  622.413492] [ T250254] virt_to_phys used for non-linear address: 000000006e69fe64 (0xcfcecdcccbcac9c8)

So here's the bad (non-linear) pointer. Do you know if 0xcfcecdcccbcac9c8
correlates with the packet data that stress-ng is generating? I wonder if
we're somehow overflowing vq->iov[].
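
One quick way to look at that value is to split it into bytes and see
whether they form a counting sequence, which would point at buffer
contents landing in a pointer slot rather than at a stale pointer. A
minimal sketch, assuming a little-endian kernel; the helper names are
hypothetical, and it says nothing about whether stress-ng actually
generates such a pattern.

```python
def pointer_bytes(value: int) -> bytes:
    """Return the 8 bytes of a 64-bit value in memory order.

    Assumption: little-endian layout (lowest byte first).
    """
    return value.to_bytes(8, "little")


def is_incrementing_run(value: int) -> bool:
    """True if each byte is exactly one more than the previous (mod 256)."""
    b = pointer_bytes(value)
    return all((b[i + 1] - b[i]) % 256 == 1 for i in range(len(b) - 1))


bad_pointer = 0xCFCECDCCCBCAC9C8  # value printed by the warning
print(" ".join(f"{byte:02x}" for byte in pointer_bytes(bad_pointer)))
# prints: c8 c9 ca cb cc cd ce cf
print(is_incrementing_run(bad_pointer))
# prints: True
```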

Will


* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-24  8:44       ` Will Deacon
@ 2025-07-24 12:48         ` Breno Leitao
  2025-07-24 12:52           ` Stefano Garzarella
  0 siblings, 1 reply; 12+ messages in thread
From: Breno Leitao @ 2025-07-24 12:48 UTC (permalink / raw)
  To: Will Deacon
  Cc: Michael S. Tsirkin, Stefano Garzarella, jasowang, eperezma,
	linux-arm-kernel, kvm, Stefan Hajnoczi, netdev

On Thu, Jul 24, 2025 at 09:44:38AM +0100, Will Deacon wrote:
> > > On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> > > > > Hello,
> > > > >
> > > > > I've seen a crash in linux-next for a while on my arm64 server, and
> > > > > I decided to report.
> > > > >
> > > > > While running stress-ng on linux-next, I see the crash below.
> > > > >
> > > > > This is happening in a kernel configure with some debug options (KASAN,
> > > > > LOCKDEP and KMEMLEAK).
> > > > >
> > > > > Basically running stress-ng in a loop would crash the host in 15-20
> > > > > minutes:
> > > > >       # while (true); do stress-ng -r 10 -t 10; done
> > > > >
> > > > > From the early warning "virt_to_phys used for non-linear address",
> > > 
> > > mmm, we recently added nonlinear SKBs support in vhost-vsock [1],
> > > @Will can this issue be related?
> > 
> > Good point.
> > 
> > Breno, if bisecting is too much trouble, would you mind testing the commits
> > c76f3c4364fe523cd2782269eab92529c86217aa
> > and
> > c7991b44d7b44f9270dec63acd0b2965d29aab43
> > and telling us if this reproduces?
> 
> That's definitely worth doing, but we should be careful not to confuse
> the "non-linear address" from the warning (which refers to virtual
> addresses that lie outside of the linear mapping of memory, e.g. in the
> vmalloc space) and "non-linear SKBs" which refer to SKBs with fragment
> pages.

I've tested both commits above, and the crash reproduces on each of
them. The only difference I noticed is that I haven't seen the warning
before the crash.


Log against c76f3c4364fe ("vhost/vsock: Avoid allocating
arbitrarily-sized SKBs")

	 Unable to handle kernel paging request at virtual address 0000001fc0000048
	 Mem abort info:
	   ESR = 0x0000000096000005
	   EC = 0x25: DABT (current EL), IL = 32 bits
	   SET = 0, FnV = 0
	   EA = 0, S1PTW = 0
	   FSC = 0x05: level 1 translation fault
	 Data abort info:
	   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
	   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
	   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
	 user pgtable: 64k pages, 48-bit VAs, pgdp=0000000cdcf2da00
	 [0000001fc0000048] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
	 Internal error: Oops: 0000000096000005 [#1]  SMP
	 Modules linked in: vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic unix_diag vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf nvidia_c
	 CPU: 34 UID: 0 PID: 1727297 Comm: stress-ng-dev Kdump: loaded Not tainted 6.16.0-rc6-upstream-00027-gc76f3c4364fe #19 NONE
	 pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
	 pc : kfree+0x48/0x2a8
	 lr : vhost_dev_cleanup+0x138/0x2b8 [vhost]
	 sp : ffff80013a0cfcd0
	 x29: ffff80013a0cfcd0 x28: ffff0008fd0b6240 x27: 0000000000000000
	 x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
	 x23: 00000000040e001f x22: ffffffffffffffff x21: ffff00014f1d4ac0
	 x20: 0000000000000001 x19: ffff00014f1d0000 x18: 0000000000000000
	 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
	 x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001
	 x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000
	 x8 : 0000001fc0000040 x7 : 0000000000000000 x6 : 0000000000000000
	 x5 : ffff000141931840 x4 : 0000000000000000 x3 : 0000000000000008
	 x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 0000000000010000
	 Call trace:
	  kfree+0x48/0x2a8 (P)
	  vhost_dev_cleanup+0x138/0x2b8 [vhost]
	  vhost_net_release+0xa0/0x1a8 [vhost_net]
	  __fput+0xfc/0x2f0
	  fput_close_sync+0x38/0xc8
	  __arm64_sys_close+0xb4/0x108
	  invoke_syscall+0x4c/0xd0
	  do_el0_svc+0x80/0xb0
	  el0_svc+0x3c/0xd0
	  el0t_64_sync_handler+0x70/0x100
	  el0t_64_sync+0x170/0x178
	 Code: 8b080008 f2dffbe9 d350fd08 8b081928 (f9400509)

Log against c7991b44d7b4 ("vsock/virtio: Allocate nonlinear SKBs for
handling large transmit buffers")

	Unable to handle kernel paging request at virtual address 0010502f8f8f4f08
	Mem abort info:
	  ESR = 0x0000000096000004
	  EC = 0x25: DABT (current EL), IL = 32 bits
	  SET = 0, FnV = 0
	  EA = 0, S1PTW = 0
	  FSC = 0x04: level 0 translation fault
	Data abort info:
	  ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
	  CM = 0, WnR = 0, TnD = 0, TagAccess = 0
	  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
	[0010502f8f8f4f08] address between user and kernel address ranges
	Internal error: Oops: 0000000096000004 [#1]  SMP
	Modules linked in: vhost_vsock vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf ipmi_s
	CPU: 47 UID: 0 PID: 1239699 Comm: stress-ng-dev Kdump: loaded Tainted: G        W           6.16.0-rc6-upstream-00035-gc7991b44d7b4 #18 NONE
	Tainted: [W]=WARN
	pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
	pc : kfree+0x48/0x2a8
	lr : vhost_dev_cleanup+0x138/0x2b8 [vhost]
	sp : ffff80016c0cfcd0
	x29: ffff80016c0cfcd0 x28: ffff001ad6210d80 x27: 0000000000000000
	x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
	x23: 00000000040e001f x22: ffffffffffffffff x21: ffff001bb76f00c0
	x20: 0000000000000000 x19: ffff001bb76f0000 x18: 0000000000000000
	x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
	x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001
	x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000
	x8 : 0010502f8f8f4f00 x7 : 0000000000000000 x6 : 0000000000000000
	x5 : ffff00012e7e2128 x4 : 0000000000000000 x3 : 0000000000000008
	x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 41403f3e3d3c3b3a
	Call trace:
	 kfree+0x48/0x2a8 (P)
	 vhost_dev_cleanup+0x138/0x2b8 [vhost]
	 vhost_net_release+0xa0/0x1a8 [vhost_net]
	 __fput+0xfc/0x2f0
	 fput_close_sync+0x38/0xc8
	 __arm64_sys_close+0xb4/0x108
	 invoke_syscall+0x4c/0xd0
	 do_el0_svc+0x80/0xb0
	 el0_svc+0x3c/0xd0
	 el0t_64_sync_handler+0x70/0x100
	 el0t_64_sync+0x170/0x178
	Code: 8b080008 f2dffbe9 d350fd08 8b081928 (f9400509)
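
For what it's worth, both faulting addresses above can be reproduced
from the x0 values in the dumps, which suggests x0 still holds the
pointer passed to kfree() and the fault is the struct page lookup for
that pointer. A rough sketch of the arithmetic in standalone Python.
The constants are assumptions read off the logs: 64k pages and 48-bit
VAs from the first oops, the vmemmap base from x9, a 64-byte struct
page, and the +8 offset taken as the difference between x8 and the
reported fault address. The function names are made up.

```python
MASK64 = (1 << 64) - 1

PAGE_OFFSET = 0xFFFF000000000000    # assumption: 48-bit VAs
PAGE_SHIFT = 16                     # "64k pages" in the first oops
VMEMMAP_START = 0xFFFFFFDFC0000000  # x9 in both register dumps
STRUCT_PAGE_SIZE = 64               # assumption about sizeof(struct page)
FIELD_OFFSET = 8                    # fault address minus x8 in both dumps


def struct_page_address(ptr: int) -> int:
    """Address of the struct page that would describe ptr (64-bit wrap)."""
    index = ((ptr - PAGE_OFFSET) & MASK64) >> PAGE_SHIFT
    return (VMEMMAP_START + index * STRUCT_PAGE_SIZE) & MASK64


def fault_address(ptr: int) -> int:
    """Address of the load that faults when ptr is bogus."""
    return (struct_page_address(ptr) + FIELD_OFFSET) & MASK64


for x0 in (0x0000000000010000, 0x41403F3E3D3C3B3A):
    print(f"x0={x0:016x} -> fault at {fault_address(x0):016x}")
# prints: x0=0000000000010000 -> fault at 0000001fc0000048
# prints: x0=41403f3e3d3c3b3a -> fault at 0010502f8f8f4f08
```

So in both runs the value handed to kfree() was not a kernel pointer
at all (0x10000 in the first, a run of consecutive bytes in the
second).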


> Breno -- when you say you've been seeing this "for a while", what's the
> earliest kernel you know you saw it on?

Looking at my logs, the oldest kernel I saw it on was net-next from
20250717, which was around the time I decided to test net-next in
preparation for 6.17, so that's not very helpful. Sorry.


* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-24 12:48         ` Breno Leitao
@ 2025-07-24 12:52           ` Stefano Garzarella
  2025-07-24 13:49             ` Breno Leitao
  0 siblings, 1 reply; 12+ messages in thread
From: Stefano Garzarella @ 2025-07-24 12:52 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Will Deacon, Michael S. Tsirkin, jasowang, eperezma,
	linux-arm-kernel, kvm, Stefan Hajnoczi, netdev

On Thu, 24 Jul 2025 at 14:48, Breno Leitao <leitao@debian.org> wrote:
>
> On Thu, Jul 24, 2025 at 09:44:38AM +0100, Will Deacon wrote:
> > > > On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >
> > > > > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> > > > > > Hello,
> > > > > >
> > > > > > I've seen a crash in linux-next for a while on my arm64 server, and
> > > > > > I decided to report.
> > > > > >
> > > > > > While running stress-ng on linux-next, I see the crash below.
> > > > > >
> > > > > > This is happening in a kernel configure with some debug options (KASAN,
> > > > > > LOCKDEP and KMEMLEAK).
> > > > > >
> > > > > > Basically running stress-ng in a loop would crash the host in 15-20
> > > > > > minutes:
> > > > > >       # while (true); do stress-ng -r 10 -t 10; done
> > > > > >
> > > > > > > From the early warning "virt_to_phys used for non-linear address",
> > > >
> > > > mmm, we recently added nonlinear SKBs support in vhost-vsock [1],
> > > > @Will can this issue be related?
> > >
> > > Good point.
> > >
> > > Breno, if bisecting is too much trouble, would you mind testing the commits
> > > c76f3c4364fe523cd2782269eab92529c86217aa
> > > and
> > > c7991b44d7b44f9270dec63acd0b2965d29aab43
> > > and telling us if this reproduces?
> >
> > That's definitely worth doing, but we should be careful not to confuse
> > the "non-linear address" from the warning (which refers to virtual
> > addresses that lie outside of the linear mapping of memory, e.g. in the
> > vmalloc space) and "non-linear SKBs" which refer to SKBs with fragment
> > pages.
>
> I've tested both commits above, and I see the crash on both commits
> above, thus, the problem reproduces in both cases. The only difference
> I noted is the fact that I haven't seen the warning before the crash.
>
>
> Log against c76f3c4364fe ("vhost/vsock: Avoid allocating
> arbitrarily-sized SKBs")
>
>          Unable to handle kernel paging request at virtual address 0000001fc0000048
>          Mem abort info:
>            ESR = 0x0000000096000005
>            EC = 0x25: DABT (current EL), IL = 32 bits
>            SET = 0, FnV = 0
>            EA = 0, S1PTW = 0
>            FSC = 0x05: level 1 translation fault
>          Data abort info:
>            ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
>            CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>            GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>          user pgtable: 64k pages, 48-bit VAs, pgdp=0000000cdcf2da00
>          [0000001fc0000048] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
>          Internal error: Oops: 0000000096000005 [#1]  SMP
>          Modules linked in: vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic unix_diag vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf nvidia_c
>          CPU: 34 UID: 0 PID: 1727297 Comm: stress-ng-dev Kdump: loaded Not tainted 6.16.0-rc6-upstream-00027-gc76f3c4364fe #19 NONE
>          pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
>          pc : kfree+0x48/0x2a8
>          lr : vhost_dev_cleanup+0x138/0x2b8 [vhost]
>          sp : ffff80013a0cfcd0
>          x29: ffff80013a0cfcd0 x28: ffff0008fd0b6240 x27: 0000000000000000
>          x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
>          x23: 00000000040e001f x22: ffffffffffffffff x21: ffff00014f1d4ac0
>          x20: 0000000000000001 x19: ffff00014f1d0000 x18: 0000000000000000
>          x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
>          x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001
>          x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000
>          x8 : 0000001fc0000040 x7 : 0000000000000000 x6 : 0000000000000000
>          x5 : ffff000141931840 x4 : 0000000000000000 x3 : 0000000000000008
>          x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 0000000000010000
>          Call trace:
>           kfree+0x48/0x2a8 (P)
>           vhost_dev_cleanup+0x138/0x2b8 [vhost]
>           vhost_net_release+0xa0/0x1a8 [vhost_net]

But this trace goes through vhost_net, so I'm confused now.
Do you see the same (vhost_net) also on 9798752 ("Add linux-next
specific files for 20250721")?

The initial report contained only vhost_vsock traces IIUC, so I'm
suspecting something in the vhost core.

Thanks,
Stefano

>           __fput+0xfc/0x2f0
>           fput_close_sync+0x38/0xc8
>           __arm64_sys_close+0xb4/0x108
>           invoke_syscall+0x4c/0xd0
>           do_el0_svc+0x80/0xb0
>           el0_svc+0x3c/0xd0
>           el0t_64_sync_handler+0x70/0x100
>           el0t_64_sync+0x170/0x178
>          Code: 8b080008 f2dffbe9 d350fd08 8b081928 (f9400509)
>
> Log against c7991b44d7b4 ("vsock/virtio: Allocate nonlinear SKBs for
> handling large transmit buffers")
>
>         Unable to handle kernel paging request at virtual address 0010502f8f8f4f08
>         Mem abort info:
>           ESR = 0x0000000096000004
>           EC = 0x25: DABT (current EL), IL = 32 bits
>           SET = 0, FnV = 0
>           EA = 0, S1PTW = 0
>           FSC = 0x04: level 0 translation fault
>         Data abort info:
>           ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>           CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>           GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>         [0010502f8f8f4f08] address between user and kernel address ranges
>         Internal error: Oops: 0000000096000004 [#1]  SMP
>         Modules linked in: vhost_vsock vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf ipmi_s
>         CPU: 47 UID: 0 PID: 1239699 Comm: stress-ng-dev Kdump: loaded Tainted: G        W           6.16.0-rc6-upstream-00035-gc7991b44d7b4 #18 NONE
>         Tainted: [W]=WARN
>         pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
>         pc : kfree+0x48/0x2a8
>         lr : vhost_dev_cleanup+0x138/0x2b8 [vhost]
>         sp : ffff80016c0cfcd0
>         x29: ffff80016c0cfcd0 x28: ffff001ad6210d80 x27: 0000000000000000
>         x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
>         x23: 00000000040e001f x22: ffffffffffffffff x21: ffff001bb76f00c0
>         x20: 0000000000000000 x19: ffff001bb76f0000 x18: 0000000000000000
>         x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
>         x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001
>         x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000
>         x8 : 0010502f8f8f4f00 x7 : 0000000000000000 x6 : 0000000000000000
>         x5 : ffff00012e7e2128 x4 : 0000000000000000 x3 : 0000000000000008
>         x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 41403f3e3d3c3b3a
>         Call trace:
>          kfree+0x48/0x2a8 (P)
>          vhost_dev_cleanup+0x138/0x2b8 [vhost]
>          vhost_net_release+0xa0/0x1a8 [vhost_net]
>          __fput+0xfc/0x2f0
>          fput_close_sync+0x38/0xc8
>          __arm64_sys_close+0xb4/0x108
>          invoke_syscall+0x4c/0xd0
>          do_el0_svc+0x80/0xb0
>          el0_svc+0x3c/0xd0
>          el0t_64_sync_handler+0x70/0x100
>          el0t_64_sync+0x170/0x178
>         Code: 8b080008 f2dffbe9 d350fd08 8b081928 (f9400509)
>
>
> > Breno -- when you say you've been seeing this "for a while", what's the
> > earliest kernel you know you saw it on?
>
> Looking at my logs, the older kernel that I saw it was net-next from
> 20250717, which was around the time I decided to test net-next in
> preparation for 6.17, so, not very helpful. Sorry.
>



* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-24 12:52           ` Stefano Garzarella
@ 2025-07-24 13:49             ` Breno Leitao
  2025-07-29  7:44               ` Jason Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Breno Leitao @ 2025-07-24 13:49 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Will Deacon, Michael S. Tsirkin, jasowang, eperezma,
	linux-arm-kernel, kvm, Stefan Hajnoczi, netdev

On Thu, Jul 24, 2025 at 02:52:08PM +0200, Stefano Garzarella wrote:
> On Thu, 24 Jul 2025 at 14:48, Breno Leitao <leitao@debian.org> wrote:
> >
> > On Thu, Jul 24, 2025 at 09:44:38AM +0100, Will Deacon wrote:
> > > > > On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > I've seen a crash in linux-next for a while on my arm64 server, and
> > > > > > > I decided to report.
> > > > > > >
> > > > > > > While running stress-ng on linux-next, I see the crash below.
> > > > > > >
> > > > > > > This is happening in a kernel configure with some debug options (KASAN,
> > > > > > > LOCKDEP and KMEMLEAK).
> > > > > > >
> > > > > > > Basically running stress-ng in a loop would crash the host in 15-20
> > > > > > > minutes:
> > > > > > >       # while (true); do stress-ng -r 10 -t 10; done
> > > > > > >
> > > > > > > > From the early warning "virt_to_phys used for non-linear address",
> > > > >
> > > > > mmm, we recently added nonlinear SKBs support in vhost-vsock [1],
> > > > > @Will can this issue be related?
> > > >
> > > > Good point.
> > > >
> > > > Breno, if bisecting is too much trouble, would you mind testing the commits
> > > > c76f3c4364fe523cd2782269eab92529c86217aa
> > > > and
> > > > c7991b44d7b44f9270dec63acd0b2965d29aab43
> > > > and telling us if this reproduces?
> > >
> > > That's definitely worth doing, but we should be careful not to confuse
> > > the "non-linear address" from the warning (which refers to virtual
> > > addresses that lie outside of the linear mapping of memory, e.g. in the
> > > vmalloc space) and "non-linear SKBs" which refer to SKBs with fragment
> > > pages.
> >
> > I've tested both commits above, and I see the crash on both commits
> > above, thus, the problem reproduces in both cases. The only difference
> > I noted is the fact that I haven't seen the warning before the crash.
> >
> >
> > Log against c76f3c4364fe ("vhost/vsock: Avoid allocating
> > arbitrarily-sized SKBs")
> >
> >          Unable to handle kernel paging request at virtual address 0000001fc0000048
> >          Mem abort info:
> >            ESR = 0x0000000096000005
> >            EC = 0x25: DABT (current EL), IL = 32 bits
> >            SET = 0, FnV = 0
> >            EA = 0, S1PTW = 0
> >            FSC = 0x05: level 1 translation fault
> >          Data abort info:
> >            ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
> >            CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> >            GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> >          user pgtable: 64k pages, 48-bit VAs, pgdp=0000000cdcf2da00
> >          [0000001fc0000048] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
> >          Internal error: Oops: 0000000096000005 [#1]  SMP
> >          Modules linked in: vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic unix_diag vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf nvidia_c
> >          CPU: 34 UID: 0 PID: 1727297 Comm: stress-ng-dev Kdump: loaded Not tainted 6.16.0-rc6-upstream-00027-gc76f3c4364fe #19 NONE
> >          pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
> >          pc : kfree+0x48/0x2a8
> >          lr : vhost_dev_cleanup+0x138/0x2b8 [vhost]
> >          sp : ffff80013a0cfcd0
> >          x29: ffff80013a0cfcd0 x28: ffff0008fd0b6240 x27: 0000000000000000
> >          x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
> >          x23: 00000000040e001f x22: ffffffffffffffff x21: ffff00014f1d4ac0
> >          x20: 0000000000000001 x19: ffff00014f1d0000 x18: 0000000000000000
> >          x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> >          x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001
> >          x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000
> >          x8 : 0000001fc0000040 x7 : 0000000000000000 x6 : 0000000000000000
> >          x5 : ffff000141931840 x4 : 0000000000000000 x3 : 0000000000000008
> >          x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 0000000000010000
> >          Call trace:
> >           kfree+0x48/0x2a8 (P)
> >           vhost_dev_cleanup+0x138/0x2b8 [vhost]
> >           vhost_net_release+0xa0/0x1a8 [vhost_net]
> 
> But here is the vhost_net, so I'm confused now.
> Do you see the same (vhost_net) also on 9798752 ("Add linux-next
> specific files for 20250721") ?

I will need to reproduce, but, looking at my logs, I see the following
against: c76f3c4364fe ("vhost/vsock: Avoid allocating arbitrarily-sized SKBs").
The logs are a bit intermixed, probably there were multiple CPUs hitting
the same code path.

           virt_to_phys used for non-linear address: 000000001b662678 (0xffe61984a460)
           WARNING: CPU: 15 PID: 112846 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x80/0xa8
           Modules linked in: vhost_vsock(E) vhost(E) vhost_iotlb(E) ghes_edac(E) tls(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) ipmi_ssif(E) ipmi_devintf(E) ipmi_msghandler(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E)
           CPU: 15 UID: 0 PID: 112846 Comm: stress-ng-dev Kdump: loaded Tainted: G        W   E    N  6.16.0-rc6-upstream-00027-gc76f3c4364fe #16 PREEMPT(none) 
           Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
           pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
           pc : __virt_to_phys+0x80/0xa8
           lr : __virt_to_phys+0x7c/0xa8
           sp : ffff8001184d7a30
           x29: ffff8001184d7a30 x28: 00000000000045d8 x27: 1fffe000e7e88014
           x26: 1fffe000e7e888f7 x25: ffff0007e578bf00 x24: 1fffe000e7e88013
           x23: 0000000000000000 x22: 0000ffe61984a460 x21: ffff00073f440098
           x20: ffffff1000080000 x19: 0000ffe61984a460 x18: 0000000000000002
           x17: 6666783028203837 x16: 3632363662313030 x15: 0000000000000001
           x14: 1fffe006d52e90f2 x13: 0000000000000000 x12: 0000000000000000
           x11: ffff6006d52e90f3 x10: 0000000000000002 x9 : cfc659a21c727d00
           x8 : ffff800083c19000 x7 : 0000000000000001 x6 : 0000000000000001
           x5 : ffff8001184d7398 x4 : ffff800084866d60 x3 : ffff8000805fdd94
           x2 : 0000000000000001 x1 : 0000000000000004 x0 : 000000000000004b
           Call trace:
            __virt_to_phys+0x80/0xa8 (P)
            kfree+0xac/0x4b0
            vhost_dev_cleanup+0x484/0x8b0 [vhost]
            vhost_vsock_dev_release+0x2f4/0x358 [vhost_vsock]
            __fput+0x2b4/0x608
            fput_close_sync+0xe8/0x1e0
            __arm64_sys_close+0x84/0xd0
            invoke_syscall+0x8c/0x208
            do_el0_svc+0x128/0x1a0
            el0_svc+0x58/0x160
            el0t_64_sync_handler+0x78/0x108
            el0t_64_sync+0x198/0x1a0
           irq event stamp: 0
           hardirqs last  enabled at (0): [<0000000000000000>] 0x0
           hardirqs last disabled at (0): [<ffff8000801d876c>] copy_process+0xd5c/0x29f8
           softirqs last  enabled at (0): [<ffff8000801d879c>] copy_process+0xd8c/0x29f8
           softirqs last disabled at (0): [<0000000000000000>] 0x0
           ---[ end trace 0000000000000000 ]---
           Unable to handle kernel paging request at virtual address 0000040053791288
           ------------[ cut here ]------------
           lr : kfree+0xac/0x4b0
           virt_to_phys used for non-linear address: 00000000290839fd (0x2500000000)
           WARNING: CPU: 41 PID: 112845 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x80/0xa8
           Modules linked in: vhost_vsock(E) vhost(E) vhost_iotlb(E) ghes_edac(E) tls(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) ipmi_ssif(E) ipmi_devintf(E) ipmi_msghandler(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E)
           CPU: 41 UID: 0 PID: 112845 Comm: stress-ng-dev Kdump: loaded Tainted: G        W   E    N  6.16.0-rc6-upstream-00027-gc76f3c4364fe #16 PREEMPT(none) 
           x23: 0000000000000001
           Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
           pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
           pc : __virt_to_phys+0x80/0xa8
           lr : __virt_to_phys+0x7c/0xa8
           sp : ffff8001a8277a30
           x29: ffff8001a8277a30 x28: 00000000000045d8 x27: 1fffe000e7e3c014
           x26: 1fffe000e7e3c8f7 x25: ffff0007bcff0000 x24: 1fffe000e7e3c013
           x23: 0000000000000000 x22: 0000002500000000 x21: ffff00073f1e0098
           x20: ffffff1000080000 x19: 0000002500000000 x18: 0000000000000004
           x17: 00000000ffffffff x16: 0000000000000001 x15: 0000000000000001
           x14: 1fffe006d53920f2 x13: 0000000000000000 x12: 0000000000000000
           sp : ffff8001184d7a50
           x29: ffff8001184d7a60 x28: 00000000000045d8 x27: 1fffe000e7e88014
           x26: 1fffe000e7e888f7 x25: ffff0007e578bf00 x24: 1fffe000e7e88013
           x23: 0000000000000000 x22: 0000ffe61984a460 x21: ffff00073f440098
           x20: 0000040053791280 x19: ffff80000d0b8bbc x18: 0000000000000002
           x17: 6666783028203837 x16: 3632363662313030 x15: 0000000000000001
            x22: 0000ffe6199a459d
           x14: 1fffe006d52e90f2
            x21: ffff00073f7f0098
           x20: ffffff1000080000 x19: 0000ffe6199a459d x18: 0000000000000004
           x17: 54455320203b2d2c x16: 0000000000000011 x15: 0000000000000001
           x14: 1fffe006d52bb8f2 x13: 0000000000000000 x12: 0000000000000000
           x11: ffff6006d52bb8f3 x10: dfff800000000000 x9 : 77521a2bd3e0be00
           x8 : ffff800083c19000 x7 : 0000000000000000 x6 : ffff80008036bc2c
           x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffff8000805fdd94
           x2 : 0000000000000001 x1 : 0000000000000004 x0 : 000000000000004b
           Call trace:
            __virt_to_phys+0x80/0xa8 (P)
            kfree+0xac/0x4b0
            vhost_dev_cleanup+0x484/0x8b0 [vhost]
            vhost_vsock_dev_release+0x2f4/0x358 [vhost_vsock]
            __fput+0x2b4/0x608
           x11: ffff6006d53920f3
            fput_close_sync+0xe8/0x1e0
            __arm64_sys_close+0x84/0xd0
            invoke_syscall+0x8c/0x208
            do_el0_svc+0x128/0x1a0
            el0_svc+0x58/0x160
            el0t_64_sync_handler+0x78/0x108
            el0t_64_sync+0x198/0x1a0
           irq event stamp: 0
            x13: 0000000000000000
           hardirqs last  enabled at (0): [<0000000000000000>] 0x0
           hardirqs last disabled at (0): [<ffff8000801d876c>] copy_process+0xd5c/0x29f8


> The initial report contained only vhost_vsock traces IIUC, so I'm
> suspecting something in the vhost core.

Right, it seems we are hitting the same code path on both
vhost_vsock and vhost_net.


* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-24 13:49             ` Breno Leitao
@ 2025-07-29  7:44               ` Jason Wang
  2025-07-29  9:10                 ` Breno Leitao
  2025-07-29  9:57                 ` Stefano Garzarella
  0 siblings, 2 replies; 12+ messages in thread
From: Jason Wang @ 2025-07-29  7:44 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Stefano Garzarella, Will Deacon, Michael S. Tsirkin, eperezma,
	linux-arm-kernel, kvm, Stefan Hajnoczi, netdev

On Thu, Jul 24, 2025 at 9:50 PM Breno Leitao <leitao@debian.org> wrote:
>
> On Thu, Jul 24, 2025 at 02:52:08PM +0200, Stefano Garzarella wrote:
> > On Thu, 24 Jul 2025 at 14:48, Breno Leitao <leitao@debian.org> wrote:
> > >
> > > On Thu, Jul 24, 2025 at 09:44:38AM +0100, Will Deacon wrote:
> > > > > > On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I've seen a crash in linux-next for a while on my arm64 server, and
> > > > > > > > I decided to report.
> > > > > > > >
> > > > > > > > While running stress-ng on linux-next, I see the crash below.
> > > > > > > >
> > > > > > > > This is happening in a kernel configure with some debug options (KASAN,
> > > > > > > > LOCKDEP and KMEMLEAK).
> > > > > > > >
> > > > > > > > Basically running stress-ng in a loop would crash the host in 15-20
> > > > > > > > minutes:
> > > > > > > >       # while (true); do stress-ng -r 10 -t 10; done
> > > > > > > >
> > > > > > > > > From the early warning "virt_to_phys used for non-linear address",
> > > > > >
> > > > > > mmm, we recently added nonlinear SKBs support in vhost-vsock [1],
> > > > > > @Will can this issue be related?
> > > > >
> > > > > Good point.
> > > > >
> > > > > Breno, if bisecting is too much trouble, would you mind testing the commits
> > > > > c76f3c4364fe523cd2782269eab92529c86217aa
> > > > > and
> > > > > c7991b44d7b44f9270dec63acd0b2965d29aab43
> > > > > and telling us if this reproduces?
> > > >
> > > > That's definitely worth doing, but we should be careful not to confuse
> > > > the "non-linear address" from the warning (which refers to virtual
> > > > addresses that lie outside of the linear mapping of memory, e.g. in the
> > > > vmalloc space) and "non-linear SKBs" which refer to SKBs with fragment
> > > > pages.
> > >
> > > I've tested both commits above, and I see the crash on both commits
> > > above, thus, the problem reproduces in both cases. The only difference
> > > I noted is the fact that I haven't seen the warning before the crash.
> > >
> > >
> > > Log against c76f3c4364fe ("vhost/vsock: Avoid allocating
> > > arbitrarily-sized SKBs")
> > >
> > >          Unable to handle kernel paging request at virtual address 0000001fc0000048
> > >          Mem abort info:
> > >            ESR = 0x0000000096000005
> > >            EC = 0x25: DABT (current EL), IL = 32 bits
> > >            SET = 0, FnV = 0
> > >            EA = 0, S1PTW = 0
> > >            FSC = 0x05: level 1 translation fault
> > >          Data abort info:
> > >            ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
> > >            CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > >            GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > >          user pgtable: 64k pages, 48-bit VAs, pgdp=0000000cdcf2da00
> > >          [0000001fc0000048] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
> > >          Internal error: Oops: 0000000096000005 [#1]  SMP
> > >          Modules linked in: vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic unix_diag vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf nvidia_c
> > >          CPU: 34 UID: 0 PID: 1727297 Comm: stress-ng-dev Kdump: loaded Not tainted 6.16.0-rc6-upstream-00027-gc76f3c4364fe #19 NONE
> > >          pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
> > >          pc : kfree+0x48/0x2a8
> > >          lr : vhost_dev_cleanup+0x138/0x2b8 [vhost]
> > >          sp : ffff80013a0cfcd0
> > >          x29: ffff80013a0cfcd0 x28: ffff0008fd0b6240 x27: 0000000000000000
> > >          x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
> > >          x23: 00000000040e001f x22: ffffffffffffffff x21: ffff00014f1d4ac0
> > >          x20: 0000000000000001 x19: ffff00014f1d0000 x18: 0000000000000000
> > >          x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> > >          x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001
> > >          x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000
> > >          x8 : 0000001fc0000040 x7 : 0000000000000000 x6 : 0000000000000000
> > >          x5 : ffff000141931840 x4 : 0000000000000000 x3 : 0000000000000008
> > >          x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 0000000000010000
> > >          Call trace:
> > >           kfree+0x48/0x2a8 (P)
> > >           vhost_dev_cleanup+0x138/0x2b8 [vhost]
> > >           vhost_net_release+0xa0/0x1a8 [vhost_net]
> >
> > But this trace is in vhost_net, so I'm confused now.
> > Do you also see the same (vhost_net) trace on 9798752 ("Add linux-next
> > specific files for 20250721")?
>
> I will need to reproduce it, but looking at my logs, I see the following
> against c76f3c4364fe ("vhost/vsock: Avoid allocating arbitrarily-sized SKBs").
> The logs are a bit intermixed; probably multiple CPUs were hitting
> the same code path.
>
>            virt_to_phys used for non-linear address: 000000001b662678 (0xffe61984a460)
>            WARNING: CPU: 15 PID: 112846 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x80/0xa8
>            Modules linked in: vhost_vsock(E) vhost(E) vhost_iotlb(E) ghes_edac(E) tls(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) ipmi_ssif(E) ipmi_devintf(E) ipmi_msghandler(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E)
>            CPU: 15 UID: 0 PID: 112846 Comm: stress-ng-dev Kdump: loaded Tainted: G        W   E    N  6.16.0-rc6-upstream-00027-gc76f3c4364fe #16 PREEMPT(none)
>            Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
>            pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
>            pc : __virt_to_phys+0x80/0xa8
>            lr : __virt_to_phys+0x7c/0xa8
>            sp : ffff8001184d7a30
>            x29: ffff8001184d7a30 x28: 00000000000045d8 x27: 1fffe000e7e88014
>            x26: 1fffe000e7e888f7 x25: ffff0007e578bf00 x24: 1fffe000e7e88013
>            x23: 0000000000000000 x22: 0000ffe61984a460 x21: ffff00073f440098
>            x20: ffffff1000080000 x19: 0000ffe61984a460 x18: 0000000000000002
>            x17: 6666783028203837 x16: 3632363662313030 x15: 0000000000000001
>            x14: 1fffe006d52e90f2 x13: 0000000000000000 x12: 0000000000000000
>            x11: ffff6006d52e90f3 x10: 0000000000000002 x9 : cfc659a21c727d00
>            x8 : ffff800083c19000 x7 : 0000000000000001 x6 : 0000000000000001
>            x5 : ffff8001184d7398 x4 : ffff800084866d60 x3 : ffff8000805fdd94
>            x2 : 0000000000000001 x1 : 0000000000000004 x0 : 000000000000004b
>            Call trace:
>             __virt_to_phys+0x80/0xa8 (P)
>             kfree+0xac/0x4b0
>             vhost_dev_cleanup+0x484/0x8b0 [vhost]
>             vhost_vsock_dev_release+0x2f4/0x358 [vhost_vsock]
>             __fput+0x2b4/0x608
>             fput_close_sync+0xe8/0x1e0
>             __arm64_sys_close+0x84/0xd0
>             invoke_syscall+0x8c/0x208
>             do_el0_svc+0x128/0x1a0
>             el0_svc+0x58/0x160
>             el0t_64_sync_handler+0x78/0x108
>             el0t_64_sync+0x198/0x1a0
>            irq event stamp: 0
>            hardirqs last  enabled at (0): [<0000000000000000>] 0x0
>            hardirqs last disabled at (0): [<ffff8000801d876c>] copy_process+0xd5c/0x29f8
>            softirqs last  enabled at (0): [<ffff8000801d879c>] copy_process+0xd8c/0x29f8
>            softirqs last disabled at (0): [<0000000000000000>] 0x0
>            ---[ end trace 0000000000000000 ]---
>            Unable to handle kernel paging request at virtual address 0000040053791288
>            ------------[ cut here ]------------
>            lr : kfree+0xac/0x4b0
>            virt_to_phys used for non-linear address: 00000000290839fd (0x2500000000)
>            WARNING: CPU: 41 PID: 112845 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x80/0xa8
>            Modules linked in: vhost_vsock(E) vhost(E) vhost_iotlb(E) ghes_edac(E) tls(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) ipmi_ssif(E) ipmi_devintf(E) ipmi_msghandler(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E)
>            CPU: 41 UID: 0 PID: 112845 Comm: stress-ng-dev Kdump: loaded Tainted: G        W   E    N  6.16.0-rc6-upstream-00027-gc76f3c4364fe #16 PREEMPT(none)
>            x23: 0000000000000001
>            Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
>            pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
>            pc : __virt_to_phys+0x80/0xa8
>            lr : __virt_to_phys+0x7c/0xa8
>            sp : ffff8001a8277a30
>            x29: ffff8001a8277a30 x28: 00000000000045d8 x27: 1fffe000e7e3c014
>            x26: 1fffe000e7e3c8f7 x25: ffff0007bcff0000 x24: 1fffe000e7e3c013
>            x23: 0000000000000000 x22: 0000002500000000 x21: ffff00073f1e0098
>            x20: ffffff1000080000 x19: 0000002500000000 x18: 0000000000000004
>            x17: 00000000ffffffff x16: 0000000000000001 x15: 0000000000000001
>            x14: 1fffe006d53920f2 x13: 0000000000000000 x12: 0000000000000000
>            sp : ffff8001184d7a50
>            x29: ffff8001184d7a60 x28: 00000000000045d8 x27: 1fffe000e7e88014
>            x26: 1fffe000e7e888f7 x25: ffff0007e578bf00 x24: 1fffe000e7e88013
>            x23: 0000000000000000 x22: 0000ffe61984a460 x21: ffff00073f440098
>            x20: 0000040053791280 x19: ffff80000d0b8bbc x18: 0000000000000002
>            x17: 6666783028203837 x16: 3632363662313030 x15: 0000000000000001
>             x22: 0000ffe6199a459d
>            x14: 1fffe006d52e90f2
>             x21: ffff00073f7f0098
>            x20: ffffff1000080000 x19: 0000ffe6199a459d x18: 0000000000000004
>            x17: 54455320203b2d2c x16: 0000000000000011 x15: 0000000000000001
>            x14: 1fffe006d52bb8f2 x13: 0000000000000000 x12: 0000000000000000
>            x11: ffff6006d52bb8f3 x10: dfff800000000000 x9 : 77521a2bd3e0be00
>            x8 : ffff800083c19000 x7 : 0000000000000000 x6 : ffff80008036bc2c
>            x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffff8000805fdd94
>            x2 : 0000000000000001 x1 : 0000000000000004 x0 : 000000000000004b
>            Call trace:
>             __virt_to_phys+0x80/0xa8 (P)
>             kfree+0xac/0x4b0
>             vhost_dev_cleanup+0x484/0x8b0 [vhost]
>             vhost_vsock_dev_release+0x2f4/0x358 [vhost_vsock]
>             __fput+0x2b4/0x608
>            x11: ffff6006d53920f3
>             fput_close_sync+0xe8/0x1e0
>             __arm64_sys_close+0x84/0xd0
>             invoke_syscall+0x8c/0x208
>             do_el0_svc+0x128/0x1a0
>             el0_svc+0x58/0x160
>             el0t_64_sync_handler+0x78/0x108
>             el0t_64_sync+0x198/0x1a0
>            irq event stamp: 0
>             x13: 0000000000000000
>            hardirqs last  enabled at (0): [<0000000000000000>] 0x0
>            hardirqs last disabled at (0): [<ffff8000801d876c>] copy_process+0xd5c/0x29f8
>
>
> > The initial report contained only vhost_vsock traces IIUC, so I'm
> > suspecting something in the vhost core.
>
> Right, it seems we are hitting the same code path on both
> vhost_vsock and vhost_net.
>

I've posted a fix here:

https://lore.kernel.org/virtualization/20250729073916.80647-1-jasowang@redhat.com/T/#u

I think it should address this issue.

Thanks
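
For anyone decoding the quoted warnings: "non-linear address" there means a virtual address outside the arm64 linear map, not a non-linear SKB. Below is a minimal sketch of that range check, assuming a 48-bit VA layout (the quoted oops reports "48-bit VAs") in which the linear map runs from 0xffff000000000000 up to, but not including, 0xffff800000000000. The helper name and the sample labels are illustrative only and are not taken from the kernel source; other configurations use different bounds.

```python
# Sketch only: classify addresses from the quoted logs against the arm64
# linear map. VA_BITS = 48 is an assumption based on "48-bit VAs" in the
# quoted oops; kernels built with a different layout use other bounds.
VA_BITS = 48
MASK64 = (1 << 64) - 1

# Start and end of the linear map for this assumed layout.
PAGE_OFFSET = (-(1 << VA_BITS)) & MASK64       # 0xffff000000000000
PAGE_END = (-(1 << (VA_BITS - 1))) & MASK64    # 0xffff800000000000


def is_linear_map_address(addr: int) -> bool:
    """Return True if addr lies inside the assumed linear-map range."""
    return PAGE_OFFSET <= addr < PAGE_END


# Values copied from the quoted warnings and register dumps.
samples = {
    "address reported by the warning on CPU 15": 0x0000FFE61984A460,
    "address reported by the warning on CPU 41": 0x0000002500000000,
    "x21 from the CPU 15 register dump (for contrast)": 0xFFFF00073F440098,
    "sp from the CPU 15 register dump (for contrast)": 0xFFFF8001184D7A30,
}

for label, addr in samples.items():
    kind = "linear map" if is_linear_map_address(addr) else "outside linear map"
    print(f"{addr:#018x}  {kind:<18}  {label}")
```

Under this assumption, neither address reported by the warnings lies in the linear map, so whatever pointer reached kfree() was not a valid slab pointer, while the x21 value from the same dump does fall inside the range.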



* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-29  7:44               ` Jason Wang
@ 2025-07-29  9:10                 ` Breno Leitao
  2025-07-29  9:57                 ` Stefano Garzarella
  1 sibling, 0 replies; 12+ messages in thread
From: Breno Leitao @ 2025-07-29  9:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefano Garzarella, Will Deacon, Michael S. Tsirkin, eperezma,
	linux-arm-kernel, kvm, Stefan Hajnoczi, netdev

Hello Jason,

On Tue, Jul 29, 2025 at 03:44:49PM +0800, Jason Wang wrote:
> On Thu, Jul 24, 2025 at 9:50 PM Breno Leitao <leitao@debian.org> wrote:
> > > The initial report contained only vhost_vsock traces IIUC, so I'm
> > > suspecting something in the vhost core.
> >
> > Right, it seems we are hitting the same code path on both
> > vhost_vsock and vhost_net.
> >
> 
> I've posted a fix here:
> 
> https://lore.kernel.org/virtualization/20250729073916.80647-1-jasowang@redhat.com/T/#u
> 
> I think it should address this issue.

Yes, it does. I've tested the fix on my machine and was not able to
reproduce the error at all.

Thanks for the fix,
--breno


* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-29  7:44               ` Jason Wang
  2025-07-29  9:10                 ` Breno Leitao
@ 2025-07-29  9:57                 ` Stefano Garzarella
  1 sibling, 0 replies; 12+ messages in thread
From: Stefano Garzarella @ 2025-07-29  9:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: Breno Leitao, Will Deacon, Michael S. Tsirkin, eperezma,
	linux-arm-kernel, kvm, Stefan Hajnoczi, netdev

On Tue, Jul 29, 2025 at 03:44:49PM +0800, Jason Wang wrote:
>On Thu, Jul 24, 2025 at 9:50 PM Breno Leitao <leitao@debian.org> wrote:
>>
>> On Thu, Jul 24, 2025 at 02:52:08PM +0200, Stefano Garzarella wrote:
>> > On Thu, 24 Jul 2025 at 14:48, Breno Leitao <leitao@debian.org> wrote:
>> > >
>> > > On Thu, Jul 24, 2025 at 09:44:38AM +0100, Will Deacon wrote:
>> > > > > > On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > > > > > >
>> > > > > > > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
>> > > > > > > > Hello,
>> > > > > > > >
>> > > > > > > > I've seen a crash in linux-next for a while on my arm64 server, and
>> > > > > > > > I decided to report it.
>> > > > > > > >
>> > > > > > > > While running stress-ng on linux-next, I see the crash below.
>> > > > > > > >
>> > > > > > > > This is happening in a kernel configured with some debug options (KASAN,
>> > > > > > > > LOCKDEP and KMEMLEAK).
>> > > > > > > >
>> > > > > > > > Basically running stress-ng in a loop would crash the host in 15-20
>> > > > > > > > minutes:
>> > > > > > > >       # while (true); do stress-ng -r 10 -t 10; done
>> > > > > > > >
>> > > > > > > > From the early warning "virt_to_phys used for non-linear address",
>> > > > > >
>> > > > > > mmm, we recently added nonlinear SKBs support in vhost-vsock [1],
>> > > > > > @Will can this issue be related?
>> > > > >
>> > > > > Good point.
>> > > > >
>> > > > > Breno, if bisecting is too much trouble, would you mind testing the commits
>> > > > > c76f3c4364fe523cd2782269eab92529c86217aa
>> > > > > and
>> > > > > c7991b44d7b44f9270dec63acd0b2965d29aab43
>> > > > > and telling us if this reproduces?
>> > > >
>> > > > That's definitely worth doing, but we should be careful not to confuse
>> > > > the "non-linear address" from the warning (which refers to virtual
>> > > > addresses that lie outside of the linear mapping of memory, e.g. in the
>> > > > vmalloc space) and "non-linear SKBs" which refer to SKBs with fragment
>> > > > pages.
>> > >
>> > > I've tested both commits above and see the crash on each, so the
>> > > problem reproduces in both cases. The only difference I noticed is
>> > > that the warning did not appear before the crash.
>> > >
>> > >
>> > > Log against c76f3c4364fe ("vhost/vsock: Avoid allocating
>> > > arbitrarily-sized SKBs")
>> > >
>> > >          Unable to handle kernel paging request at virtual address 0000001fc0000048
>> > >          Mem abort info:
>> > >            ESR = 0x0000000096000005
>> > >            EC = 0x25: DABT (current EL), IL = 32 bits
>> > >            SET = 0, FnV = 0
>> > >            EA = 0, S1PTW = 0
>> > >            FSC = 0x05: level 1 translation fault
>> > >          Data abort info:
>> > >            ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
>> > >            CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>> > >            GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>> > >          user pgtable: 64k pages, 48-bit VAs, pgdp=0000000cdcf2da00
>> > >          [0000001fc0000048] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
>> > >          Internal error: Oops: 0000000096000005 [#1]  SMP
>> > >          Modules linked in: vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic unix_diag vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf nvidia_c
>> > >          CPU: 34 UID: 0 PID: 1727297 Comm: stress-ng-dev Kdump: loaded Not tainted 6.16.0-rc6-upstream-00027-gc76f3c4364fe #19 NONE
>> > >          pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
>> > >          pc : kfree+0x48/0x2a8
>> > >          lr : vhost_dev_cleanup+0x138/0x2b8 [vhost]
>> > >          sp : ffff80013a0cfcd0
>> > >          x29: ffff80013a0cfcd0 x28: ffff0008fd0b6240 x27: 0000000000000000
>> > >          x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
>> > >          x23: 00000000040e001f x22: ffffffffffffffff x21: ffff00014f1d4ac0
>> > >          x20: 0000000000000001 x19: ffff00014f1d0000 x18: 0000000000000000
>> > >          x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
>> > >          x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001
>> > >          x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000
>> > >          x8 : 0000001fc0000040 x7 : 0000000000000000 x6 : 0000000000000000
>> > >          x5 : ffff000141931840 x4 : 0000000000000000 x3 : 0000000000000008
>> > >          x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 0000000000010000
>> > >          Call trace:
>> > >           kfree+0x48/0x2a8 (P)
>> > >           vhost_dev_cleanup+0x138/0x2b8 [vhost]
>> > >           vhost_net_release+0xa0/0x1a8 [vhost_net]
>> >
>> > But this trace is in vhost_net, so I'm confused now.
>> > Do you also see the same (vhost_net) trace on 9798752 ("Add linux-next
>> > specific files for 20250721")?
>>
>> I will need to reproduce it, but looking at my logs, I see the following
>> against c76f3c4364fe ("vhost/vsock: Avoid allocating arbitrarily-sized SKBs").
>> The logs are a bit intermixed; probably multiple CPUs were hitting
>> the same code path.
>>
>>            [...]
>>
>>
>> > The initial report contained only vhost_vsock traces IIUC, so I'm
>> > suspecting something in the vhost core.
>>
>> Right, it seems we are hitting the same code path on both
>> vhost_vsock and vhost_net.
>>
>
>I've posted a fix here:
>
>https://lore.kernel.org/virtualization/20250729073916.80647-1-jasowang@redhat.com/T/#u
>
>I think it should address this issue.

Thanks for the fix!

Stefano
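
As a reading aid for the quoted oops: the "Mem abort info" lines are just the fields of ESR = 0x0000000096000005. The sketch below shows the bit arithmetic, assuming the standard ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and, for data aborts, WnR in ISS bit 6 and FSC in ISS bits 5:0). The function name and the small fault-level table are illustrative, not kernel code.

```python
# Sketch only: decode the ESR value printed in the quoted oops.
ESR = 0x0000000096000005  # copied from the quoted log

# For data aborts, FSC values 0x04..0x07 are translation faults at
# levels 0..3. Only this subset is listed here.
TRANSLATION_FAULT_LEVEL = {0x04: 0, 0x05: 1, 0x06: 2, 0x07: 3}


def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into the fields shown in the oops."""
    iss = esr & 0x1FFFFFF            # bits 24:0
    return {
        "EC": (esr >> 26) & 0x3F,    # exception class, bits 31:26
        "IL": (esr >> 25) & 0x1,     # 1 means a 32-bit instruction
        "ISS": iss,
        "WnR": (iss >> 6) & 0x1,     # 0 = read, 1 = write
        "FSC": iss & 0x3F,           # fault status code
    }


fields = decode_esr(ESR)
for name, value in fields.items():
    print(f"{name} = {value:#x}")

level = TRANSLATION_FAULT_LEVEL.get(fields["FSC"])
if level is not None:
    print(f"FSC {fields['FSC']:#04x}: level {level} translation fault")
```

The decoded values agree with the log: EC = 0x25 (data abort from the current EL), WnR = 0 (a read), and FSC = 0x05 (level 1 translation fault), which is what a read through an unmapped, garbage pointer would be expected to produce.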



Thread overview: 12+ messages
2025-07-23 15:04 vhost: linux-next: crash at vhost_dev_cleanup() Breno Leitao
2025-07-23 19:09 ` Michael S. Tsirkin
2025-07-24  7:47 ` Michael S. Tsirkin
2025-07-24  8:14   ` Stefano Garzarella
2025-07-24  8:22     ` Michael S. Tsirkin
2025-07-24  8:44       ` Will Deacon
2025-07-24 12:48         ` Breno Leitao
2025-07-24 12:52           ` Stefano Garzarella
2025-07-24 13:49             ` Breno Leitao
2025-07-29  7:44               ` Jason Wang
2025-07-29  9:10                 ` Breno Leitao
2025-07-29  9:57                 ` Stefano Garzarella
