* vhost: linux-next: crash at vhost_dev_cleanup()
@ 2025-07-23 15:04 Breno Leitao
  2025-07-23 19:09 ` Michael S. Tsirkin
  2025-07-24  7:47 ` Michael S. Tsirkin
  0 siblings, 2 replies; 12+ messages in thread
From: Breno Leitao @ 2025-07-23 15:04 UTC
To: mst, jasowang, eperezma; +Cc: linux-arm-kernel, kvm

Hello,

I've seen a crash in linux-next for a while on my arm64 server, and
I decided to report.

While running stress-ng on linux-next, I see the crash below.

This is happening in a kernel configured with some debug options (KASAN,
LOCKDEP and KMEMLEAK).

Basically running stress-ng in a loop would crash the host in 15-20
minutes:
	# while (true); do stress-ng -r 10 -t 10; done

From the early warning "virt_to_phys used for non-linear address",
I suppose corrupted data is at vq->nheads.

Here is the decoded stack against 9798752 ("Add linux-next specific
files for 20250721")

[ 620.685144] [ T250731] VFIO - User Level meta-driver version: 0.3
[ 622.394448] [ T250254] ------------[ cut here ]------------
[ 622.413492] [ T250254] virt_to_phys used for non-linear address: 000000006e69fe64 (0xcfcecdcccbcac9c8)
[ 622.447771] [ T250254] WARNING: arch/arm64/mm/physaddr.c:15 at __virt_to_phys+0x64/0x90, CPU#57: stress-ng-dev/250254
[ 622.487227] [ T250254] Modules linked in: vhost_vsock(E) vfio_iommu_type1(E) vfio(E) unix_diag(E) sch_fq(E) ghes_edac(E) tls(E) tcp_diag(E) inet_diag(E) act_gact(E) cls_bpf(E) nvidia_cspmu(E) ipmi_ssif(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) ipmi_devintf(E) coresight_stm(E) coresight_funnel(E) coresight_etm4x(E) coresight_tmc(E) stm_core(E) ipmi_msghandler(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) sm3_ce(E) sha3_ce(E) spi_tegra210_quad(E) vhost_net(E) tap(E) tun(E) vhost(E) vhost_iotlb(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) [last unloaded: test_bpf(E)]
[ 622.734524] [ T250254] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
[ 622.734525] [ T250254] Hardware name: ...
[ 622.734526] [ T250254] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
[ 622.734529] [ T250254] pc : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?)
[ 622.734531] [ T250254] lr : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?)
[ 622.734533] [ T250254] sp : ffff800158e8fc60
[ 622.734534] [ T250254] x29: ffff800158e8fc60 x28: ffff0034a7cc7900 x27: 0000000000000000
[ 622.734537] [ T250254] x26: 0000000000000000 x25: ffff0034a7cc7900 x24: 00000000040e001f
[ 622.734539] [ T250254] x23: ffff0010858afb00 x22: cfcecdcccbcac9c8 x21: ffff0033526a01e0
[ 622.734541] [ T250254] x20: 0000000000008000 x19: ffcecdcccbcac9c8 x18: ffff80008149c8e4
[ 622.734543] [ T250254] x17: 0000000000000001 x16: 0000000000000000 x15: 0000000000000003
[ 622.734545] [ T250254] x14: ffff800082962e78 x13: 0000000000000003 x12: ffff003bc6231630
[ 622.734546] [ T250254] x11: 0000000000000000 x10: 0000000000000000 x9 : ed44a220ae716b00
[ 622.734548] [ T250254] x8 : 0001000000000000 x7 : 0720072007200720 x6 : ffff80008018710c
[ 622.734550] [ T250254] x5 : 0000000000000001 x4 : 00000090ecc72ac0 x3 : 0000000000000000
[ 622.734552] [ T250254] x2 : 0000000000000000 x1 : ffff800081a72bc6 x0 : 000000000000004f
[ 622.734554] [ T250254] Call trace:
[ 622.734555] [ T250254] __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?) (P)
[ 622.734557] [ T250254] kfree (/home/user/Devel/linux-next/./include/linux/mm.h:1180 /home/user/Devel/linux-next/mm/slub.c:4871)
[ 622.734562] [ T250254] vhost_dev_cleanup (/home/user/Devel/linux-next/drivers/vhost/vhost.c:506 /home/user/Devel/linux-next/drivers/vhost/vhost.c:542 /home/user/Devel/linux-next/drivers/vhost/vhost.c:1214) vhost
[ 622.734571] [ T250254] vhost_vsock_dev_release (/home/user/Devel/linux-next/drivers/vhost/vsock.c:756) vhost_vsock
[ 622.734575] [ T250254] __fput (/home/user/Devel/linux-next/fs/file_table.c:469)
[ 622.734578] [ T250254] fput_close_sync (/home/user/Devel/linux-next/fs/file_table.c:?)
[ 622.734579] [ T250254] __arm64_sys_close (/home/user/Devel/linux-next/fs/open.c:1589 /home/user/Devel/linux-next/fs/open.c:1572 /home/user/Devel/linux-next/fs/open.c:1572)
[ 622.734584] [ T250254] invoke_syscall (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:50)
[ 622.734589] [ T250254] el0_svc_common (/home/user/Devel/linux-next/./include/linux/thread_info.h:135 /home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:140)
[ 622.734591] [ T250254] do_el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:152)
[ 622.734594] [ T250254] el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:169 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:182 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:880)
[ 622.734600] [ T250254] el0t_64_sync_handler (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:958)
[ 622.734603] [ T250254] el0t_64_sync (/home/user/Devel/linux-next/arch/arm64/kernel/entry.S:596)
[ 622.734605] [ T250254] irq event stamp: 0
[ 622.734606] [ T250254] hardirqs last enabled at (0): 0x0
[ 622.734610] [ T250254] hardirqs last disabled at (0): copy_process (/home/user/Devel/linux-next/kernel/fork.c:?)
[ 622.734614] [ T250254] softirqs last enabled at (0): copy_process (/home/user/Devel/linux-next/kernel/fork.c:?)
[ 622.734616] [ T250254] softirqs last disabled at (0): 0x0
[ 622.734618] [ T250254] ---[ end trace 0000000000000000 ]---
[ 622.734697] [ T250254] Unable to handle kernel paging request at virtual address 003ff3b33312f288
[ 622.734700] [ T250254] Mem abort info:
[ 622.734701] [ T250254] ESR = 0x0000000096000004
[ 622.734702] [ T250254] EC = 0x25: DABT (current EL), IL = 32 bits
[ 622.734704] [ T250254] SET = 0, FnV = 0
[ 622.734705] [ T250254] EA = 0, S1PTW = 0
[ 622.734706] [ T250254] FSC = 0x04: level 0 translation fault
[ 622.734708] [ T250254] Data abort info:
[ 622.734709] [ T250254] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 622.734711] [ T250254] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 622.734712] [ T250254] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 622.734713] [ T250254] [003ff3b33312f288] address between user and kernel address ranges
[ 622.734715] [ T250254] Internal error: Oops: 0000000096000004 [#1] SMP
[ 622.734718] [ T250254] Modules linked in: vhost_vsock(E) vfio_iommu_type1(E) vfio(E) unix_diag(E) sch_fq(E) ghes_edac(E) tls(E) tcp_diag(E) inet_diag(E) act_gact(E) cls_bpf(E) nvidia_cspmu(E) ipmi_ssif(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) ipmi_devintf(E) coresight_stm(E) coresight_funnel(E) coresight_etm4x(E) coresight_tmc(E) stm_core(E) ipmi_msghandler(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) sm3_ce(E) sha3_ce(E) spi_tegra210_quad(E) vhost_net(E) tap(E) tun(E) vhost(E) vhost_iotlb(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) [last unloaded: test_bpf(E)]
[ 622.734740] [ T250254] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
[ 622.734740] [ T250254] Hardware name: ...
[ 622.734741] [ T250254] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
[ 622.734742] [ T250254] pc : kfree (/home/user/Devel/linux-next/./include/linux/page-flags.h:284 /home/user/Devel/linux-next/./include/linux/mm.h:1182 /home/user/Devel/linux-next/mm/slub.c:4871)
[ 622.734745] [ T250254] lr : kfree (/home/user/Devel/linux-next/./include/linux/mm.h:1180 /home/user/Devel/linux-next/mm/slub.c:4871)
[ 622.734747] [ T250254] sp : ffff800158e8fc80
[ 622.734748] [ T250254] x29: ffff800158e8fc90 x28: ffff0034a7cc7900 x27: 0000000000000000
[ 622.734749] [ T250254] x26: 0000000000000000 x25: ffff0034a7cc7900 x24: 00000000040e001f
[ 622.734751] [ T250254] x23: ffff0010858afb00 x22: cfcecdcccbcac9c8 x21: ffff0033526a01e0
[ 622.734752] [ T250254] x20: 003ff3b33312f280 x19: ffff80000acd1a20 x18: ffff80008149c8e4
[ 622.734754] [ T250254] x17: 0000000000000001 x16: 0000000000000000 x15: 0000000000000003
[ 622.734755] [ T250254] x14: ffff800082962e78 x13: 0000000000000003 x12: ffff003bc6231630
[ 622.734757] [ T250254] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffdfc0000000
[ 622.734758] [ T250254] x8 : 003ff3d37312f280 x7 : 0720072007200720 x6 : ffff80008018710c
[ 622.734760] [ T250254] x5 : 0000000000000001 x4 : 00000090ecc72ac0 x3 : 0000000000000000
[ 622.734761] [ T250254] x2 : 0000000000000000 x1 : ffff800081a72bc6 x0 : ffcf4dcccbcac9c8
[ 622.734763] [ T250254] Call trace:
[ 622.734763] [ T250254] kfree (/home/user/Devel/linux-next/./include/linux/page-flags.h:284 /home/user/Devel/linux-next/./include/linux/mm.h:1182 /home/user/Devel/linux-next/mm/slub.c:4871) (P)
[ 622.734766] [ T250254] vhost_dev_cleanup (/home/user/Devel/linux-next/drivers/vhost/vhost.c:506 /home/user/Devel/linux-next/drivers/vhost/vhost.c:542 /home/user/Devel/linux-next/drivers/vhost/vhost.c:1214) vhost
[ 622.734769] [ T250254] vhost_vsock_dev_release (/home/user/Devel/linux-next/drivers/vhost/vsock.c:756) vhost_vsock
[ 622.734771] [ T250254] __fput (/home/user/Devel/linux-next/fs/file_table.c:469)
[ 622.734772] [ T250254] fput_close_sync (/home/user/Devel/linux-next/fs/file_table.c:?)
[ 622.734773] [ T250254] __arm64_sys_close (/home/user/Devel/linux-next/fs/open.c:1589 /home/user/Devel/linux-next/fs/open.c:1572 /home/user/Devel/linux-next/fs/open.c:1572)
[ 622.734776] [ T250254] invoke_syscall (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:50)
[ 622.734778] [ T250254] el0_svc_common (/home/user/Devel/linux-next/./include/linux/thread_info.h:135 /home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:140)
[ 622.734781] [ T250254] do_el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/syscall.c:152)
[ 622.734783] [ T250254] el0_svc (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:169 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:182 /home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:880)
[ 622.734787] [ T250254] el0t_64_sync_handler (/home/user/Devel/linux-next/arch/arm64/kernel/entry-common.c:958)
[ 622.734790] [ T250254] el0t_64_sync (/home/user/Devel/linux-next/arch/arm64/kernel/entry.S:596)
[ 622.734792] [ T250254] Code: f2dffbe9 927abd08 cb141908 8b090114 (f9400688)
All code
========
   0:* e9 fb df f2 08 jmp 0x8f2e000 <-- trapping instruction
   5: bd 7a 92 08 19 mov $0x1908927a,%ebp
   a: 14 cb adc $0xcb,%al
   c: 14 01 adc $0x1,%al
   e: 09 8b 88 06 40 f9 or %ecx,-0x6bff978(%rbx)

Code starting with the faulting instruction
===========================================
   0: 88 06 mov %al,(%rsi)
   2: 40 f9 rex stc
[ 622.734795] [ T250254] SMP: stopping secondary CPUs
[ 622.735089] [ T250254] Starting crashdump kernel...
[ 622.735091] [ T250254] Bye!
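Two details of the oops above can be checked by hand. The following is only
a rough sketch in Python, not part of the original report. It assumes a
little-endian arm64 kernel, and all function and constant names are made up
for illustration; the numeric values are copied from the log. It shows that
the pointer handed to kfree() is a run of incrementing bytes, which looks
more like payload data than an address (consistent with, but not proof of,
the suspected corruption), and that the instruction word in parentheses on
the "Code:" line decodes to a 64-bit load whose target matches the reported
fault address. The "All code" section in the log shows x86 mnemonics, so it
appears to have been disassembled for the wrong architecture.

```python
import struct

# Values copied from the oops above.
BAD_POINTER = 0xCFCECDCCCBCAC9C8    # pointer in the virt_to_phys warning (also x22)
FAULT_INSN = 0xF9400688             # word in parentheses on the "Code:" line
X20_AT_FAULT = 0x003FF3B33312F280   # x20 in the second register dump
FAULT_ADDRESS = 0x003FF3B33312F288  # "Unable to handle kernel paging request at ..."


def pointer_bytes(value):
    """Return the 8 bytes of a 64-bit value in little-endian (memory) order.

    Little-endian is an assumption about the reporter's kernel.
    """
    return list(struct.pack("<Q", value))


def is_incrementing(data):
    """True if every byte is exactly one more than the byte before it."""
    return all(b - a == 1 for a, b in zip(data, data[1:]))


def decode_ldr_x_unsigned_offset(insn):
    """Decode an A64 'LDR Xt, [Xn, #imm]' (64-bit, unsigned offset) word.

    Returns (rt, rn, byte_offset), or None if the word is not that encoding.
    """
    if (insn >> 22) != 0b1111100101:
        return None
    imm12 = (insn >> 10) & 0xFFF
    rn = (insn >> 5) & 0x1F
    rt = insn & 0x1F
    return rt, rn, imm12 * 8  # imm12 is scaled by the 8-byte access size


if __name__ == "__main__":
    data = pointer_bytes(BAD_POINTER)
    print("pointer bytes in memory order:", " ".join(f"{b:02x}" for b in data))
    print("bytes form an incrementing run:", is_incrementing(data))

    rt, rn, offset = decode_ldr_x_unsigned_offset(FAULT_INSN)
    print(f"faulting instruction: ldr x{rt}, [x{rn}, #{offset}]")
    print("x20 + offset == fault address:",
          X20_AT_FAULT + offset == FAULT_ADDRESS)
```

The decoder handles only the single encoding needed here; for the full
"Code:" line a real disassembler built for arm64 would be the better tool.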
* Re: vhost: linux-next: crash at vhost_dev_cleanup()
From: Michael S. Tsirkin @ 2025-07-23 19:09 UTC
To: Breno Leitao; +Cc: jasowang, eperezma, linux-arm-kernel, kvm

On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> Hello,
>
> I've seen a crash in linux-next for a while on my arm64 server, and
> I decided to report.
>
> While running stress-ng on linux-next, I see the crash below.
>
> This is happening in a kernel configured with some debug options (KASAN,
> LOCKDEP and KMEMLEAK).

Thanks for the report! Any chance of a bisect? Much appreciated.

> Basically running stress-ng in a loop would crash the host in 15-20
> minutes:
> # while (true); do stress-ng -r 10 -t 10; done
>
> From the early warning "virt_to_phys used for non-linear address",
> I suppose corrupted data is at vq->nheads.
* Re: vhost: linux-next: crash at vhost_dev_cleanup()
From: Michael S. Tsirkin @ 2025-07-24 7:47 UTC
To: Breno Leitao
Cc: jasowang, eperezma, linux-arm-kernel, kvm, Stefan Hajnoczi, Stefano Garzarella, netdev

On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> Hello,
>
> I've seen a crash in linux-next for a while on my arm64 server, and
> I decided to report.
>
> While running stress-ng on linux-next, I see the crash below.
>
> This is happening in a kernel configured with some debug options (KASAN,
> LOCKDEP and KMEMLEAK).
>
> Basically running stress-ng in a loop would crash the host in 15-20
> minutes:
> # while (true); do stress-ng -r 10 -t 10; done
>
> From the early warning "virt_to_phys used for non-linear address",
> I suppose corrupted data is at vq->nheads.
>
> Here is the decoded stack against 9798752 ("Add linux-next specific
> files for 20250721")
>
>
> [ 620.685144] [ T250731] VFIO - User Level meta-driver version: 0.3
> [ 622.394448] [ T250254] ------------[ cut here ]------------
> [ 622.413492] [ T250254] virt_to_phys used for non-linear address: 000000006e69fe64 (0xcfcecdcccbcac9c8)
> [ 622.447771] [ T250254] WARNING: arch/arm64/mm/physaddr.c:15 at __virt_to_phys+0x64/0x90, CPU#57: stress-ng-dev/250254
> [ 622.487227] [ T250254] Modules linked in: vhost_vsock(E) vfio_iommu_type1(E) vfio(E) unix_diag(E) sch_fq(E) ghes_edac(E) tls(E) tcp_diag(E) inet_diag(E) act_gact(E) cls_bpf(E) nvidia_cspmu(E) ipmi_ssif(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) ipmi_devintf(E) coresight_stm(E) coresight_funnel(E) coresight_etm4x(E) coresight_tmc(E) stm_core(E) ipmi_msghandler(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) sm3_ce(E) sha3_ce(E) spi_tegra210_quad(E) vhost_net(E) tap(E) tun(E) vhost(E) vhost_iotlb(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) [last unloaded: test_bpf(E)]
> [ 622.734524] [ T250254] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> [ 622.734525] [ T250254] Hardware name: ...
> [ 622.734526] [ T250254] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
> [ 622.734529] [ T250254] pc : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?)
> [ 622.734531] [ T250254] lr : __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?)
> [ 622.734533] [ T250254] sp : ffff800158e8fc60
> [ 622.734534] [ T250254] x29: ffff800158e8fc60 x28: ffff0034a7cc7900 x27: 0000000000000000
> [ 622.734537] [ T250254] x26: 0000000000000000 x25: ffff0034a7cc7900 x24: 00000000040e001f
> [ 622.734539] [ T250254] x23: ffff0010858afb00 x22: cfcecdcccbcac9c8 x21: ffff0033526a01e0
> [ 622.734541] [ T250254] x20: 0000000000008000 x19: ffcecdcccbcac9c8 x18: ffff80008149c8e4
> [ 622.734543] [ T250254] x17: 0000000000000001 x16: 0000000000000000 x15: 0000000000000003
> [ 622.734545] [ T250254] x14: ffff800082962e78 x13: 0000000000000003 x12: ffff003bc6231630
> [ 622.734546] [ T250254] x11: 0000000000000000 x10: 0000000000000000 x9 : ed44a220ae716b00
> [ 622.734548] [ T250254] x8 : 0001000000000000 x7 : 0720072007200720 x6 : ffff80008018710c
> [ 622.734550] [ T250254] x5 : 0000000000000001 x4 : 00000090ecc72ac0 x3 : 0000000000000000
> [ 622.734552] [ T250254] x2 : 0000000000000000 x1 : ffff800081a72bc6 x0 : 000000000000004f
> [ 622.734554] [ T250254] Call trace:
> [ 622.734555] [ T250254] __virt_to_phys (/home/user/Devel/linux-next/arch/arm64/mm/physaddr.c:?) (P)
> [ 622.734557] [ T250254] kfree (/home/user/Devel/linux-next/./include/linux/mm.h:1180 /home/user/Devel/linux-next/mm/slub.c:4871)
> [ 622.734562] [ T250254] vhost_dev_cleanup (/home/user/Devel/linux-next/drivers/vhost/vhost.c:506 /home/user/Devel/linux-next/drivers/vhost/vhost.c:542 /home/user/Devel/linux-next/drivers/vhost/vhost.c:1214) vhost
> [ 622.734571] [ T250254] vhost_vsock_dev_release (/home/user/Devel/linux-next/drivers/vhost/vsock.c:756) vhost_vsock

Cc more vsock maintainers.

> [ 622.734575] [ T250254] __fput (/home/user/Devel/linux-next/fs/file_table.c:469)
> [ 622.734578] [ T250254] fput_close_sync (/home/user/Devel/linux-next/fs/file_table.c:?)
* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-24  7:47 ` Michael S. Tsirkin
@ 2025-07-24  8:14   ` Stefano Garzarella
  2025-07-24  8:22     ` Michael S. Tsirkin
  0 siblings, 1 reply; 12+ messages in thread
From: Stefano Garzarella @ 2025-07-24  8:14 UTC (permalink / raw)
  To: Michael S. Tsirkin, Will Deacon
  Cc: Breno Leitao, jasowang, eperezma, linux-arm-kernel, kvm,
      Stefan Hajnoczi, netdev

CCing Will

On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> > Hello,
> >
> > I've seen a crash in linux-next for a while on my arm64 server, and
> > I decided to report.
> >
> > While running stress-ng on linux-next, I see the crash below.
> >
> > This is happening in a kernel configure with some debug options (KASAN,
> > LOCKDEP and KMEMLEAK).
> >
> > Basically running stress-ng in a loop would crash the host in 15-20
> > minutes:
> >     # while (true); do stress-ng -r 10 -t 10; done
> >
> > From the early warning "virt_to_phys used for non-linear address",

mmm, we recently added nonlinear SKBs support in vhost-vsock [1],
@Will can this issue be related?

I checked next-20250721 tag and I confirm that contains those changes.

[1] https://lore.kernel.org/virtualization/20250717090116.11987-1-will@kernel.org/

Thanks,
Stefano

> > I suppose corrupted data is at vq->nheads.
* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-24  8:14   ` Stefano Garzarella
@ 2025-07-24  8:22     ` Michael S. Tsirkin
  2025-07-24  8:44       ` Will Deacon
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2025-07-24  8:22 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Will Deacon, Breno Leitao, jasowang, eperezma, linux-arm-kernel,
      kvm, Stefan Hajnoczi, netdev

On Thu, Jul 24, 2025 at 10:14:36AM +0200, Stefano Garzarella wrote:
> CCing Will
>
> On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> > > Hello,
> > >
> > > I've seen a crash in linux-next for a while on my arm64 server, and
> > > I decided to report.
> > >
> > > While running stress-ng on linux-next, I see the crash below.
> > >
> > > This is happening in a kernel configure with some debug options (KASAN,
> > > LOCKDEP and KMEMLEAK).
> > >
> > > Basically running stress-ng in a loop would crash the host in 15-20
> > > minutes:
> > >     # while (true); do stress-ng -r 10 -t 10; done
> > >
> > > From the early warning "virt_to_phys used for non-linear address",
>
> mmm, we recently added nonlinear SKBs support in vhost-vsock [1],
> @Will can this issue be related?

Good point.

Breno, if bisecting is too much trouble, would you mind testing the commits
c76f3c4364fe523cd2782269eab92529c86217aa
and
c7991b44d7b44f9270dec63acd0b2965d29aab43
and telling us if this reproduces?

> I checked next-20250721 tag and I confirm that contains those changes.
>
> [1] https://lore.kernel.org/virtualization/20250717090116.11987-1-will@kernel.org/
>
> Thanks,
> Stefano
>
> > > I suppose corrupted data is at vq->nheads.
* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-24  8:22     ` Michael S. Tsirkin
@ 2025-07-24  8:44       ` Will Deacon
  2025-07-24 12:48         ` Breno Leitao
  0 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2025-07-24  8:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefano Garzarella, Breno Leitao, jasowang, eperezma,
      linux-arm-kernel, kvm, Stefan Hajnoczi, netdev

On Thu, Jul 24, 2025 at 04:22:15AM -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 24, 2025 at 10:14:36AM +0200, Stefano Garzarella wrote:
> > CCing Will

Thanks.

> > On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> > > > Hello,
> > > >
> > > > I've seen a crash in linux-next for a while on my arm64 server, and
> > > > I decided to report.
> > > >
> > > > While running stress-ng on linux-next, I see the crash below.
> > > >
> > > > This is happening in a kernel configure with some debug options (KASAN,
> > > > LOCKDEP and KMEMLEAK).
> > > >
> > > > Basically running stress-ng in a loop would crash the host in 15-20
> > > > minutes:
> > > >     # while (true); do stress-ng -r 10 -t 10; done
> > > >
> > > > From the early warning "virt_to_phys used for non-linear address",
> >
> > mmm, we recently added nonlinear SKBs support in vhost-vsock [1],
> > @Will can this issue be related?
>
> Good point.
>
> Breno, if bisecting is too much trouble, would you mind testing the commits
> c76f3c4364fe523cd2782269eab92529c86217aa
> and
> c7991b44d7b44f9270dec63acd0b2965d29aab43
> and telling us if this reproduces?

That's definitely worth doing, but we should be careful not to confuse
the "non-linear address" from the warning (which refers to virtual
addresses that lie outside of the linear mapping of memory, e.g. in the
vmalloc space) and "non-linear SKBs" which refer to SKBs with fragment
pages.

Breno -- when you say you've been seeing this "for a while", what's the
earliest kernel you know you saw it on?

> > > > I suppose corrupted data is at vq->nheads.
> > > >
> > > > Here is the decoded stack against 9798752 ("Add linux-next specific
> > > > files for 20250721")
> > > >
> > > >
> > > > [ 620.685144] [ T250731] VFIO - User Level meta-driver version: 0.3
> > > > [ 622.394448] [ T250254] ------------[ cut here ]------------
> > > > [ 622.413492] [ T250254] virt_to_phys used for non-linear address: 000000006e69fe64 (0xcfcecdcccbcac9c8)

So here's the bad (non-linear) pointer. Do you know if 0xcfcecdcccbcac9c8
correlates with the packet data that stress-ng is generating? I wonder if
we're somehow overflowing vq->iov[].

Will

^ permalink raw reply [flat|nested] 12+ messages in thread
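Editorial aside on the question above: one quick way to judge whether a bad pointer looks like payload data rather than a real address is to print its bytes in memory order. The sketch below is illustrative only and is not from the thread. The helper names are made up, and it assumes little-endian byte order, which is what arm64 Linux normally runs with. It says nothing about whether stress-ng produced the data, only whether the value has the shape of a counting fill pattern.

```python
# Hypothetical helpers (not from the thread): inspect a suspicious
# 64-bit value as the bytes it would occupy in little-endian memory.

def le_bytes(value: int, width: int = 8) -> bytes:
    """Return the in-memory byte order of `value` on a little-endian CPU."""
    return value.to_bytes(width, "little")


def is_incrementing(data: bytes) -> bool:
    """True if every byte is the previous byte plus one (mod 256)."""
    return all((b - a) % 256 == 1 for a, b in zip(data, data[1:]))


# The pointer that __virt_to_phys() complained about in the report.
bad_pointer = 0xCFCECDCCCBCAC9C8

raw = le_bytes(bad_pointer)
print(raw.hex(" "))          # bytes in memory order
print(is_incrementing(raw))  # does it look like a counting fill pattern?
```

For 0xcfcecdcccbcac9c8 this prints "c8 c9 ca cb cc cd ce cf" followed by "True": eight consecutive byte values, which looks more like buffer contents than a heap pointer.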
* Re: vhost: linux-next: crash at vhost_dev_cleanup() 2025-07-24 8:44 ` Will Deacon @ 2025-07-24 12:48 ` Breno Leitao 2025-07-24 12:52 ` Stefano Garzarella 0 siblings, 1 reply; 12+ messages in thread From: Breno Leitao @ 2025-07-24 12:48 UTC (permalink / raw) To: Will Deacon Cc: Michael S. Tsirkin, Stefano Garzarella, jasowang, eperezma, linux-arm-kernel, kvm, Stefan Hajnoczi, netdev On Thu, Jul 24, 2025 at 09:44:38AM +0100, Will Deacon wrote: > > > On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote: > > > > > > > > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote: > > > > > Hello, > > > > > > > > > > I've seen a crash in linux-next for a while on my arm64 server, and > > > > > I decided to report. > > > > > > > > > > While running stress-ng on linux-next, I see the crash below. > > > > > > > > > > This is happening in a kernel configure with some debug options (KASAN, > > > > > LOCKDEP and KMEMLEAK). > > > > > > > > > > Basically running stress-ng in a loop would crash the host in 15-20 > > > > > minutes: > > > > > # while (true); do stress-ng -r 10 -t 10; done > > > > > > > > > > >From the early warning "virt_to_phys used for non-linear address", > > > > > > mmm, we recently added nonlinear SKBs support in vhost-vsock [1], > > > @Will can this issue be related? > > > > Good point. > > > > Breno, if bisecting is too much trouble, would you mind testing the commits > > c76f3c4364fe523cd2782269eab92529c86217aa > > and > > c7991b44d7b44f9270dec63acd0b2965d29aab43 > > and telling us if this reproduces? > > That's definitely worth doing, but we should be careful not to confuse > the "non-linear address" from the warning (which refers to virtual > addresses that lie outside of the linear mapping of memory, e.g. in the > vmalloc space) and "non-linear SKBs" which refer to SKBs with fragment > pages. I've tested both commits above, and I see the crash on both commits above, thus, the problem reproduces in both cases. 
The only difference I noted is the fact that I haven't seen the warning before the crash. Log against c76f3c4364fe ("vhost/vsock: Avoid allocating arbitrarily-sized SKBs") Unable to handle kernel paging request at virtual address 0000001fc0000048 Mem abort info: ESR = 0x0000000096000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 64k pages, 48-bit VAs, pgdp=0000000cdcf2da00 [0000001fc0000048] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 Internal error: Oops: 0000000096000005 [#1] SMP Modules linked in: vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic unix_diag vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf nvidia_c CPU: 34 UID: 0 PID: 1727297 Comm: stress-ng-dev Kdump: loaded Not tainted 6.16.0-rc6-upstream-00027-gc76f3c4364fe #19 NONE pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) pc : kfree+0x48/0x2a8 lr : vhost_dev_cleanup+0x138/0x2b8 [vhost] sp : ffff80013a0cfcd0 x29: ffff80013a0cfcd0 x28: ffff0008fd0b6240 x27: 0000000000000000 x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 x23: 00000000040e001f x22: ffffffffffffffff x21: ffff00014f1d4ac0 x20: 0000000000000001 x19: ffff00014f1d0000 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001 x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000 x8 : 0000001fc0000040 x7 : 0000000000000000 x6 : 0000000000000000 x5 : ffff000141931840 x4 : 0000000000000000 x3 : 0000000000000008 x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 0000000000010000 Call trace: kfree+0x48/0x2a8 (P) vhost_dev_cleanup+0x138/0x2b8 [vhost] vhost_net_release+0xa0/0x1a8 
[vhost_net] __fput+0xfc/0x2f0 fput_close_sync+0x38/0xc8 __arm64_sys_close+0xb4/0x108 invoke_syscall+0x4c/0xd0 do_el0_svc+0x80/0xb0 el0_svc+0x3c/0xd0 el0t_64_sync_handler+0x70/0x100 el0t_64_sync+0x170/0x178 Code: 8b080008 f2dffbe9 d350fd08 8b081928 (f9400509) Log against c7991b44d7b4 ("vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers") Unable to handle kernel paging request at virtual address 0010502f8f8f4f08 Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [0010502f8f8f4f08] address between user and kernel address ranges Internal error: Oops: 0000000096000004 [#1] SMP Modules linked in: vhost_vsock vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf ipmi_s CPU: 47 UID: 0 PID: 1239699 Comm: stress-ng-dev Kdump: loaded Tainted: G W 6.16.0-rc6-upstream-00035-gc7991b44d7b4 #18 NONE Tainted: [W]=WARN pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) pc : kfree+0x48/0x2a8 lr : vhost_dev_cleanup+0x138/0x2b8 [vhost] sp : ffff80016c0cfcd0 x29: ffff80016c0cfcd0 x28: ffff001ad6210d80 x27: 0000000000000000 x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 x23: 00000000040e001f x22: ffffffffffffffff x21: ffff001bb76f00c0 x20: 0000000000000000 x19: ffff001bb76f0000 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001 x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000 x8 : 0010502f8f8f4f00 x7 : 0000000000000000 x6 : 0000000000000000 x5 : ffff00012e7e2128 x4 : 0000000000000000 x3 : 0000000000000008 x2 : ffffffffffffffff x1 : 
ffffffffffffffff x0 : 41403f3e3d3c3b3a Call trace: kfree+0x48/0x2a8 (P) vhost_dev_cleanup+0x138/0x2b8 [vhost] vhost_net_release+0xa0/0x1a8 [vhost_net] __fput+0xfc/0x2f0 fput_close_sync+0x38/0xc8 __arm64_sys_close+0xb4/0x108 invoke_syscall+0x4c/0xd0 do_el0_svc+0x80/0xb0 el0_svc+0x3c/0xd0 el0t_64_sync_handler+0x70/0x100 el0t_64_sync+0x170/0x178 Code: 8b080008 f2dffbe9 d350fd08 8b081928 (f9400509) > Breno -- when you say you've been seeing this "for a while", what's the > earliest kernel you know you saw it on? Looking at my logs, the oldest kernel on which I saw it was net-next from 20250717, which was around the time I decided to test net-next in preparation for 6.17, so that is not very helpful. Sorry. ^ permalink raw reply [flat|nested] 12+ messages in thread
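A note on the two oopses above, offered as a sketch rather than a confirmed analysis: both faulting addresses can be reproduced from the register dumps if one assumes that x0 still holds the pointer handed to kfree(), that x9 holds the vmemmap base, and that the instructions in the "Code:" line perform a virt_to_page()-style lookup (64 KiB pages, 64-byte struct page) followed by a load at offset 8. All constants below are read from the logs; the function and macro names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values read from the oops logs above; their interpretation is an assumption. */
#define LOG_VMEMMAP_BASE  0xffffffdfc0000000ULL /* x9 in both register dumps */
#define LOG_LINEAR_OFFSET 0x0001000000000000ULL /* -PAGE_OFFSET, assuming 48-bit VAs */
#define LOG_PAGE_SHIFT    16                    /* "64k pages" in the first oops */
#define LOG_SPAGE_SHIFT   6                     /* add ..., lsl #6: 64-byte struct page assumed */
#define LOG_LOAD_OFFSET   8                     /* faulting insn f9400509: ldr x9, [x8, #8] */

/* Hypothetical helper: the address kfree() would first load from for `ptr`. */
uint64_t kfree_first_load(uint64_t ptr)
{
	/* 8b080008 / d350fd08: add the linear-map offset, then shift down to a page index */
	uint64_t index = (ptr + LOG_LINEAR_OFFSET) >> LOG_PAGE_SHIFT;
	/* 8b081928: scale the index by the struct page size and add the vmemmap base */
	uint64_t page = LOG_VMEMMAP_BASE + (index << LOG_SPAGE_SHIFT);

	return page + LOG_LOAD_OFFSET;
}

/* Compare against the "Unable to handle kernel paging request" addresses. */
void check_against_logs(void)
{
	/* c76f3c4364fe log: x0 = 0000000000010000 */
	assert(kfree_first_load(0x0000000000010000ULL) == 0x0000001fc0000048ULL);
	/* c7991b44d7b4 log: x0 = 41403f3e3d3c3b3a */
	assert(kfree_first_load(0x41403f3e3d3c3b3aULL) == 0x0010502f8f8f4f08ULL);
}
```

Calling check_against_logs() from a small main() passes for both logs. Under these assumptions, the values given to kfree() (0x10000 and the descending byte pattern 0x41403f3e3d3c3b3a) look like data rather than heap pointers, which would be consistent with the original report's suspicion that a field freed in vhost_dev_cleanup() holds corrupted or uninitialised contents.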
* Re: vhost: linux-next: crash at vhost_dev_cleanup() 2025-07-24 12:48 ` Breno Leitao @ 2025-07-24 12:52 ` Stefano Garzarella 2025-07-24 13:49 ` Breno Leitao 0 siblings, 1 reply; 12+ messages in thread From: Stefano Garzarella @ 2025-07-24 12:52 UTC (permalink / raw) To: Breno Leitao Cc: Will Deacon, Michael S. Tsirkin, jasowang, eperezma, linux-arm-kernel, kvm, Stefan Hajnoczi, netdev On Thu, 24 Jul 2025 at 14:48, Breno Leitao <leitao@debian.org> wrote: > > On Thu, Jul 24, 2025 at 09:44:38AM +0100, Will Deacon wrote: > > > > On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote: > > > > > > > > > > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote: > > > > > > Hello, > > > > > > > > > > > > I've seen a crash in linux-next for a while on my arm64 server, and > > > > > > I decided to report. > > > > > > > > > > > > While running stress-ng on linux-next, I see the crash below. > > > > > > > > > > > > This is happening in a kernel configure with some debug options (KASAN, > > > > > > LOCKDEP and KMEMLEAK). > > > > > > > > > > > > Basically running stress-ng in a loop would crash the host in 15-20 > > > > > > minutes: > > > > > > # while (true); do stress-ng -r 10 -t 10; done > > > > > > > > > > > > >From the early warning "virt_to_phys used for non-linear address", > > > > > > > > mmm, we recently added nonlinear SKBs support in vhost-vsock [1], > > > > @Will can this issue be related? > > > > > > Good point. > > > > > > Breno, if bisecting is too much trouble, would you mind testing the commits > > > c76f3c4364fe523cd2782269eab92529c86217aa > > > and > > > c7991b44d7b44f9270dec63acd0b2965d29aab43 > > > and telling us if this reproduces? > > > > That's definitely worth doing, but we should be careful not to confuse > > the "non-linear address" from the warning (which refers to virtual > > addresses that lie outside of the linear mapping of memory, e.g. 
in the > > vmalloc space) and "non-linear SKBs" which refer to SKBs with fragment > > pages. > > I've tested both commits above, and I see the crash on both commits > above, thus, the problem reproduces in both cases. The only difference > I noted is the fact that I haven't seen the warning before the crash. > > > Log against c76f3c4364fe ("vhost/vsock: Avoid allocating > arbitrarily-sized SKBs") > > Unable to handle kernel paging request at virtual address 0000001fc0000048 > Mem abort info: > ESR = 0x0000000096000005 > EC = 0x25: DABT (current EL), IL = 32 bits > SET = 0, FnV = 0 > EA = 0, S1PTW = 0 > FSC = 0x05: level 1 translation fault > Data abort info: > ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 > CM = 0, WnR = 0, TnD = 0, TagAccess = 0 > GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 > user pgtable: 64k pages, 48-bit VAs, pgdp=0000000cdcf2da00 > [0000001fc0000048] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 > Internal error: Oops: 0000000096000005 [#1] SMP > Modules linked in: vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic unix_diag vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf nvidia_c > CPU: 34 UID: 0 PID: 1727297 Comm: stress-ng-dev Kdump: loaded Not tainted 6.16.0-rc6-upstream-00027-gc76f3c4364fe #19 NONE > pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) > pc : kfree+0x48/0x2a8 > lr : vhost_dev_cleanup+0x138/0x2b8 [vhost] > sp : ffff80013a0cfcd0 > x29: ffff80013a0cfcd0 x28: ffff0008fd0b6240 x27: 0000000000000000 > x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 > x23: 00000000040e001f x22: ffffffffffffffff x21: ffff00014f1d4ac0 > x20: 0000000000000001 x19: ffff00014f1d0000 x18: 0000000000000000 > x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 > x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001 > x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000 > x8 : 
0000001fc0000040 x7 : 0000000000000000 x6 : 0000000000000000 > x5 : ffff000141931840 x4 : 0000000000000000 x3 : 0000000000000008 > x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 0000000000010000 > Call trace: > kfree+0x48/0x2a8 (P) > vhost_dev_cleanup+0x138/0x2b8 [vhost] > vhost_net_release+0xa0/0x1a8 [vhost_net] But here is the vhost_net, so I'm confused now. Do you see the same (vhost_net) also on 9798752 ("Add linux-next specific files for 20250721") ? The initial report contained only vhost_vsock traces IIUC, so I'm suspecting something in the vhost core. Thanks, Stefano > __fput+0xfc/0x2f0 > fput_close_sync+0x38/0xc8 > __arm64_sys_close+0xb4/0x108 > invoke_syscall+0x4c/0xd0 > do_el0_svc+0x80/0xb0 > el0_svc+0x3c/0xd0 > el0t_64_sync_handler+0x70/0x100 > el0t_64_sync+0x170/0x178 > Code: 8b080008 f2dffbe9 d350fd08 8b081928 (f9400509) > > Log against c7991b44d7b4 ("vsock/virtio: Allocate nonlinear SKBs for > handling large transmit buffers") > > Unable to handle kernel paging request at virtual address 0010502f8f8f4f08 > Mem abort info: > ESR = 0x0000000096000004 > EC = 0x25: DABT (current EL), IL = 32 bits > SET = 0, FnV = 0 > EA = 0, S1PTW = 0 > FSC = 0x04: level 0 translation fault > Data abort info: > ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 > CM = 0, WnR = 0, TnD = 0, TagAccess = 0 > GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 > [0010502f8f8f4f08] address between user and kernel address ranges > Internal error: Oops: 0000000096000004 [#1] SMP > Modules linked in: vhost_vsock vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf ipmi_s > CPU: 47 UID: 0 PID: 1239699 Comm: stress-ng-dev Kdump: loaded Tainted: G W 6.16.0-rc6-upstream-00035-gc7991b44d7b4 #18 NONE > Tainted: [W]=WARN > pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) > pc : kfree+0x48/0x2a8 > lr : vhost_dev_cleanup+0x138/0x2b8 [vhost] > sp : 
ffff80016c0cfcd0 > x29: ffff80016c0cfcd0 x28: ffff001ad6210d80 x27: 0000000000000000 > x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 > x23: 00000000040e001f x22: ffffffffffffffff x21: ffff001bb76f00c0 > x20: 0000000000000000 x19: ffff001bb76f0000 x18: 0000000000000000 > x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 > x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001 > x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000 > x8 : 0010502f8f8f4f00 x7 : 0000000000000000 x6 : 0000000000000000 > x5 : ffff00012e7e2128 x4 : 0000000000000000 x3 : 0000000000000008 > x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 41403f3e3d3c3b3a > Call trace: > kfree+0x48/0x2a8 (P) > vhost_dev_cleanup+0x138/0x2b8 [vhost] > vhost_net_release+0xa0/0x1a8 [vhost_net] > __fput+0xfc/0x2f0 > fput_close_sync+0x38/0xc8 > __arm64_sys_close+0xb4/0x108 > invoke_syscall+0x4c/0xd0 > do_el0_svc+0x80/0xb0 > el0_svc+0x3c/0xd0 > el0t_64_sync_handler+0x70/0x100 > el0t_64_sync+0x170/0x178 > Code: 8b080008 f2dffbe9 d350fd08 8b081928 (f9400509) > > > > Breno -- when you say you've been seeing this "for a while", what's the > > earliest kernel you know you saw it on? > > Looking at my logs, the older kernel that I saw it was net-next from > 20250717, which was around the time I decided to test net-next in > preparation for 6.17, so, not very helpful. Sorry. > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: vhost: linux-next: crash at vhost_dev_cleanup() 2025-07-24 12:52 ` Stefano Garzarella @ 2025-07-24 13:49 ` Breno Leitao 2025-07-29 7:44 ` Jason Wang 0 siblings, 1 reply; 12+ messages in thread From: Breno Leitao @ 2025-07-24 13:49 UTC (permalink / raw) To: Stefano Garzarella Cc: Will Deacon, Michael S. Tsirkin, jasowang, eperezma, linux-arm-kernel, kvm, Stefan Hajnoczi, netdev On Thu, Jul 24, 2025 at 02:52:08PM +0200, Stefano Garzarella wrote: > On Thu, 24 Jul 2025 at 14:48, Breno Leitao <leitao@debian.org> wrote: > > > > On Thu, Jul 24, 2025 at 09:44:38AM +0100, Will Deacon wrote: > > > > > On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote: > > > > > > > > > > > > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote: > > > > > > > Hello, > > > > > > > > > > > > > > I've seen a crash in linux-next for a while on my arm64 server, and > > > > > > > I decided to report. > > > > > > > > > > > > > > While running stress-ng on linux-next, I see the crash below. > > > > > > > > > > > > > > This is happening in a kernel configure with some debug options (KASAN, > > > > > > > LOCKDEP and KMEMLEAK). > > > > > > > > > > > > > > Basically running stress-ng in a loop would crash the host in 15-20 > > > > > > > minutes: > > > > > > > # while (true); do stress-ng -r 10 -t 10; done > > > > > > > > > > > > > > >From the early warning "virt_to_phys used for non-linear address", > > > > > > > > > > mmm, we recently added nonlinear SKBs support in vhost-vsock [1], > > > > > @Will can this issue be related? > > > > > > > > Good point. > > > > > > > > Breno, if bisecting is too much trouble, would you mind testing the commits > > > > c76f3c4364fe523cd2782269eab92529c86217aa > > > > and > > > > c7991b44d7b44f9270dec63acd0b2965d29aab43 > > > > and telling us if this reproduces? 
> > > > > > That's definitely worth doing, but we should be careful not to confuse > > > the "non-linear address" from the warning (which refers to virtual > > > addresses that lie outside of the linear mapping of memory, e.g. in the > > > vmalloc space) and "non-linear SKBs" which refer to SKBs with fragment > > > pages. > > > > I've tested both commits above, and I see the crash on both commits > > above, thus, the problem reproduces in both cases. The only difference > > I noted is the fact that I haven't seen the warning before the crash. > > > > > > Log against c76f3c4364fe ("vhost/vsock: Avoid allocating > > arbitrarily-sized SKBs") > > > > Unable to handle kernel paging request at virtual address 0000001fc0000048 > > Mem abort info: > > ESR = 0x0000000096000005 > > EC = 0x25: DABT (current EL), IL = 32 bits > > SET = 0, FnV = 0 > > EA = 0, S1PTW = 0 > > FSC = 0x05: level 1 translation fault > > Data abort info: > > ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 > > CM = 0, WnR = 0, TnD = 0, TagAccess = 0 > > GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 > > user pgtable: 64k pages, 48-bit VAs, pgdp=0000000cdcf2da00 > > [0000001fc0000048] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 > > Internal error: Oops: 0000000096000005 [#1] SMP > > Modules linked in: vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic unix_diag vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf nvidia_c > > CPU: 34 UID: 0 PID: 1727297 Comm: stress-ng-dev Kdump: loaded Not tainted 6.16.0-rc6-upstream-00027-gc76f3c4364fe #19 NONE > > pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) > > pc : kfree+0x48/0x2a8 > > lr : vhost_dev_cleanup+0x138/0x2b8 [vhost] > > sp : ffff80013a0cfcd0 > > x29: ffff80013a0cfcd0 x28: ffff0008fd0b6240 x27: 0000000000000000 > > x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 > > x23: 00000000040e001f x22: ffffffffffffffff x21: 
ffff00014f1d4ac0 > > x20: 0000000000000001 x19: ffff00014f1d0000 x18: 0000000000000000 > > x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 > > x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001 > > x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000 > > x8 : 0000001fc0000040 x7 : 0000000000000000 x6 : 0000000000000000 > > x5 : ffff000141931840 x4 : 0000000000000000 x3 : 0000000000000008 > > x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 0000000000010000 > > Call trace: > > kfree+0x48/0x2a8 (P) > > vhost_dev_cleanup+0x138/0x2b8 [vhost] > > vhost_net_release+0xa0/0x1a8 [vhost_net] > > But here is the vhost_net, so I'm confused now. > Do you see the same (vhost_net) also on 9798752 ("Add linux-next > specific files for 20250721") ? I will need to reproduce it, but looking at my logs, I see the following against c76f3c4364fe ("vhost/vsock: Avoid allocating arbitrarily-sized SKBs"). The logs are a bit intermixed, probably because multiple CPUs were hitting the same code path.
virt_to_phys used for non-linear address: 000000001b662678 (0xffe61984a460) WARNING: CPU: 15 PID: 112846 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x80/0xa8 Modules linked in: vhost_vsock(E) vhost(E) vhost_iotlb(E) ghes_edac(E) tls(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) ipmi_ssif(E) ipmi_devintf(E) ipmi_msghandler(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) CPU: 15 UID: 0 PID: 112846 Comm: stress-ng-dev Kdump: loaded Tainted: G W E N 6.16.0-rc6-upstream-00027-gc76f3c4364fe #16 PREEMPT(none) Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) pc : __virt_to_phys+0x80/0xa8 lr : __virt_to_phys+0x7c/0xa8 sp : ffff8001184d7a30 x29: ffff8001184d7a30 x28: 00000000000045d8 x27: 1fffe000e7e88014 x26: 1fffe000e7e888f7 x25: ffff0007e578bf00 x24: 1fffe000e7e88013 x23: 0000000000000000 x22: 0000ffe61984a460 x21: ffff00073f440098 x20: ffffff1000080000 x19: 0000ffe61984a460 x18: 0000000000000002 x17: 6666783028203837 x16: 3632363662313030 x15: 0000000000000001 x14: 1fffe006d52e90f2 x13: 0000000000000000 x12: 0000000000000000 x11: ffff6006d52e90f3 x10: 0000000000000002 x9 : cfc659a21c727d00 x8 : ffff800083c19000 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff8001184d7398 x4 : ffff800084866d60 x3 : ffff8000805fdd94 x2 : 0000000000000001 x1 : 0000000000000004 x0 : 000000000000004b Call trace: __virt_to_phys+0x80/0xa8 (P) kfree+0xac/0x4b0 vhost_dev_cleanup+0x484/0x8b0 [vhost] vhost_vsock_dev_release+0x2f4/0x358 [vhost_vsock] __fput+0x2b4/0x608 fput_close_sync+0xe8/0x1e0 __arm64_sys_close+0x84/0xd0 invoke_syscall+0x8c/0x208 do_el0_svc+0x128/0x1a0 el0_svc+0x58/0x160 el0t_64_sync_handler+0x78/0x108 el0t_64_sync+0x198/0x1a0 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffff8000801d876c>] copy_process+0xd5c/0x29f8 softirqs last enabled at (0): [<ffff8000801d879c>] 
copy_process+0xd8c/0x29f8 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace 0000000000000000 ]--- Unable to handle kernel paging request at virtual address 0000040053791288 ------------[ cut here ]------------ lr : kfree+0xac/0x4b0 virt_to_phys used for non-linear address: 00000000290839fd (0x2500000000) WARNING: CPU: 41 PID: 112845 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x80/0xa8 Modules linked in: vhost_vsock(E) vhost(E) vhost_iotlb(E) ghes_edac(E) tls(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) ipmi_ssif(E) ipmi_devintf(E) ipmi_msghandler(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) CPU: 41 UID: 0 PID: 112845 Comm: stress-ng-dev Kdump: loaded Tainted: G W E N 6.16.0-rc6-upstream-00027-gc76f3c4364fe #16 PREEMPT(none) x23: 0000000000000001 Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) pc : __virt_to_phys+0x80/0xa8 lr : __virt_to_phys+0x7c/0xa8 sp : ffff8001a8277a30 x29: ffff8001a8277a30 x28: 00000000000045d8 x27: 1fffe000e7e3c014 x26: 1fffe000e7e3c8f7 x25: ffff0007bcff0000 x24: 1fffe000e7e3c013 x23: 0000000000000000 x22: 0000002500000000 x21: ffff00073f1e0098 x20: ffffff1000080000 x19: 0000002500000000 x18: 0000000000000004 x17: 00000000ffffffff x16: 0000000000000001 x15: 0000000000000001 x14: 1fffe006d53920f2 x13: 0000000000000000 x12: 0000000000000000 sp : ffff8001184d7a50 x29: ffff8001184d7a60 x28: 00000000000045d8 x27: 1fffe000e7e88014 x26: 1fffe000e7e888f7 x25: ffff0007e578bf00 x24: 1fffe000e7e88013 x23: 0000000000000000 x22: 0000ffe61984a460 x21: ffff00073f440098 x20: 0000040053791280 x19: ffff80000d0b8bbc x18: 0000000000000002 x17: 6666783028203837 x16: 3632363662313030 x15: 0000000000000001 x22: 0000ffe6199a459d x14: 1fffe006d52e90f2 x21: ffff00073f7f0098 x20: ffffff1000080000 x19: 0000ffe6199a459d x18: 0000000000000004 x17: 54455320203b2d2c x16: 0000000000000011 x15: 
0000000000000001 x14: 1fffe006d52bb8f2 x13: 0000000000000000 x12: 0000000000000000 x11: ffff6006d52bb8f3 x10: dfff800000000000 x9 : 77521a2bd3e0be00 x8 : ffff800083c19000 x7 : 0000000000000000 x6 : ffff80008036bc2c x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffff8000805fdd94 x2 : 0000000000000001 x1 : 0000000000000004 x0 : 000000000000004b Call trace: __virt_to_phys+0x80/0xa8 (P) kfree+0xac/0x4b0 vhost_dev_cleanup+0x484/0x8b0 [vhost] vhost_vsock_dev_release+0x2f4/0x358 [vhost_vsock] __fput+0x2b4/0x608 x11: ffff6006d53920f3 fput_close_sync+0xe8/0x1e0 __arm64_sys_close+0x84/0xd0 invoke_syscall+0x8c/0x208 do_el0_svc+0x128/0x1a0 el0_svc+0x58/0x160 el0t_64_sync_handler+0x78/0x108 el0t_64_sync+0x198/0x1a0 irq event stamp: 0 x13: 0000000000000000 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffff8000801d876c>] copy_process+0xd5c/0x29f8 > The initial report contained only vhost_vsock traces IIUC, so I'm > suspecting something in the vhost core. Right, it seems we are hitting the same code path on both vhost_vsock and vhost_net. ^ permalink raw reply [flat|nested] 12+ messages in thread
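To illustrate the distinction Will draws earlier in the thread, here is a minimal sketch of the kind of range test behind the "virt_to_phys used for non-linear address" warning. It is modelled on arm64's __is_lm_address() and assumes 48-bit kernel virtual addresses with the linear map occupying the lower half of the kernel address space; the two constants and the function names are assumptions for illustration, not values taken from these logs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for 48-bit kernel VAs; not taken from this thread. */
#define SKETCH_PAGE_OFFSET 0xffff000000000000ULL /* start of the linear map */
#define SKETCH_PAGE_END    0xffff800000000000ULL /* end of the linear map */

/*
 * Range test in the style of __is_lm_address(). The subtraction is
 * unsigned, so addresses below SKETCH_PAGE_OFFSET wrap to large values
 * and fail the comparison as well.
 */
bool in_linear_map(uint64_t addr)
{
	return (addr - SKETCH_PAGE_OFFSET) < (SKETCH_PAGE_END - SKETCH_PAGE_OFFSET);
}

/* Addresses copied from the warnings and register dumps above. */
void check_log_addresses(void)
{
	assert(!in_linear_map(0x0000ffe61984a460ULL)); /* first warning (CPU 15) */
	assert(!in_linear_map(0x0000002500000000ULL)); /* second warning (CPU 41) */
	assert(in_linear_map(0xffff0007e578bf00ULL));  /* x25, for contrast */
}
```

Under these assumptions, both values reported by the warning fall outside the linear map, while an ordinary slab-looking pointer such as x25 falls inside it. The test concerns only the virtual address range and says nothing about whether an SKB is linear or fragmented.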
* Re: vhost: linux-next: crash at vhost_dev_cleanup() 2025-07-24 13:49 ` Breno Leitao @ 2025-07-29 7:44 ` Jason Wang 2025-07-29 9:10 ` Breno Leitao 2025-07-29 9:57 ` Stefano Garzarella 0 siblings, 2 replies; 12+ messages in thread From: Jason Wang @ 2025-07-29 7:44 UTC (permalink / raw) To: Breno Leitao Cc: Stefano Garzarella, Will Deacon, Michael S. Tsirkin, eperezma, linux-arm-kernel, kvm, Stefan Hajnoczi, netdev On Thu, Jul 24, 2025 at 9:50 PM Breno Leitao <leitao@debian.org> wrote: > > On Thu, Jul 24, 2025 at 02:52:08PM +0200, Stefano Garzarella wrote: > > On Thu, 24 Jul 2025 at 14:48, Breno Leitao <leitao@debian.org> wrote: > > > > > > On Thu, Jul 24, 2025 at 09:44:38AM +0100, Will Deacon wrote: > > > > > > On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote: > > > > > > > > > > > > > > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote: > > > > > > > > Hello, > > > > > > > > > > > > > > > > I've seen a crash in linux-next for a while on my arm64 server, and > > > > > > > > I decided to report. > > > > > > > > > > > > > > > > While running stress-ng on linux-next, I see the crash below. > > > > > > > > > > > > > > > > This is happening in a kernel configure with some debug options (KASAN, > > > > > > > > LOCKDEP and KMEMLEAK). > > > > > > > > > > > > > > > > Basically running stress-ng in a loop would crash the host in 15-20 > > > > > > > > minutes: > > > > > > > > # while (true); do stress-ng -r 10 -t 10; done > > > > > > > > > > > > > > > > >From the early warning "virt_to_phys used for non-linear address", > > > > > > > > > > > > mmm, we recently added nonlinear SKBs support in vhost-vsock [1], > > > > > > @Will can this issue be related? > > > > > > > > > > Good point. > > > > > > > > > > Breno, if bisecting is too much trouble, would you mind testing the commits > > > > > c76f3c4364fe523cd2782269eab92529c86217aa > > > > > and > > > > > c7991b44d7b44f9270dec63acd0b2965d29aab43 > > > > > and telling us if this reproduces? 
> > > > > > > > That's definitely worth doing, but we should be careful not to confuse > > > > the "non-linear address" from the warning (which refers to virtual > > > > addresses that lie outside of the linear mapping of memory, e.g. in the > > > > vmalloc space) and "non-linear SKBs" which refer to SKBs with fragment > > > > pages. > > > > > > I've tested both commits above, and I see the crash on both commits > > > above, thus, the problem reproduces in both cases. The only difference > > > I noted is the fact that I haven't seen the warning before the crash. > > > > > > > > > Log against c76f3c4364fe ("vhost/vsock: Avoid allocating > > > arbitrarily-sized SKBs") > > > > > > Unable to handle kernel paging request at virtual address 0000001fc0000048 > > > Mem abort info: > > > ESR = 0x0000000096000005 > > > EC = 0x25: DABT (current EL), IL = 32 bits > > > SET = 0, FnV = 0 > > > EA = 0, S1PTW = 0 > > > FSC = 0x05: level 1 translation fault > > > Data abort info: > > > ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 > > > CM = 0, WnR = 0, TnD = 0, TagAccess = 0 > > > GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 > > > user pgtable: 64k pages, 48-bit VAs, pgdp=0000000cdcf2da00 > > > [0000001fc0000048] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 > > > Internal error: Oops: 0000000096000005 [#1] SMP > > > Modules linked in: vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic unix_diag vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf nvidia_c > > > CPU: 34 UID: 0 PID: 1727297 Comm: stress-ng-dev Kdump: loaded Not tainted 6.16.0-rc6-upstream-00027-gc76f3c4364fe #19 NONE > > > pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) > > > pc : kfree+0x48/0x2a8 > > > lr : vhost_dev_cleanup+0x138/0x2b8 [vhost] > > > sp : ffff80013a0cfcd0 > > > x29: ffff80013a0cfcd0 x28: ffff0008fd0b6240 x27: 0000000000000000 > > > x26: 0000000000000000 x25: 0000000000000000 x24: 
0000000000000000 > > > x23: 00000000040e001f x22: ffffffffffffffff x21: ffff00014f1d4ac0 > > > x20: 0000000000000001 x19: ffff00014f1d0000 x18: 0000000000000000 > > > x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 > > > x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001 > > > x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000 > > > x8 : 0000001fc0000040 x7 : 0000000000000000 x6 : 0000000000000000 > > > x5 : ffff000141931840 x4 : 0000000000000000 x3 : 0000000000000008 > > > x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 0000000000010000 > > > Call trace: > > > kfree+0x48/0x2a8 (P) > > > vhost_dev_cleanup+0x138/0x2b8 [vhost] > > > vhost_net_release+0xa0/0x1a8 [vhost_net] > > > > But here is the vhost_net, so I'm confused now. > > Do you see the same (vhost_net) also on 9798752 ("Add linux-next > > specific files for 20250721") ? > > I will need to reproduce, but, looking at my logs, I see the following > against: c76f3c4364fe ("vhost/vsock: Avoid allocating arbitrarily-sized SKBs"). > The logs are a bit intermixed, probably there were multiple CPUs hitting > the same code path. 
> > virt_to_phys used for non-linear address: 000000001b662678 (0xffe61984a460) > WARNING: CPU: 15 PID: 112846 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x80/0xa8 > Modules linked in: vhost_vsock(E) vhost(E) vhost_iotlb(E) ghes_edac(E) tls(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) ipmi_ssif(E) ipmi_devintf(E) ipmi_msghandler(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) > CPU: 15 UID: 0 PID: 112846 Comm: stress-ng-dev Kdump: loaded Tainted: G W E N 6.16.0-rc6-upstream-00027-gc76f3c4364fe #16 PREEMPT(none) > Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST > pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) > pc : __virt_to_phys+0x80/0xa8 > lr : __virt_to_phys+0x7c/0xa8 > sp : ffff8001184d7a30 > x29: ffff8001184d7a30 x28: 00000000000045d8 x27: 1fffe000e7e88014 > x26: 1fffe000e7e888f7 x25: ffff0007e578bf00 x24: 1fffe000e7e88013 > x23: 0000000000000000 x22: 0000ffe61984a460 x21: ffff00073f440098 > x20: ffffff1000080000 x19: 0000ffe61984a460 x18: 0000000000000002 > x17: 6666783028203837 x16: 3632363662313030 x15: 0000000000000001 > x14: 1fffe006d52e90f2 x13: 0000000000000000 x12: 0000000000000000 > x11: ffff6006d52e90f3 x10: 0000000000000002 x9 : cfc659a21c727d00 > x8 : ffff800083c19000 x7 : 0000000000000001 x6 : 0000000000000001 > x5 : ffff8001184d7398 x4 : ffff800084866d60 x3 : ffff8000805fdd94 > x2 : 0000000000000001 x1 : 0000000000000004 x0 : 000000000000004b > Call trace: > __virt_to_phys+0x80/0xa8 (P) > kfree+0xac/0x4b0 > vhost_dev_cleanup+0x484/0x8b0 [vhost] > vhost_vsock_dev_release+0x2f4/0x358 [vhost_vsock] > __fput+0x2b4/0x608 > fput_close_sync+0xe8/0x1e0 > __arm64_sys_close+0x84/0xd0 > invoke_syscall+0x8c/0x208 > do_el0_svc+0x128/0x1a0 > el0_svc+0x58/0x160 > el0t_64_sync_handler+0x78/0x108 > el0t_64_sync+0x198/0x1a0 > irq event stamp: 0 > hardirqs last enabled at (0): [<0000000000000000>] 0x0 > hardirqs last disabled at (0): [<ffff8000801d876c>] 
copy_process+0xd5c/0x29f8 > softirqs last enabled at (0): [<ffff8000801d879c>] copy_process+0xd8c/0x29f8 > softirqs last disabled at (0): [<0000000000000000>] 0x0 > ---[ end trace 0000000000000000 ]--- > Unable to handle kernel paging request at virtual address 0000040053791288 > ------------[ cut here ]------------ > lr : kfree+0xac/0x4b0 > virt_to_phys used for non-linear address: 00000000290839fd (0x2500000000) > WARNING: CPU: 41 PID: 112845 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x80/0xa8 > Modules linked in: vhost_vsock(E) vhost(E) vhost_iotlb(E) ghes_edac(E) tls(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) ipmi_ssif(E) ipmi_devintf(E) ipmi_msghandler(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E) > CPU: 41 UID: 0 PID: 112845 Comm: stress-ng-dev Kdump: loaded Tainted: G W E N 6.16.0-rc6-upstream-00027-gc76f3c4364fe #16 PREEMPT(none) > x23: 0000000000000001 > Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST > pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) > pc : __virt_to_phys+0x80/0xa8 > lr : __virt_to_phys+0x7c/0xa8 > sp : ffff8001a8277a30 > x29: ffff8001a8277a30 x28: 00000000000045d8 x27: 1fffe000e7e3c014 > x26: 1fffe000e7e3c8f7 x25: ffff0007bcff0000 x24: 1fffe000e7e3c013 > x23: 0000000000000000 x22: 0000002500000000 x21: ffff00073f1e0098 > x20: ffffff1000080000 x19: 0000002500000000 x18: 0000000000000004 > x17: 00000000ffffffff x16: 0000000000000001 x15: 0000000000000001 > x14: 1fffe006d53920f2 x13: 0000000000000000 x12: 0000000000000000 > sp : ffff8001184d7a50 > x29: ffff8001184d7a60 x28: 00000000000045d8 x27: 1fffe000e7e88014 > x26: 1fffe000e7e888f7 x25: ffff0007e578bf00 x24: 1fffe000e7e88013 > x23: 0000000000000000 x22: 0000ffe61984a460 x21: ffff00073f440098 > x20: 0000040053791280 x19: ffff80000d0b8bbc x18: 0000000000000002 > x17: 6666783028203837 x16: 3632363662313030 x15: 0000000000000001 > x22: 0000ffe6199a459d > x14: 1fffe006d52e90f2 > x21: 
ffff00073f7f0098 > x20: ffffff1000080000 x19: 0000ffe6199a459d x18: 0000000000000004 > x17: 54455320203b2d2c x16: 0000000000000011 x15: 0000000000000001 > x14: 1fffe006d52bb8f2 x13: 0000000000000000 x12: 0000000000000000 > x11: ffff6006d52bb8f3 x10: dfff800000000000 x9 : 77521a2bd3e0be00 > x8 : ffff800083c19000 x7 : 0000000000000000 x6 : ffff80008036bc2c > x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffff8000805fdd94 > x2 : 0000000000000001 x1 : 0000000000000004 x0 : 000000000000004b > Call trace: > __virt_to_phys+0x80/0xa8 (P) > kfree+0xac/0x4b0 > vhost_dev_cleanup+0x484/0x8b0 [vhost] > vhost_vsock_dev_release+0x2f4/0x358 [vhost_vsock] > __fput+0x2b4/0x608 > x11: ffff6006d53920f3 > fput_close_sync+0xe8/0x1e0 > __arm64_sys_close+0x84/0xd0 > invoke_syscall+0x8c/0x208 > do_el0_svc+0x128/0x1a0 > el0_svc+0x58/0x160 > el0t_64_sync_handler+0x78/0x108 > el0t_64_sync+0x198/0x1a0 > irq event stamp: 0 > x13: 0000000000000000 > hardirqs last enabled at (0): [<0000000000000000>] 0x0 > hardirqs last disabled at (0): [<ffff8000801d876c>] copy_process+0xd5c/0x29f8 > > > > The initial report contained only vhost_vsock traces IIUC, so I'm > > suspecting something in the vhost core. > > Right, it seems we are hitting the same code path, on on both > vhost_vsock and vhost_net. > I've posted a fix here: https://lore.kernel.org/virtualization/20250729073916.80647-1-jasowang@redhat.com/T/#u I think it should address this issue. Thanks ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: vhost: linux-next: crash at vhost_dev_cleanup() 2025-07-29 7:44 ` Jason Wang @ 2025-07-29 9:10 ` Breno Leitao 2025-07-29 9:57 ` Stefano Garzarella 1 sibling, 0 replies; 12+ messages in thread From: Breno Leitao @ 2025-07-29 9:10 UTC (permalink / raw) To: Jason Wang Cc: Stefano Garzarella, Will Deacon, Michael S. Tsirkin, eperezma, linux-arm-kernel, kvm, Stefan Hajnoczi, netdev Hello Jason, On Tue, Jul 29, 2025 at 03:44:49PM +0800, Jason Wang wrote: > On Thu, Jul 24, 2025 at 9:50 PM Breno Leitao <leitao@debian.org> wrote: > > > The initial report contained only vhost_vsock traces IIUC, so I'm > > > suspecting something in the vhost core. > > > > Right, it seems we are hitting the same code path, on on both > > vhost_vsock and vhost_net. > > > > I've posted a fix here: > > https://lore.kernel.org/virtualization/20250729073916.80647-1-jasowang@redhat.com/T/#u > > I think it should address this issue. Yes, it does. I've tested the fix on my machine and was no longer able to reproduce the error. Thanks for the fix, --breno ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: vhost: linux-next: crash at vhost_dev_cleanup()
  2025-07-29 7:44 ` Jason Wang
  2025-07-29 9:10 ` Breno Leitao
@ 2025-07-29 9:57 ` Stefano Garzarella
  1 sibling, 0 replies; 12+ messages in thread
From: Stefano Garzarella @ 2025-07-29 9:57 UTC
To: Jason Wang
Cc: Breno Leitao, Will Deacon, Michael S. Tsirkin, eperezma, linux-arm-kernel, kvm, Stefan Hajnoczi, netdev

On Tue, Jul 29, 2025 at 03:44:49PM +0800, Jason Wang wrote:
>On Thu, Jul 24, 2025 at 9:50 PM Breno Leitao <leitao@debian.org> wrote:
>>
>> On Thu, Jul 24, 2025 at 02:52:08PM +0200, Stefano Garzarella wrote:
>> > On Thu, 24 Jul 2025 at 14:48, Breno Leitao <leitao@debian.org> wrote:
>> > >
>> > > On Thu, Jul 24, 2025 at 09:44:38AM +0100, Will Deacon wrote:
>> > > > > > On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > > > > > >
>> > > > > > > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
>> > > > > > > > Hello,
>> > > > > > > >
>> > > > > > > > I've seen a crash in linux-next for a while on my arm64 server, and
>> > > > > > > > I decided to report.
>> > > > > > > >
>> > > > > > > > While running stress-ng on linux-next, I see the crash below.
>> > > > > > > >
>> > > > > > > > This is happening in a kernel configured with some debug options (KASAN,
>> > > > > > > > LOCKDEP and KMEMLEAK).
>> > > > > > > >
>> > > > > > > > Basically running stress-ng in a loop would crash the host in 15-20
>> > > > > > > > minutes:
>> > > > > > > > # while (true); do stress-ng -r 10 -t 10; done
>> > > > > > > >
>> > > > > > > > From the early warning "virt_to_phys used for non-linear address",
>> > > > > >
>> > > > > > mmm, we recently added nonlinear SKBs support in vhost-vsock [1],
>> > > > > > @Will can this issue be related?
>> > > > >
>> > > > > Good point.
>> > > > >
>> > > > > Breno, if bisecting is too much trouble, would you mind testing the commits
>> > > > > c76f3c4364fe523cd2782269eab92529c86217aa
>> > > > > and
>> > > > > c7991b44d7b44f9270dec63acd0b2965d29aab43
>> > > > > and telling us if this reproduces?
>> > > >
>> > > > That's definitely worth doing, but we should be careful not to confuse
>> > > > the "non-linear address" from the warning (which refers to virtual
>> > > > addresses that lie outside of the linear mapping of memory, e.g. in the
>> > > > vmalloc space) and "non-linear SKBs" which refer to SKBs with fragment
>> > > > pages.
>> > >
>> > > I've tested both commits above, and I see the crash on both commits
>> > > above, thus, the problem reproduces in both cases. The only difference
>> > > I noted is the fact that I haven't seen the warning before the crash.
>> > >
>> > >
>> > > Log against c76f3c4364fe ("vhost/vsock: Avoid allocating
>> > > arbitrarily-sized SKBs")
>> > >
>> > > Unable to handle kernel paging request at virtual address 0000001fc0000048
>> > > Mem abort info:
>> > > ESR = 0x0000000096000005
>> > > EC = 0x25: DABT (current EL), IL = 32 bits
>> > > SET = 0, FnV = 0
>> > > EA = 0, S1PTW = 0
>> > > FSC = 0x05: level 1 translation fault
>> > > Data abort info:
>> > > ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
>> > > CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>> > > GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>> > > user pgtable: 64k pages, 48-bit VAs, pgdp=0000000cdcf2da00
>> > > [0000001fc0000048] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
>> > > Internal error: Oops: 0000000096000005 [#1] SMP
>> > > Modules linked in: vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic unix_diag vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf nvidia_c
>> > > CPU: 34 UID: 0 PID: 1727297 Comm: stress-ng-dev Kdump: loaded Not tainted 6.16.0-rc6-upstream-00027-gc76f3c4364fe #19 NONE
>> > > pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
>> > > pc : kfree+0x48/0x2a8
>> > > lr : vhost_dev_cleanup+0x138/0x2b8 [vhost]
>> > > sp : ffff80013a0cfcd0
>> > > x29: ffff80013a0cfcd0 x28: ffff0008fd0b6240 x27: 0000000000000000
>> > > x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
>> > > x23: 00000000040e001f x22: ffffffffffffffff x21: ffff00014f1d4ac0
>> > > x20: 0000000000000001 x19: ffff00014f1d0000 x18: 0000000000000000
>> > > x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
>> > > x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001
>> > > x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000
>> > > x8 : 0000001fc0000040 x7 : 0000000000000000 x6 : 0000000000000000
>> > > x5 : ffff000141931840 x4 : 0000000000000000 x3 : 0000000000000008
>> > > x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 0000000000010000
>> > > Call trace:
>> > > kfree+0x48/0x2a8 (P)
>> > > vhost_dev_cleanup+0x138/0x2b8 [vhost]
>> > > vhost_net_release+0xa0/0x1a8 [vhost_net]
>> >
>> > But here is the vhost_net, so I'm confused now.
>> > Do you see the same (vhost_net) also on 9798752 ("Add linux-next
>> > specific files for 20250721") ?
>>
>> I will need to reproduce, but, looking at my logs, I see the following
>> against: c76f3c4364fe ("vhost/vsock: Avoid allocating arbitrarily-sized SKBs").
>> The logs are a bit intermixed, probably there were multiple CPUs hitting
>> the same code path.
>>
>> virt_to_phys used for non-linear address: 000000001b662678 (0xffe61984a460)
>> WARNING: CPU: 15 PID: 112846 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x80/0xa8
>> Modules linked in: vhost_vsock(E) vhost(E) vhost_iotlb(E) ghes_edac(E) tls(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) ipmi_ssif(E) ipmi_devintf(E) ipmi_msghandler(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E)
>> CPU: 15 UID: 0 PID: 112846 Comm: stress-ng-dev Kdump: loaded Tainted: G W E N 6.16.0-rc6-upstream-00027-gc76f3c4364fe #16 PREEMPT(none)
>> Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
>> pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
>> pc : __virt_to_phys+0x80/0xa8
>> lr : __virt_to_phys+0x7c/0xa8
>> sp : ffff8001184d7a30
>> x29: ffff8001184d7a30 x28: 00000000000045d8 x27: 1fffe000e7e88014
>> x26: 1fffe000e7e888f7 x25: ffff0007e578bf00 x24: 1fffe000e7e88013
>> x23: 0000000000000000 x22: 0000ffe61984a460 x21: ffff00073f440098
>> x20: ffffff1000080000 x19: 0000ffe61984a460 x18: 0000000000000002
>> x17: 6666783028203837 x16: 3632363662313030 x15: 0000000000000001
>> x14: 1fffe006d52e90f2 x13: 0000000000000000 x12: 0000000000000000
>> x11: ffff6006d52e90f3 x10: 0000000000000002 x9 : cfc659a21c727d00
>> x8 : ffff800083c19000 x7 : 0000000000000001 x6 : 0000000000000001
>> x5 : ffff8001184d7398 x4 : ffff800084866d60 x3 : ffff8000805fdd94
>> x2 : 0000000000000001 x1 : 0000000000000004 x0 : 000000000000004b
>> Call trace:
>> __virt_to_phys+0x80/0xa8 (P)
>> kfree+0xac/0x4b0
>> vhost_dev_cleanup+0x484/0x8b0 [vhost]
>> vhost_vsock_dev_release+0x2f4/0x358 [vhost_vsock]
>> __fput+0x2b4/0x608
>> fput_close_sync+0xe8/0x1e0
>> __arm64_sys_close+0x84/0xd0
>> invoke_syscall+0x8c/0x208
>> do_el0_svc+0x128/0x1a0
>> el0_svc+0x58/0x160
>> el0t_64_sync_handler+0x78/0x108
>> el0t_64_sync+0x198/0x1a0
>> irq event stamp: 0
>> hardirqs last enabled at (0): [<0000000000000000>] 0x0
>> hardirqs last disabled at (0): [<ffff8000801d876c>] copy_process+0xd5c/0x29f8
>> softirqs last enabled at (0): [<ffff8000801d879c>] copy_process+0xd8c/0x29f8
>> softirqs last disabled at (0): [<0000000000000000>] 0x0
>> ---[ end trace 0000000000000000 ]---
>> Unable to handle kernel paging request at virtual address 0000040053791288
>> ------------[ cut here ]------------
>> lr : kfree+0xac/0x4b0
>> virt_to_phys used for non-linear address: 00000000290839fd (0x2500000000)
>> WARNING: CPU: 41 PID: 112845 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x80/0xa8
>> Modules linked in: vhost_vsock(E) vhost(E) vhost_iotlb(E) ghes_edac(E) tls(E) act_gact(E) cls_bpf(E) tcp_diag(E) inet_diag(E) ipmi_ssif(E) ipmi_devintf(E) ipmi_msghandler(E) sch_fq_codel(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) acpi_power_meter(E) loop(E) efivarfs(E) autofs4(E)
>> CPU: 41 UID: 0 PID: 112845 Comm: stress-ng-dev Kdump: loaded Tainted: G W E N 6.16.0-rc6-upstream-00027-gc76f3c4364fe #16 PREEMPT(none)
>> x23: 0000000000000001
>> Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
>> pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
>> pc : __virt_to_phys+0x80/0xa8
>> lr : __virt_to_phys+0x7c/0xa8
>> sp : ffff8001a8277a30
>> x29: ffff8001a8277a30 x28: 00000000000045d8 x27: 1fffe000e7e3c014
>> x26: 1fffe000e7e3c8f7 x25: ffff0007bcff0000 x24: 1fffe000e7e3c013
>> x23: 0000000000000000 x22: 0000002500000000 x21: ffff00073f1e0098
>> x20: ffffff1000080000 x19: 0000002500000000 x18: 0000000000000004
>> x17: 00000000ffffffff x16: 0000000000000001 x15: 0000000000000001
>> x14: 1fffe006d53920f2 x13: 0000000000000000 x12: 0000000000000000
>> sp : ffff8001184d7a50
>> x29: ffff8001184d7a60 x28: 00000000000045d8 x27: 1fffe000e7e88014
>> x26: 1fffe000e7e888f7 x25: ffff0007e578bf00 x24: 1fffe000e7e88013
>> x23: 0000000000000000 x22: 0000ffe61984a460 x21: ffff00073f440098
>> x20: 0000040053791280 x19: ffff80000d0b8bbc x18: 0000000000000002
>> x17: 6666783028203837 x16: 3632363662313030 x15: 0000000000000001
>> x22: 0000ffe6199a459d
>> x14: 1fffe006d52e90f2
>> x21: ffff00073f7f0098
>> x20: ffffff1000080000 x19: 0000ffe6199a459d x18: 0000000000000004
>> x17: 54455320203b2d2c x16: 0000000000000011 x15: 0000000000000001
>> x14: 1fffe006d52bb8f2 x13: 0000000000000000 x12: 0000000000000000
>> x11: ffff6006d52bb8f3 x10: dfff800000000000 x9 : 77521a2bd3e0be00
>> x8 : ffff800083c19000 x7 : 0000000000000000 x6 : ffff80008036bc2c
>> x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffff8000805fdd94
>> x2 : 0000000000000001 x1 : 0000000000000004 x0 : 000000000000004b
>> Call trace:
>> __virt_to_phys+0x80/0xa8 (P)
>> kfree+0xac/0x4b0
>> vhost_dev_cleanup+0x484/0x8b0 [vhost]
>> vhost_vsock_dev_release+0x2f4/0x358 [vhost_vsock]
>> __fput+0x2b4/0x608
>> x11: ffff6006d53920f3
>> fput_close_sync+0xe8/0x1e0
>> __arm64_sys_close+0x84/0xd0
>> invoke_syscall+0x8c/0x208
>> do_el0_svc+0x128/0x1a0
>> el0_svc+0x58/0x160
>> el0t_64_sync_handler+0x78/0x108
>> el0t_64_sync+0x198/0x1a0
>> irq event stamp: 0
>> x13: 0000000000000000
>> hardirqs last enabled at (0): [<0000000000000000>] 0x0
>> hardirqs last disabled at (0): [<ffff8000801d876c>] copy_process+0xd5c/0x29f8
>>
>>
>> > The initial report contained only vhost_vsock traces IIUC, so I'm
>> > suspecting something in the vhost core.
>>
>> Right, it seems we are hitting the same code path, on both
>> vhost_vsock and vhost_net.
>>
>
>I've posted a fix here:
>
>https://lore.kernel.org/virtualization/20250729073916.80647-1-jasowang@redhat.com/T/#u
>
>I think it should address this issue.

Thanks for the fix!
Stefano
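For readers following Will's distinction between a "non-linear address" and a "non-linear SKB", here is a minimal sketch of the address-range check behind the warning. It assumes an arm64 kernel with 48-bit virtual addresses (the oops above reports "48-bit VAs") and the usual layout where the linear map covers [PAGE_OFFSET, PAGE_END). The constants and the helper name `is_linear_map_address` are illustrative and were not posted in this thread.

```python
# Sketch only: classify 64-bit values from the register dumps as inside or
# outside the arm64 linear map.
# Assumption: VA_BITS = 48, PAGE_OFFSET = -(1 << VA_BITS) and
# PAGE_END = -(1 << (VA_BITS - 1)), both taken as unsigned 64-bit values.
VA_BITS = 48
MASK64 = (1 << 64) - 1
PAGE_OFFSET = (-(1 << VA_BITS)) & MASK64      # 0xffff000000000000
PAGE_END = (-(1 << (VA_BITS - 1))) & MASK64   # 0xffff800000000000


def is_linear_map_address(addr: int) -> bool:
    """Return True if addr lies in [PAGE_OFFSET, PAGE_END).

    The subtraction wraps modulo 2**64 so that low (user-range) values
    end up far above the size of the linear map and are rejected.
    """
    return ((addr - PAGE_OFFSET) & MASK64) < (PAGE_END - PAGE_OFFSET)


if __name__ == "__main__":
    # Values copied from the logs quoted above.
    samples = {
        "x22 in first warning": 0x0000FFE61984A460,
        "x22 in second warning": 0x0000002500000000,
        "x21 in first warning": 0xFFFF00073F440098,
        "sp in first warning": 0xFFFF8001184D7A30,
    }
    for name, value in samples.items():
        kind = "linear map" if is_linear_map_address(value) else "outside linear map"
        print(f"{name}: {value:#018x} -> {kind}")
```

Under these assumptions, both values reported by the warning fall outside the linear map, which fits kfree() being handed something that is not a valid slab pointer rather than anything specific to SKB fragments.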
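The "Mem abort info" block in the vhost_net oops can be cross-checked by decoding the ESR value by hand. The sketch below assumes the standard ESR_ELx layout for a data abort (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, with ISV in ISS bit 24, WnR in ISS bit 6 and the fault status code in ISS bits 5:0). The function name `decode_data_abort_esr` is made up for illustration.

```python
# Sketch only: decode the fields of an arm64 ESR value for a data abort.
# Assumption: standard ESR_ELx field positions as described in the lead-in.


def decode_data_abort_esr(esr: int) -> dict:
    """Split an ESR value into the fields the kernel prints in an oops."""
    iss = esr & 0x1FFFFFF              # bits 24:0, instruction specific syndrome
    return {
        "EC": (esr >> 26) & 0x3F,      # exception class
        "IL": (esr >> 25) & 0x1,       # 1 means a 32-bit instruction
        "ISS": iss,
        "ISV": (iss >> 24) & 0x1,      # syndrome valid
        "WnR": (iss >> 6) & 0x1,       # 1 means write, 0 means read
        "FSC": iss & 0x3F,             # fault status code
    }


if __name__ == "__main__":
    # ESR value copied from the oops quoted above.
    fields = decode_data_abort_esr(0x0000000096000005)
    for name, value in fields.items():
        print(f"{name} = {value:#x}")
```

With the value from the log this gives EC = 0x25, IL = 1, WnR = 0 and FSC = 0x05, matching the kernel's own decode: a read from the current EL that hit a level 1 translation fault.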
end of thread, other threads:[~2025-07-29 10:01 UTC | newest]

Thread overview: 12+ messages
2025-07-23 15:04 vhost: linux-next: crash at vhost_dev_cleanup() Breno Leitao
2025-07-23 19:09 ` Michael S. Tsirkin
2025-07-24 7:47 ` Michael S. Tsirkin
2025-07-24 8:14 ` Stefano Garzarella
2025-07-24 8:22 ` Michael S. Tsirkin
2025-07-24 8:44 ` Will Deacon
2025-07-24 12:48 ` Breno Leitao
2025-07-24 12:52 ` Stefano Garzarella
2025-07-24 13:49 ` Breno Leitao
2025-07-29 7:44 ` Jason Wang
2025-07-29 9:10 ` Breno Leitao
2025-07-29 9:57 ` Stefano Garzarella