public inbox for linux-kernel@vger.kernel.org
* [linus:master] [RDMA/iwcm]  aee2424246: WARNING:at_kernel/workqueue.c:#check_flush_dependency
@ 2024-08-16  1:07 kernel test robot
From: kernel test robot @ 2024-08-16  1:07 UTC
  To: Bart Van Assche
  Cc: oe-lkp, lkp, linux-kernel, Leon Romanovsky, Zhu Yanjun,
	Shin'ichiro Kawasaki, linux-rdma, oliver.sang



Hello,

kernel test robot noticed "WARNING:at_kernel/workqueue.c:#check_flush_dependency" on:

commit: aee2424246f9f1dadc33faa78990c1e2eb7826e4 ("RDMA/iwcm: Fix a use-after-free related to destroying CM IDs")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linus/master      5189dafa4cf950e675f02ee04b577dfbbad0d9b1]
[test failed on linux-next/master 61c01d2e181adfba02fe09764f9fca1de2be0dbe]

in testcase: blktests
version: blktests-x86_64-775a058-1_20240702
with the following parameters:

	disk: 1SSD
	test: nvme-group-01
	nvme_trtype: rdma



compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids) with 256G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202408151633.fc01893c-oliver.sang@intel.com


[  125.048981][ T1430] ------------[ cut here ]------------
[  125.056856][ T1430] workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_rdma_reset_ctrl_work [nvme_rdma] is flushing !WQ_MEM_RECLAIM iw_cm_wq:0x0
[ 125.056873][ T1430] WARNING: CPU: 2 PID: 1430 at kernel/workqueue.c:3706 check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9)) 
[  125.085014][ T1430] Modules linked in: siw ib_uverbs nvmet_rdma nvmet nvme_rdma nvme_fabrics rdma_cm iw_cm ib_cm ib_core dimlib dm_multipath btrfs blake2b_generic intel_rapl_msr xor intel_rapl_common zstd_compress intel_uncore_frequency intel_uncore_frequency_common raid6_pq libcrc32c x86_pkg_temp_thermal intel_powerclamp coretemp nvme nvme_core ast t10_pi kvm_intel qat_4xxx drm_shmem_helper mei_me kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 rapl intel_cstate intel_uncore dax_hmem intel_th_gth intel_qat crc64_rocksoft_generic dh_generic intel_th_pci idxd crc8 i2c_i801 crc64_rocksoft mei intel_vsec idxd_bus drm_kms_helper intel_th authenc crc64 i2c_smbus i2c_ismt ipmi_ssif wmi acpi_power_meter ipmi_si acpi_ipmi ipmi_devintf ipmi_msghandler acpi_pad binfmt_misc loop fuse drm dm_mod ip_tables [last unloaded: ib_uverbs]
[  125.176472][ T1430] CPU: 2 PID: 1430 Comm: kworker/u898:26 Not tainted 6.10.0-rc1-00015-gaee2424246f9 #1
[  125.188840][ T1430] Workqueue: nvme-reset-wq nvme_rdma_reset_ctrl_work [nvme_rdma]
[ 125.199152][ T1430] RIP: 0010:check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9)) 
[ 125.207527][ T1430] Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 a8 00 00 00 49 8b 54 24 18 49 8d b5 c0 00 00 00 49 89 e8 48 c7 c7 20 46 08 84 e8 ed 8b f9 ff <0f> 0b e9 93 fd ff ff e8 a1 bf 82 00 e9 80 fd ff ff e8 97 bf 82 00
All code
========
   0:	fa                   	cli    
   1:	48 c1 ea 03          	shr    $0x3,%rdx
   5:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
   9:	0f 85 a8 00 00 00    	jne    0xb7
   f:	49 8b 54 24 18       	mov    0x18(%r12),%rdx
  14:	49 8d b5 c0 00 00 00 	lea    0xc0(%r13),%rsi
  1b:	49 89 e8             	mov    %rbp,%r8
  1e:	48 c7 c7 20 46 08 84 	mov    $0xffffffff84084620,%rdi
  25:	e8 ed 8b f9 ff       	callq  0xfffffffffff98c17
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	e9 93 fd ff ff       	jmpq   0xfffffffffffffdc4
  31:	e8 a1 bf 82 00       	callq  0x82bfd7
  36:	e9 80 fd ff ff       	jmpq   0xfffffffffffffdbb
  3b:	e8 97 bf 82 00       	callq  0x82bfd7

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	e9 93 fd ff ff       	jmpq   0xfffffffffffffd9a
   7:	e8 a1 bf 82 00       	callq  0x82bfad
   c:	e9 80 fd ff ff       	jmpq   0xfffffffffffffd91
  11:	e8 97 bf 82 00       	callq  0x82bfad
[  125.231266][ T1430] RSP: 0018:ffa000001375fb88 EFLAGS: 00010282
[  125.239626][ T1430] RAX: 0000000000000000 RBX: ff11000341233c00 RCX: 0000000000000027
[  125.250304][ T1430] RDX: 0000000000000027 RSI: 0000000000000004 RDI: ff110017fc930b08
[  125.260878][ T1430] RBP: 0000000000000000 R08: 0000000000000001 R09: ffe21c02ff926161
[  125.271664][ T1430] R10: ff110017fc930b0b R11: 0000000000000010 R12: ff1100208d2a4000
[  125.282387][ T1430] R13: ff1100020d87a000 R14: 0000000000000000 R15: ff11000341233c00
[  125.292984][ T1430] FS:  0000000000000000(0000) GS:ff110017fc900000(0000) knlGS:0000000000000000
[  125.304552][ T1430] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  125.313446][ T1430] CR2: 00007fa84066aa1c CR3: 000000407c85a005 CR4: 0000000000f71ef0
[  125.324080][ T1430] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  125.334584][ T1430] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  125.345252][ T1430] PKRU: 55555554
[  125.350876][ T1430] Call Trace:
[  125.356281][ T1430]  <TASK>
[ 125.361285][ T1430] ? __warn (kernel/panic.c:693) 
[ 125.367640][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9)) 
[ 125.375689][ T1430] ? report_bug (lib/bug.c:180 lib/bug.c:219) 
[ 125.382505][ T1430] ? handle_bug (arch/x86/kernel/traps.c:239) 
[ 125.388987][ T1430] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1)) 
[ 125.395831][ T1430] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621) 
[ 125.403125][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9)) 
[ 125.410984][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9)) 
[ 125.418764][ T1430] __flush_workqueue (kernel/workqueue.c:3970) 
[ 125.426021][ T1430] ? __pfx___might_resched (kernel/sched/core.c:10151) 
[ 125.433431][ T1430] ? destroy_cm_id (drivers/infiniband/core/iwcm.c:375) iw_cm
[  125.440844][ T2411] /usr/bin/wget -q --timeout=3600 --tries=1 --local-encoding=UTF-8 http://internal-lkp-server:80/~lkp/cgi-bin/lkp-jobfile-append-var?job_file=/lkp/jobs/scheduled/lkp-spr-2sp1/blktests-1SSD-rdma-nvme-group-01-debian-12-x86_64-20240206.cgz-aee2424246f9-20240809-69442-1dktaed-4.yaml&job_state=running -O /dev/null
[ 125.441209][ T1430] ? __pfx___flush_workqueue (kernel/workqueue.c:3910) 
[  125.441215][ T2411]
[ 125.473900][ T1430] ? _raw_spin_lock_irqsave (arch/x86/include/asm/atomic.h:107 include/linux/atomic/atomic-arch-fallback.h:2170 include/linux/atomic/atomic-instrumented.h:1302 include/asm-generic/qspinlock.h:111 include/linux/spinlock.h:187 include/linux/spinlock_api_smp.h:111 kernel/locking/spinlock.c:162) 
[ 125.473909][ T1430] ? __pfx__raw_spin_lock_irqsave (kernel/locking/spinlock.c:161) 
[  125.480265][ T2411] target ucode: 0x2b0004b1
[ 125.482537][ T1430] _destroy_id (drivers/infiniband/core/cma.c:2044) rdma_cm
[  125.488511][ T2411]
[ 125.495072][ T1430] nvme_rdma_free_queue (drivers/nvme/host/rdma.c:656 drivers/nvme/host/rdma.c:650) nvme_rdma
[  125.500747][ T2411] LKP: stdout: 2876: current_version: 2b0004b1, target_version: 2b0004b1
[ 125.505827][ T1430] nvme_rdma_reset_ctrl_work (drivers/nvme/host/rdma.c:2180) nvme_rdma
[ 125.505831][ T1430] process_one_work (kernel/workqueue.c:3231) 
[  125.508377][ T2411]
[ 125.515122][ T1430] worker_thread (kernel/workqueue.c:3306 kernel/workqueue.c:3393) 
[ 125.515127][ T1430] ? __pfx_worker_thread (kernel/workqueue.c:3339) 
[  125.524642][ T2411] check_nr_cpu
[ 125.531837][ T1430] kthread (kernel/kthread.c:389) 
[  125.537327][ T2411]
[ 125.539864][ T1430] ? __pfx_kthread (kernel/kthread.c:342) 
[  125.545392][ T2411] CPU(s):                               224
[ 125.550628][ T1430] ret_from_fork (arch/x86/kernel/process.c:147) 
[  125.554342][ T2411]
[ 125.558840][ T1430] ? __pfx_kthread (kernel/kthread.c:342) 
[ 125.558844][ T1430] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
[  125.561843][ T2411] On-line CPU(s) list:                  0-223
[  125.566487][ T1430]  </TASK>
[  125.566488][ T1430] ---[ end trace 0000000000000000 ]---



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240815/202408151633.fc01893c-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [linus:master] [RDMA/iwcm] aee2424246: WARNING:at_kernel/workqueue.c:#check_flush_dependency
@ 2024-08-16  4:14 ` Zhu Yanjun
From: Zhu Yanjun @ 2024-08-16  4:14 UTC
  To: kernel test robot, Bart Van Assche
  Cc: oe-lkp, lkp, linux-kernel, Leon Romanovsky,
	Shin'ichiro Kawasaki, linux-rdma


On 2024/8/16 9:07, kernel test robot wrote:
>
> Hello,
>
> kernel test robot noticed "WARNING:at_kernel/workqueue.c:#check_flush_dependency" on:
>
> commit: aee2424246f9f1dadc33faa78990c1e2eb7826e4 ("RDMA/iwcm: Fix a use-after-free related to destroying CM IDs")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> [test failed on linus/master      5189dafa4cf950e675f02ee04b577dfbbad0d9b1]
> [test failed on linux-next/master 61c01d2e181adfba02fe09764f9fca1de2be0dbe]
>
> in testcase: blktests
> version: blktests-x86_64-775a058-1_20240702
> with the following parameters:
>
> 	disk: 1SSD
> 	test: nvme-group-01
> 	nvme_trtype: rdma

Hi Bart, Jason, and Leon,

It seems that this is related to WQ_MEM_RECLAIM.

I am not sure whether adding WQ_MEM_RECLAIM to iw_cm_wq would fix this.

Best Regards,

Zhu Yanjun

-- 
Best Regards,
Yanjun.Zhu



* Re: [linus:master] [RDMA/iwcm] aee2424246: WARNING:at_kernel/workqueue.c:#check_flush_dependency
@ 2024-08-16  5:27   ` Zhu Yanjun
From: Zhu Yanjun @ 2024-08-16  5:27 UTC
  To: kernel test robot, Bart Van Assche
  Cc: oe-lkp, lkp, linux-kernel, Leon Romanovsky,
	Shin'ichiro Kawasaki, linux-rdma


On 2024/8/16 12:14, Zhu Yanjun wrote:
>
> On 2024/8/16 9:07, kernel test robot wrote:
>>
>> Hello,
>>
>> kernel test robot noticed "WARNING:at_kernel/workqueue.c:#check_flush_dependency" on:
>>
>> commit: aee2424246f9f1dadc33faa78990c1e2eb7826e4 ("RDMA/iwcm: Fix a use-after-free related to destroying CM IDs")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> [test failed on linus/master 5189dafa4cf950e675f02ee04b577dfbbad0d9b1]
>> [test failed on linux-next/master 61c01d2e181adfba02fe09764f9fca1de2be0dbe]
>>
>> in testcase: blktests
>> version: blktests-x86_64-775a058-1_20240702
>> with the following parameters:
>>
>>     disk: 1SSD
>>     test: nvme-group-01
>>     nvme_trtype: rdma
>
> Hi Bart, Jason, and Leon,
>
> It seems that this is related to WQ_MEM_RECLAIM.
>
> I am not sure whether adding WQ_MEM_RECLAIM to iw_cm_wq would fix this.

The proposed change is below.

diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c
index 1a6339f3a63f..7e3a55349e10 100644
--- a/drivers/infiniband/core/iwcm.c
+++ b/drivers/infiniband/core/iwcm.c
@@ -1182,7 +1182,7 @@ static int __init iw_cm_init(void)
 	if (ret)
 		return ret;
 
-	iwcm_wq = alloc_ordered_workqueue("iw_cm_wq", 0);
+	iwcm_wq = alloc_ordered_workqueue("iw_cm_wq", WQ_MEM_RECLAIM);
 	if (!iwcm_wq)
 		goto err_alloc;

Zhu Yanjun

>
> Best Regards,
>
> Zhu Yanjun
>
-- 
Best Regards,
Yanjun.Zhu



* Re: [linus:master] [RDMA/iwcm] aee2424246: WARNING:at_kernel/workqueue.c:#check_flush_dependency
@ 2024-08-16 12:49     ` Zhu Yanjun
From: Zhu Yanjun @ 2024-08-16 12:49 UTC
  To: kernel test robot, Bart Van Assche
  Cc: oe-lkp, lkp, linux-kernel, Leon Romanovsky,
	Shin'ichiro Kawasaki, linux-rdma


On 2024/8/16 13:27, Zhu Yanjun wrote:
>
> On 2024/8/16 12:14, Zhu Yanjun wrote:
>>
>> On 2024/8/16 9:07, kernel test robot wrote:
>>>
>>> Hello,
>>>
>>> kernel test robot noticed "WARNING:at_kernel/workqueue.c:#check_flush_dependency" on:
>>>
>>> commit: aee2424246f9f1dadc33faa78990c1e2eb7826e4 ("RDMA/iwcm: Fix a use-after-free related to destroying CM IDs")
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>
>>> [test failed on linus/master 5189dafa4cf950e675f02ee04b577dfbbad0d9b1]
>>> [test failed on linux-next/master 61c01d2e181adfba02fe09764f9fca1de2be0dbe]
>>>
>>> in testcase: blktests
>>> version: blktests-x86_64-775a058-1_20240702
>>> with the following parameters:
>>>
>>>     disk: 1SSD
>>>     test: nvme-group-01
>>>     nvme_trtype: rdma
>>
>> Hi Bart, Jason, and Leon,
>>
>> It seems that this is related to WQ_MEM_RECLAIM.
>>
>> I am not sure whether adding WQ_MEM_RECLAIM to iw_cm_wq would fix this.
>
> The proposed change is below.

Hi kernel test robot,

Could you please test the following patch?

Please let us know the result.

>
> diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c
> index 1a6339f3a63f..7e3a55349e10 100644
> --- a/drivers/infiniband/core/iwcm.c
> +++ b/drivers/infiniband/core/iwcm.c
> @@ -1182,7 +1182,7 @@ static int __init iw_cm_init(void)
>  	if (ret)
>  		return ret;
>  
> -	iwcm_wq = alloc_ordered_workqueue("iw_cm_wq", 0);
> +	iwcm_wq = alloc_ordered_workqueue("iw_cm_wq", WQ_MEM_RECLAIM);
>  	if (!iwcm_wq)
>  		goto err_alloc;

Best Regards,

Zhu Yanjun

>
> Zhu Yanjun
>
>>
>> Best Regards,
>>
>> Zhu Yanjun
>>
>>>
>>>
>>>
>>> compiler: gcc-12
>>> test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480+ 
>>> (Sapphire Rapids) with 256G memory
>>>
>>> (please refer to attached dmesg/kmsg for entire log/backtrace)
>>>
>>>
>>>
>>> If you fix the issue in a separate patch/commit (i.e. not just a new
>>> version of the same patch/commit), kindly add the following tags
>>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>>> | Closes: 
>>> https://lore.kernel.org/oe-lkp/202408151633.fc01893c-oliver.sang@intel.com
>>>
>>>
>>> [  125.048981][ T1430] ------------[ cut here ]------------
>>> [  125.056856][ T1430] workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_rdma_reset_ctrl_work [nvme_rdma] is flushing !WQ_MEM_RECLAIM iw_cm_wq:0x0
>>> [ 125.056873][ T1430] WARNING: CPU: 2 PID: 1430 at kernel/workqueue.c:3706 check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
>>> [  125.085014][ T1430] Modules linked in: siw ib_uverbs nvmet_rdma nvmet nvme_rdma nvme_fabrics rdma_cm iw_cm ib_cm ib_core dimlib dm_multipath btrfs blake2b_generic intel_rapl_msr xor intel_rapl_common zstd_compress intel_uncore_frequency intel_uncore_frequency_common raid6_pq libcrc32c x86_pkg_temp_thermal intel_powerclamp coretemp nvme nvme_core ast t10_pi kvm_intel qat_4xxx drm_shmem_helper mei_me kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 rapl intel_cstate intel_uncore dax_hmem intel_th_gth intel_qat crc64_rocksoft_generic dh_generic intel_th_pci idxd crc8 i2c_i801 crc64_rocksoft mei intel_vsec idxd_bus drm_kms_helper intel_th authenc crc64 i2c_smbus i2c_ismt ipmi_ssif wmi acpi_power_meter ipmi_si acpi_ipmi ipmi_devintf ipmi_msghandler acpi_pad binfmt_misc loop fuse drm dm_mod ip_tables [last unloaded: ib_uverbs]
>>> [  125.176472][ T1430] CPU: 2 PID: 1430 Comm: kworker/u898:26 Not tainted 6.10.0-rc1-00015-gaee2424246f9 #1
>>> [  125.188840][ T1430] Workqueue: nvme-reset-wq nvme_rdma_reset_ctrl_work [nvme_rdma]
>>> [ 125.199152][ T1430] RIP: 0010:check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
>>> [ 125.207527][ T1430] Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 a8 00 00 00 49 8b 54 24 18 49 8d b5 c0 00 00 00 49 89 e8 48 c7 c7 20 46 08 84 e8 ed 8b f9 ff <0f> 0b e9 93 fd ff ff e8 a1 bf 82 00 e9 80 fd ff ff e8 97 bf 82 00
>>> All code
>>> ========
>>>     0:    fa                       cli
>>>     1:    48 c1 ea 03              shr    $0x3,%rdx
>>>     5:    80 3c 02 00              cmpb   $0x0,(%rdx,%rax,1)
>>>     9:    0f 85 a8 00 00 00        jne    0xb7
>>>     f:    49 8b 54 24 18           mov    0x18(%r12),%rdx
>>>    14:    49 8d b5 c0 00 00 00     lea    0xc0(%r13),%rsi
>>>    1b:    49 89 e8                 mov    %rbp,%r8
>>>    1e:    48 c7 c7 20 46 08 84     mov $0xffffffff84084620,%rdi
>>>    25:    e8 ed 8b f9 ff           callq  0xfffffffffff98c17
>>>    2a:*    0f 0b                    ud2            <-- trapping instruction
>>>    2c:    e9 93 fd ff ff           jmpq   0xfffffffffffffdc4
>>>    31:    e8 a1 bf 82 00           callq  0x82bfd7
>>>    36:    e9 80 fd ff ff           jmpq   0xfffffffffffffdbb
>>>    3b:    e8 97 bf 82 00           callq  0x82bfd7
>>>
>>> Code starting with the faulting instruction
>>> ===========================================
>>>     0:    0f 0b                    ud2
>>>     2:    e9 93 fd ff ff           jmpq   0xfffffffffffffd9a
>>>     7:    e8 a1 bf 82 00           callq  0x82bfad
>>>     c:    e9 80 fd ff ff           jmpq   0xfffffffffffffd91
>>>    11:    e8 97 bf 82 00           callq  0x82bfad
>>> [  125.231266][ T1430] RSP: 0018:ffa000001375fb88 EFLAGS: 00010282
>>> [  125.239626][ T1430] RAX: 0000000000000000 RBX: ff11000341233c00 RCX: 0000000000000027
>>> [  125.250304][ T1430] RDX: 0000000000000027 RSI: 0000000000000004 RDI: ff110017fc930b08
>>> [  125.260878][ T1430] RBP: 0000000000000000 R08: 0000000000000001 R09: ffe21c02ff926161
>>> [  125.271664][ T1430] R10: ff110017fc930b0b R11: 0000000000000010 R12: ff1100208d2a4000
>>> [  125.282387][ T1430] R13: ff1100020d87a000 R14: 0000000000000000 R15: ff11000341233c00
>>> [  125.292984][ T1430] FS:  0000000000000000(0000) GS:ff110017fc900000(0000) knlGS:0000000000000000
>>> [  125.304552][ T1430] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  125.313446][ T1430] CR2: 00007fa84066aa1c CR3: 000000407c85a005 CR4: 0000000000f71ef0
>>> [  125.324080][ T1430] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [  125.334584][ T1430] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>>> [  125.345252][ T1430] PKRU: 55555554
>>> [  125.350876][ T1430] Call Trace:
>>> [  125.356281][ T1430]  <TASK>
>>> [ 125.361285][ T1430] ? __warn (kernel/panic.c:693)
>>> [ 125.367640][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
>>> [ 125.375689][ T1430] ? report_bug (lib/bug.c:180 lib/bug.c:219)
>>> [ 125.382505][ T1430] ? handle_bug (arch/x86/kernel/traps.c:239)
>>> [ 125.388987][ T1430] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1))
>>> [ 125.395831][ T1430] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621)
>>> [ 125.403125][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
>>> [ 125.410984][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
>>> [ 125.418764][ T1430] __flush_workqueue (kernel/workqueue.c:3970)
>>> [ 125.426021][ T1430] ? __pfx___might_resched (kernel/sched/core.c:10151)
>>> [ 125.433431][ T1430] ? destroy_cm_id (drivers/infiniband/core/iwcm.c:375) iw_cm
>>> [  125.440844][ T2411] /usr/bin/wget -q --timeout=3600 --tries=1 --local-encoding=UTF-8 http://internal-lkp-server:80/~lkp/cgi-bin/lkp-jobfile-append-var?job_file=/lkp/jobs/scheduled/lkp-spr-2sp1/blktests-1SSD-rdma-nvme-group-01-debian-12-x86_64-20240206.cgz-aee2424246f9-20240809-69442-1dktaed-4.yaml&job_state=running -O /dev/null
>>> [ 125.441209][ T1430] ? __pfx___flush_workqueue (kernel/workqueue.c:3910)
>>> [  125.441215][ T2411]
>>> [ 125.473900][ T1430] ? _raw_spin_lock_irqsave (arch/x86/include/asm/atomic.h:107 include/linux/atomic/atomic-arch-fallback.h:2170 include/linux/atomic/atomic-instrumented.h:1302 include/asm-generic/qspinlock.h:111 include/linux/spinlock.h:187 include/linux/spinlock_api_smp.h:111 kernel/locking/spinlock.c:162)
>>> [ 125.473909][ T1430] ? __pfx__raw_spin_lock_irqsave (kernel/locking/spinlock.c:161)
>>> [  125.480265][ T2411] target ucode: 0x2b0004b1
>>> [ 125.482537][ T1430] _destroy_id (drivers/infiniband/core/cma.c:2044) rdma_cm
>>> [  125.488511][ T2411]
>>> [ 125.495072][ T1430] nvme_rdma_free_queue (drivers/nvme/host/rdma.c:656 drivers/nvme/host/rdma.c:650) nvme_rdma
>>> [  125.500747][ T2411] LKP: stdout: 2876: current_version: 2b0004b1, target_version: 2b0004b1
>>> [ 125.505827][ T1430] nvme_rdma_reset_ctrl_work (drivers/nvme/host/rdma.c:2180) nvme_rdma
>>> [ 125.505831][ T1430] process_one_work (kernel/workqueue.c:3231)
>>> [  125.508377][ T2411]
>>> [ 125.515122][ T1430] worker_thread (kernel/workqueue.c:3306 kernel/workqueue.c:3393)
>>> [ 125.515127][ T1430] ? __pfx_worker_thread (kernel/workqueue.c:3339)
>>> [  125.524642][ T2411] check_nr_cpu
>>> [ 125.531837][ T1430] kthread (kernel/kthread.c:389)
>>> [  125.537327][ T2411]
>>> [ 125.539864][ T1430] ? __pfx_kthread (kernel/kthread.c:342)
>>> [  125.545392][ T2411] CPU(s): 224
>>> [ 125.550628][ T1430] ret_from_fork (arch/x86/kernel/process.c:147)
>>> [  125.554342][ T2411]
>>> [ 125.558840][ T1430] ? __pfx_kthread (kernel/kthread.c:342)
>>> [ 125.558844][ T1430] ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
>>> [  125.561843][ T2411] On-line CPU(s) list: 0-223
>>> [  125.566487][ T1430]  </TASK>
>>> [  125.566488][ T1430] ---[ end trace 0000000000000000 ]---
>>>
>>>
>>>
>>> The kernel config and materials to reproduce are available at:
>>> https://download.01.org/0day-ci/archive/20240815/202408151633.fc01893c-oliver.sang@intel.com 
>>>
>>>
>>>
>>>
-- 
Best Regards,
Yanjun.Zhu


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [linus:master] [RDMA/iwcm] aee2424246: WARNING:at_kernel/workqueue.c:#check_flush_dependency
  2024-08-16  5:27   ` Zhu Yanjun
  2024-08-16 12:49     ` Zhu Yanjun
@ 2024-08-16 17:09     ` Bart Van Assche
  2024-08-17  6:51       ` Zhu Yanjun
  1 sibling, 1 reply; 11+ messages in thread
From: Bart Van Assche @ 2024-08-16 17:09 UTC (permalink / raw)
  To: Zhu Yanjun, kernel test robot
  Cc: oe-lkp, lkp, linux-kernel, Leon Romanovsky,
	Shin'ichiro Kawasaki, linux-rdma

On 8/15/24 10:27 PM, Zhu Yanjun wrote:
> diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c
> index 1a6339f3a63f..7e3a55349e10 100644
> --- a/drivers/infiniband/core/iwcm.c
> +++ b/drivers/infiniband/core/iwcm.c
> @@ -1182,7 +1182,7 @@ static int __init iw_cm_init(void)
>          if (ret)
>                  return ret;
> 
> -       iwcm_wq = alloc_ordered_workqueue("iw_cm_wq", 0);
> +       iwcm_wq = alloc_ordered_workqueue("iw_cm_wq", WQ_MEM_RECLAIM);
>          if (!iwcm_wq)
>                  goto err_alloc;

This change looks good to me. Do you plan to post this as a proper patch?
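The warning in the report comes from the rule that check_flush_dependency() enforces: a work item running on a WQ_MEM_RECLAIM workqueue (here nvme-reset-wq) must not flush a workqueue created without WQ_MEM_RECLAIM (here iw_cm_wq). WQ_MEM_RECLAIM workqueues are guaranteed forward progress under memory pressure through a rescuer thread, and that guarantee is lost while such a work item waits on a workqueue that has no rescuer. The userspace C sketch below models only that core rule, to show why passing WQ_MEM_RECLAIM to alloc_ordered_workqueue() silences the warning. It is not kernel code: struct model_wq, MODEL_WQ_MEM_RECLAIM, model_flush_would_warn() and model_flush() are hypothetical names, the flag value is arbitrary, and the in-kernel check also covers other cases (for example PF_MEMALLOC tasks) that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Model-only flag bit; NOT the value of the kernel's WQ_MEM_RECLAIM. */
#define MODEL_WQ_MEM_RECLAIM (1u << 0)

/* Hypothetical stand-in for struct workqueue_struct: a name and flags. */
struct model_wq {
	const char *name;
	unsigned int flags;
};

/*
 * Simplified model of the core check in check_flush_dependency().
 * current_wq is the workqueue the flushing work item runs on (NULL if the
 * flusher is not a workqueue worker); target_wq is the workqueue being
 * flushed. Returns true when the flush would trigger the warning.
 */
bool model_flush_would_warn(const struct model_wq *current_wq,
			    const struct model_wq *target_wq)
{
	/* A reclaim-safe target has a rescuer, so waiting on it is fine. */
	if (target_wq->flags & MODEL_WQ_MEM_RECLAIM)
		return false;

	/* Not running on a workqueue: this part of the check does not apply. */
	if (current_wq == NULL)
		return false;

	/* Reclaim-safe work waiting on non-reclaim-safe work: warn. */
	return (current_wq->flags & MODEL_WQ_MEM_RECLAIM) != 0;
}

/* Prints a message shaped like the one in the report when the model warns. */
void model_flush(const struct model_wq *current_wq,
		 const struct model_wq *target_wq)
{
	if (model_flush_would_warn(current_wq, target_wq))
		printf("workqueue: WQ_MEM_RECLAIM %s is flushing !WQ_MEM_RECLAIM %s\n",
		       current_wq->name, target_wq->name);
}
```

With nvme-reset-wq carrying MODEL_WQ_MEM_RECLAIM, model_flush_would_warn() returns true while iw_cm_wq has flags 0 and false once iw_cm_wq also carries the flag, which mirrors the one-line change in the diff above.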

Thanks,

Bart.



* Re: [linus:master] [RDMA/iwcm] aee2424246: WARNING:at_kernel/workqueue.c:#check_flush_dependency
  2024-08-16 12:49     ` Zhu Yanjun
@ 2024-08-16 17:10       ` Bart Van Assche
  2024-08-17  8:46         ` Zhu Yanjun
  0 siblings, 1 reply; 11+ messages in thread
From: Bart Van Assche @ 2024-08-16 17:10 UTC (permalink / raw)
  To: Zhu Yanjun, kernel test robot
  Cc: oe-lkp, lkp, linux-kernel, Leon Romanovsky,
	Shin'ichiro Kawasaki, linux-rdma

On 8/16/24 5:49 AM, Zhu Yanjun wrote:
> Hi, kernel test robot
> 
> Please help to make tests with the following commits.
> 
> Please let us know the result.
I don't think that the kernel test robot understands the above request.

Thanks,

Bart.


* Re: [linus:master] [RDMA/iwcm] aee2424246: WARNING:at_kernel/workqueue.c:#check_flush_dependency
  2024-08-16 17:09     ` Bart Van Assche
@ 2024-08-17  6:51       ` Zhu Yanjun
  0 siblings, 0 replies; 11+ messages in thread
From: Zhu Yanjun @ 2024-08-17  6:51 UTC (permalink / raw)
  To: Bart Van Assche, kernel test robot
  Cc: oe-lkp, lkp, linux-kernel, Leon Romanovsky,
	Shin'ichiro Kawasaki, linux-rdma


On 2024/8/17 1:09, Bart Van Assche wrote:
> On 8/15/24 10:27 PM, Zhu Yanjun wrote:
>> diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c
>> index 1a6339f3a63f..7e3a55349e10 100644
>> --- a/drivers/infiniband/core/iwcm.c
>> +++ b/drivers/infiniband/core/iwcm.c
>> @@ -1182,7 +1182,7 @@ static int __init iw_cm_init(void)
>>          if (ret)
>>                  return ret;
>>
>> -       iwcm_wq = alloc_ordered_workqueue("iw_cm_wq", 0);
>> +       iwcm_wq = alloc_ordered_workqueue("iw_cm_wq", WQ_MEM_RECLAIM);
>>          if (!iwcm_wq)
>>                  goto err_alloc;
>
> This change looks good to me. Do you plan to post this as a proper patch?

Hi, Bart

Thanks a lot for your review. I will post the patch ASAP.

Zhu Yanjun


>
> Thanks,
>
> Bart.
>
-- 
Best Regards,
Yanjun.Zhu



* Re: [linus:master] [RDMA/iwcm] aee2424246: WARNING:at_kernel/workqueue.c:#check_flush_dependency
  2024-08-16 17:10       ` Bart Van Assche
@ 2024-08-17  8:46         ` Zhu Yanjun
  2024-08-18  6:24           ` Oliver Sang
  0 siblings, 1 reply; 11+ messages in thread
From: Zhu Yanjun @ 2024-08-17  8:46 UTC (permalink / raw)
  To: Bart Van Assche, kernel test robot
  Cc: oe-lkp, lkp, linux-kernel, Leon Romanovsky,
	Shin'ichiro Kawasaki, linux-rdma


On 2024/8/17 1:10, Bart Van Assche wrote:
> On 8/16/24 5:49 AM, Zhu Yanjun wrote:
>> Hi, kernel test robot
>>
>> Please help to make tests with the following commits.
>>
>> Please let us know the result.
> I don't think that the kernel test robot understands the above request.

Got it. I do not know how to get the test robot to test this patch. ^_^

Following your advice, I have sent a patch to the rdma mailing list. Please review it.

Best Regards,

Zhu Yanjun

>
> Thanks,
>
> Bart.

-- 
Best Regards,
Yanjun.Zhu



* Re: [linus:master] [RDMA/iwcm] aee2424246: WARNING:at_kernel/workqueue.c:#check_flush_dependency
  2024-08-17  8:46         ` Zhu Yanjun
@ 2024-08-18  6:24           ` Oliver Sang
  2024-08-20  1:42             ` Oliver Sang
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Sang @ 2024-08-18  6:24 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Bart Van Assche, oe-lkp, lkp, linux-kernel, Leon Romanovsky,
	Shin'ichiro Kawasaki, linux-rdma, oliver.sang

hi, Yanjun.Zhu,

On Sat, Aug 17, 2024 at 04:46:23PM +0800, Zhu Yanjun wrote:
> 
> On 2024/8/17 1:10, Bart Van Assche wrote:
> > On 8/16/24 5:49 AM, Zhu Yanjun wrote:
> > > Hi, kernel test robot
> > > 
> > > Please help to make tests with the following commits.
> > > 
> > > Please let us know the result.
> > I don't think that the kernel test robot understands the above request.
> 
> Got it. I do not know how to get the test robot to test this patch. ^_^

We can test the patch for you, but not quickly due to resource
constraints. We will let you know the results in one or two days. Thanks.

> 
> Following your advice, I have sent a patch to the rdma mailing list. Please review it.
> 
> Best Regards,
> 
> Zhu Yanjun
> 
> > 
> > Thanks,
> > 
> > Bart.
> 
> -- 
> Best Regards,
> Yanjun.Zhu
> 


* Re: [linus:master] [RDMA/iwcm] aee2424246: WARNING:at_kernel/workqueue.c:#check_flush_dependency
  2024-08-18  6:24           ` Oliver Sang
@ 2024-08-20  1:42             ` Oliver Sang
  2024-08-20  6:50               ` Zhu Yanjun
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Sang @ 2024-08-20  1:42 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Bart Van Assche, oe-lkp, lkp, linux-kernel, Leon Romanovsky,
	Shin'ichiro Kawasaki, linux-rdma, oliver.sang

hi, Zhu Yanjun,

On Sun, Aug 18, 2024 at 02:24:52PM +0800, Oliver Sang wrote:
> hi, Yanjun.Zhu,
> 
> On Sat, Aug 17, 2024 at 04:46:23PM +0800, Zhu Yanjun wrote:
> > 
> > On 2024/8/17 1:10, Bart Van Assche wrote:
> > > On 8/16/24 5:49 AM, Zhu Yanjun wrote:
> > > > Hi, kernel test robot
> > > > 
> > > > Please help to make tests with the following commits.
> > > > 
> > > > Please let us know the result.
> > > I don't think that the kernel test robot understands the above request.
> > 
> > Got it. I do not know how to get the test robot to test this patch. ^_^
> 
> We can test the patch for you, but not quickly due to resource
> constraints. We will let you know the results in one or two days. Thanks.

The WARNING is intermittent in our tests: for aee2424246, it shows up in 6 of 20
runs, as shown in the table below.

"e0cc1e2cd74a66b5252ea674a26" is your fix patch.

We ran it 100 times, and the issue did not show up.

Tested-by: kernel test robot <oliver.sang@intel.com>

a1babdb5b615751e aee2424246f9f1dadc33faa7899 e0cc1e2cd74a66b5252ea674a26
---------------- --------------------------- ---------------------------
       fail:runs  %reproduction    fail:runs  %reproduction    fail:runs
           |             |             |             |             |
           :20          30%           6:20           0%            :100   dmesg.RIP:check_flush_dependency
           :20          30%           6:20           0%            :100   dmesg.WARNING:at_kernel/workqueue.c:#check_flush_dependency
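As a rough sanity check on these numbers (not part of the robot's report), one can treat each run as an independent trial that hits the WARNING with the 30% rate measured for aee2424246 (6 of 20 runs). Under that assumption, the chance that a kernel without the fix would pass all 100 runs anyway is (1 - 0.3)^100. The small C sketch below computes both quantities; reproduction_rate() and prob_no_failures() are hypothetical helper names, and the independence assumption is ours, not the robot's.

```c
#include <assert.h>

/* Fraction of runs that hit the WARNING, e.g. 6 failing runs out of 20. */
double reproduction_rate(int failures, int runs)
{
	return (double)failures / (double)runs;
}

/*
 * Probability of zero failures in n runs, assuming each run fails
 * independently with probability p: (1 - p)^n. A plain loop is used so
 * the sketch does not need to link against libm.
 */
double prob_no_failures(double p, int n)
{
	double q = 1.0;

	for (int i = 0; i < n; i++)
		q *= 1.0 - p;
	return q;
}
```

With the table's figures, prob_no_failures(reproduction_rate(6, 20), 100) comes out at roughly 3e-16, so the clean 100-run result for e0cc1e2cd74a66b5252ea674a26 is hard to explain by luck alone, provided the runs really are close to independent.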


> 
> > 
> > Following your advice, I have sent a patch to the rdma mailing list. Please review it.
> > 
> > Best Regards,
> > 
> > Zhu Yanjun
> > 
> > > 
> > > Thanks,
> > > 
> > > Bart.
> > 
> > -- 
> > Best Regards,
> > Yanjun.Zhu
> > 


* Re: [linus:master] [RDMA/iwcm] aee2424246: WARNING:at_kernel/workqueue.c:#check_flush_dependency
  2024-08-20  1:42             ` Oliver Sang
@ 2024-08-20  6:50               ` Zhu Yanjun
  0 siblings, 0 replies; 11+ messages in thread
From: Zhu Yanjun @ 2024-08-20  6:50 UTC (permalink / raw)
  To: Oliver Sang
  Cc: Bart Van Assche, oe-lkp, lkp, linux-kernel, Leon Romanovsky,
	Shin'ichiro Kawasaki, linux-rdma


On 2024/8/20 9:42, Oliver Sang wrote:
> hi, Zhu Yanjun,
>
> On Sun, Aug 18, 2024 at 02:24:52PM +0800, Oliver Sang wrote:
>> hi, Yanjun.Zhu,
>>
>> On Sat, Aug 17, 2024 at 04:46:23PM +0800, Zhu Yanjun wrote:
>>> On 2024/8/17 1:10, Bart Van Assche wrote:
>>>> On 8/16/24 5:49 AM, Zhu Yanjun wrote:
>>>>> Hi, kernel test robot
>>>>>
>>>>> Please help to make tests with the following commits.
>>>>>
>>>>> Please let us know the result.
>>>> I don't think that the kernel test robot understands the above request.
>>> Got it. I do not know how to get the test robot to test this patch. ^_^
>> We can test the patch for you, but not quickly due to resource
>> constraints. We will let you know the results in one or two days. Thanks.
> The WARNING is intermittent in our tests: for aee2424246, it shows up in 6 of 20
> runs, as shown in the table below.
>
> "e0cc1e2cd74a66b5252ea674a26" is your fix patch.
>
> We ran it 100 times, and the issue did not show up.
>
> Tested-by: kernel test robot <oliver.sang@intel.com>

Thanks a lot for your tests. I will send out the latest patch very soon.

Zhu Yanjun

>
> a1babdb5b615751e aee2424246f9f1dadc33faa7899 e0cc1e2cd74a66b5252ea674a26
> ---------------- --------------------------- ---------------------------
>         fail:runs  %reproduction    fail:runs  %reproduction    fail:runs
>             |             |             |             |             |
>             :20          30%           6:20           0%            :100   dmesg.RIP:check_flush_dependency
>             :20          30%           6:20           0%            :100   dmesg.WARNING:at_kernel/workqueue.c:#check_flush_dependency
>
>
>>> Following your advice, I have sent a patch to the rdma mailing list. Please review it.
>>>
>>> Best Regards,
>>>
>>> Zhu Yanjun
>>>
>>>> Thanks,
>>>>
>>>> Bart.
>>> -- 
>>> Best Regards,
>>> Yanjun.Zhu
>>>
-- 
Best Regards,
Yanjun.Zhu



end of thread, other threads:[~2024-08-20  6:50 UTC | newest]

Thread overview: 11+ messages
2024-08-16  1:07 [linus:master] [RDMA/iwcm] aee2424246: WARNING:at_kernel/workqueue.c:#check_flush_dependency kernel test robot
2024-08-16  4:14 ` Zhu Yanjun
2024-08-16  5:27   ` Zhu Yanjun
2024-08-16 12:49     ` Zhu Yanjun
2024-08-16 17:10       ` Bart Van Assche
2024-08-17  8:46         ` Zhu Yanjun
2024-08-18  6:24           ` Oliver Sang
2024-08-20  1:42             ` Oliver Sang
2024-08-20  6:50               ` Zhu Yanjun
2024-08-16 17:09     ` Bart Van Assche
2024-08-17  6:51       ` Zhu Yanjun
