* [ISSUE] `cxl destory-region region0` causes kernel panic when cxl memory is occupied
From: Cao, Quanquan/曹 全全 @ 2023-11-03 11:11 UTC
To: Dave Jiang, vishal.l.verma; +Cc: linux-cxl, nvdimm
Hi guys,
I am writing to report an issue I encountered while executing
'cxl destroy-region region0': it causes a kernel panic when the CXL
memory is occupied. Below is a detailed description of the problem along
with the relevant test steps for reference.
Problem Description:
After 'create-region', if the CXL memory is occupied by a script and
'disable-region' and 'destroy-region' are then run without first running
`daxctl offline-memory`, the result is a kernel panic.
I investigated this a little. The panic is triggered while the region
decode is being reset in preparation for removal, within the
"destroy_region()" function in cxl/region.c. When the value of
"/sys/bus/cxl/devices/root0/decoder0.0/region0/commit" is changed from 1
to 0, the driver code is invoked to reset the region decode, which in
turn leads to a kernel panic:
[ 397.898809] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[ 397.908416] systemd[1]: segfault at 0 ip 0000000000000000 sp 00007ffcdc242520 error 14 in systemd[55555aef50)
[ 397.910578] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[ 397.920233] systemd[1]: segfault at 0 ip 0000000000000000 sp 00007ffcdc2416a0 error 14 in systemd[55555aef50)
[ 397.922309] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[ 397.933175] systemd[1]: segfault at 0 ip 0000000000000000 sp 00007ffcdc240820 error 14 in systemd[55555aef50)
[ 397.935553] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[ 397.945611] systemd[1]: segfault at 0 ip 0000000000000000 sp 00007ffcdc23f9a0 error 14 in systemd[55555aef50)
[ 397.947751] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[ 400.474068] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 400.474583] CPU: 2 PID: 1 Comm: systemd Tainted: G O N 6.6.0-rc6+ #1
[ 400.474583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qem4
[ 400.474583] Call Trace:
[ 400.474583] <TASK>
[ 400.474583] dump_stack_lvl+0x43/0x60
[ 400.474583] panic+0x32a/0x340
[ 400.474583] ? _raw_spin_unlock+0x15/0x30
[ 400.474583] do_exit+0x9a1/0xb30
[ 400.474583] do_group_exit+0x2d/0x80
[ 400.474583] get_signal+0x9c7/0xa00
[ 400.474583] arch_do_signal_or_restart+0x3a/0x280
[ 400.474583] exit_to_user_mode_prepare+0x192/0x1f0
[ 400.474583] irqentry_exit_to_user_mode+0x5/0x30
[ 400.474583] asm_exc_page_fault+0x22/0x30
[ 400.474583] RIP: 0033:0x0
[ 400.474583] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[ 400.474583] RSP: 002b:00007ffcdc1579a0 EFLAGS: 00000207
[ 400.474583] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00007fb10db2796d
[ 400.474583] RDX: 00007fb10db2796d RSI: 00000000ffffffff RDI: 00007ffcdc157c70
[ 400.474583] RBP: 000000000000000b R08: 0000000000000000 R09: 0000000000000000
[ 400.474583] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffcdc94dce8
[ 400.474583] R13: 00007ffcdc94dce0 R14: 00000000000004bb R15: 000000000000005d
[ 400.474583] </TASK>
[ 400.474583] Kernel Offset: 0x20000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffff)
[ 400.474583] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
According to the panic message, the systemd process (PID 1) hit a
segmentation fault (segfault) and was killed, and killing init is what
made the kernel panic.
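To make the link between "exitcode=0x0000000b" and a segfault explicit, here is a minimal Python sketch that decodes the exit code from the panic line. It assumes the exit code follows the usual wait-status convention, where the low 7 bits hold the number of the terminating signal; the function name `decode_init_exit` is made up for illustration and is not part of any tool.

```python
import re
import signal

# Matches the panic line even when the mailer wrapped it before "exitcode=".
PANIC_RE = re.compile(r"Attempted to kill init!\s+exitcode=(0x[0-9a-fA-F]+)")


def decode_init_exit(log_text):
    """Return the name of the signal that killed init, or None.

    Assumption: the low 7 bits of the exit code are the signal number
    (wait-status convention). Signal numbering is that of the host OS.
    """
    match = PANIC_RE.search(log_text)
    if match is None:
        return None
    signo = int(match.group(1), 16) & 0x7F
    if signo == 0:
        return None  # init exited normally rather than being killed
    return signal.Signals(signo).name
```

On Linux, signal 11 (0x0b) is SIGSEGV, which matches the repeated "systemd[1]: segfault at 0" lines that precede the panic.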
Test Example:
1. echo online_movable > /sys/devices/system/memory/auto_online_blocks
2. cxl create-region -t ram -d decoder0.0 -m mem0
3. python consumemem.py <------execute script
4. cxl disable-region region0
5. cxl destroy-region region0 <------kernel panic !!!
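The consumemem.py script itself is not included in this report. As a purely hypothetical stand-in, a memory consumer for step 3 could look roughly like the sketch below; the function name, the default page size and the allocation strategy are all assumptions, not the actual script.

```python
def consume(n_bytes, page_size=4096):
    """Allocate n_bytes and write one byte per page.

    Writing to every page forces the kernel to back the buffer with
    physical memory instead of leaving it as untouched zero pages.
    """
    buf = bytearray(n_bytes)
    for offset in range(0, n_bytes, page_size):
        buf[offset] = 1
    return buf  # the caller must keep a reference so the memory stays occupied
```

Whether the touched pages actually land on the CXL-backed memory depends on the NUMA placement policy of the process, so a real test would probably need to bind the allocation to the CXL node (for example with numactl).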
Thank you very much for taking the time to look into this issue. Looking
forward to your response.
Best regards,
Quanquan Cao
caoqq@fujitsu.com
* Re: [ISSUE] `cxl destory-region region0` causes kernel panic when cxl memory is occupied
From: Dave Jiang @ 2023-11-03 15:37 UTC
To: Cao, Quanquan/曹 全全, vishal.l.verma; +Cc: linux-cxl, nvdimm
On 11/3/23 04:11, Cao, Quanquan/曹 全全 wrote:
> Hi guys,
>
> I am writing to report an issue I encountered while executing 'cxl destroy-region region0': it causes a kernel panic when the CXL memory is occupied. Below is a detailed description of the problem along with the relevant test steps for reference.
>
> Problem Description:
>
> After 'create-region', if the CXL memory is occupied by a script and 'disable-region' and 'destroy-region' are then run without first running `daxctl offline-memory`, the result is a kernel panic.
Hi Quanquan,
This NDCTL change is supposed to prevent you from doing that:
https://lore.kernel.org/linux-cxl/169878724592.82931.11180459815481606425.stgit@djiang5-mobl3/
Otherwise this behavior is expected. If you don't offline the memory and forcibly rip away the regions, you get to deal with the consequences.
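As an illustration of the ordering described here (offline the memory first, then tear the region down), a minimal Python sketch might look like the following. The helper name `teardown_region` and the dax device name passed to it (for example dax0.0) are assumptions for illustration; the three CLI commands are the ones named in this thread.

```python
import subprocess


def teardown_region(region, dax_dev, run=subprocess.run):
    """Offline the dax-backed memory, then disable and destroy the region.

    Each step runs with check=True, so a failing step (for example an
    offline that cannot complete) raises and the later steps never run.
    """
    steps = [
        ["daxctl", "offline-memory", dax_dev],
        ["cxl", "disable-region", region],
        ["cxl", "destroy-region", region],
    ]
    for cmd in steps:
        run(cmd, check=True)
    return steps
```

Calling `teardown_region("region0", "dax0.0")` would run the three commands in order. The `run` parameter exists only so the sequence can be exercised without CXL hardware by passing in a stub.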