Linux CXL
* Internal error: Oops: 0000000096000044 [#11]  SMP
@ 2025-05-21  8:39 Itaru Kitayama
  2025-05-21 15:31 ` Dave Jiang
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Itaru Kitayama @ 2025-05-21  8:39 UTC (permalink / raw)
  To: linux-cxl

Hi,
On arm64/virt QEMU, the cxl/next (as of today) kernel prints out Internal errors:

[   80.968299] [     T48] Internal error: Oops: 0000000096000044 [#11] SMP
[   80.989250] [     T48] Modules linked in: cxl_mock_mem(O) cfg80211 rfkill cxl_test(O) cxl_mem(O) cxl_pmem(O) cxl_acpi(O) cxl_port(O) cxl_mock(O) libnvdimm encrypted_keys trusted caam_jr caam asn1_encoder caamhash_desc caamalg_desc error crypto_engine authenc libdes fuse drm backlight ip_tables x_tables sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 cxl_core(O) fwctl btrfs blake2b_generic xor xor_neon raid6_pq zstd_compress ipv6
[   80.992210] [     T48] CPU: 1 UID: 0 PID: 48 Comm: kworker/u8:2 Tainted: G      D    O        6.15.0-rc4-00040-g128ad8fa385b #40 PREEMPT
[   80.992791] [     T48] Tainted: [D]=DIE, [O]=OOT_MODULE
[   80.993039] [     T48] Hardware name: QEMU QEMU Virtual Machine, BIOS 2025.02-3ubuntu2 04/04/2025
[   80.993400] [     T48] Workqueue: async async_run_entry_fn
[   80.994718] [     T48] pstate: 61402005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[   80.995329] [     T48] pc : cxl_mock_mbox_send+0xec/0x12c0 [cxl_mock_mem]
[   80.995691] [     T48] lr : cxl_internal_send_cmd+0x40/0x104 [cxl_core]
[   80.996189] [     T48] sp : ffff800080d0b9f0
[   80.996380] [     T48] x29: ffff800080d0ba70 x28: fff0000008dd2410 x27: fff00000088fb390
[   80.996714] [     T48] x26: ffff800080d0bb07 x25: 0000000000000100 x24: 0000000000000003
[   80.997135] [     T48] x23: 0000000000000020 x22: fff0000008dd2410 x21: 0000000000000002
[   80.998119] [     T48] x20: fff00000088fb080 x19: ffff800080d0bb08 x18: 00000000ffffffff
[   80.998419] [     T48] x17: 0000000000000000 x16: ffffa8d169128748 x15: ffff800080d0b5ad
[   80.999243] [     T48] x14: ffff800080d0b400 x13: ffff800080d0b5b8 x12: fff000006f7a0000
[   81.000519] [     T48] x11: 0000000000000058 x10: 0000000000000018 x9 : fff000006f7a0000
[   81.001337] [     T48] x8 : ffff800080d0bb48 x7 : fff0000074fa0000 x6 : fff0000074fa0000
[   81.002497] [     T48] x5 : fff000007f937508 x4 : 0000000000000001 x3 : 0000000000001000
[   81.003993] [     T48] x2 : 0000000000001000 x1 : 0000000000000000 x0 : 0000000000000088
[   81.004223] [     T48] Call trace:
[   81.004795] [     T48]  cxl_mock_mbox_send+0xec/0x12c0 [cxl_mock_mem] (P)
[   81.005136] [     T48]  cxl_internal_send_cmd+0x40/0x104 [cxl_core]
[   81.005520] [     T48]  cxl_mem_get_records_log+0xbc/0x198 [cxl_core]
[   81.006042] [     T48]  cxl_mem_get_event_records+0xb0/0xc0 [cxl_core]
[   81.006246] [     T48]  cxl_mock_mem_probe+0x568/0x6f0 [cxl_mock_mem]
[   81.006417] [     T48]  platform_probe+0x68/0xd8
[   81.008340] [     T48]  really_probe+0xc0/0x39c
[   81.008885] [     T48]  __driver_probe_device+0xd0/0x14c
[   81.009539] [     T48]  driver_probe_device+0x3c/0x120
[   81.010239] [     T48]  __driver_attach_async_helper+0x50/0xec
[   81.011130] [     T48]  async_run_entry_fn+0x34/0x14c
[   81.011276] [     T48]  process_one_work+0x148/0x284
[   81.011420] [     T48]  worker_thread+0x2c4/0x3e0
[   81.011552] [     T48]  kthread+0x12c/0x204
[   81.011693] [     T48]  ret_from_fork+0x10/0x20
[   81.011840] [     T48] Code: 54001b28 a90c6bf9 52801100 f9400a61 (a9007c3f)
[   81.013772] [     T48] ---[ end trace 0000000000000000 ]---
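The "Oops" code above is the raw arm64 ESR_ELx syndrome value, so it can be decoded field by field. A minimal sketch, assuming the standard ESR_ELx bit layout from the Arm ARM (the variable names are just illustrative):

```shell
# Decode the arm64 ESR value printed as the Oops code.
# Bit positions follow the architectural ESR_ELx layout.
esr=0x96000044
ec=$(( (esr >> 26) & 0x3f ))   # EC: exception class
wnr=$(( (esr >> 6) & 1 ))      # WnR: 1 = fault was on a write
dfsc=$(( esr & 0x3f ))         # DFSC: data fault status code
printf 'EC=0x%x WnR=%d DFSC=0x%x\n' "$ec" "$wnr" "$dfsc"
```

That decodes to EC 0x25 (data abort from the current EL), WnR 1 (a write), and DFSC 0x04 (level-0 translation fault), i.e. a store to an unmapped address; with x1 = 0 in the register dump and the trapped instruction (a9007c3f) appearing to store through x1, it looks like a NULL-pointer write.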

How serious is this?

Thanks,
Itaru.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Internal error: Oops: 0000000096000044 [#11] SMP
  2025-05-21  8:39 Internal error: Oops: 0000000096000044 [#11] SMP Itaru Kitayama
@ 2025-05-21 15:31 ` Dave Jiang
  2025-05-21 20:38   ` Itaru Kitayama
  2025-05-21 15:33 ` Alison Schofield
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Dave Jiang @ 2025-05-21 15:31 UTC (permalink / raw)
  To: Itaru Kitayama, linux-cxl



On 5/21/25 1:39 AM, Itaru Kitayama wrote:
> Hi,
> On arm64/virt QEMU, the cxl/next (as of today) kernel prints out Internal errors:
> 
> [snip]
> [   80.995329] [     T48] pc : cxl_mock_mbox_send+0xec/0x12c0 [cxl_mock_mem]

Can you do this in your kernel tree?
./scripts/faddr2line tools/testing/cxl/test/cxl_mock_mem.ko cxl_mock_mbox_send+0xec/0x12c0

I've not seen this issue on x86 running cxl/next. How consistently can you reproduce this? If it's every time, is it possible for you to do a git bisect on the kernel and see which commit causes it, please? Thanks!
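The requested workflow can be sketched end to end. Everything in the snippet below is an illustrative stand-in: a throwaway repo with five dummy commits, and a `test` on a file playing the role of "build this commit and check whether the boot oopses":

```shell
# Demo of the git-bisect workflow on a throwaway repo; the repo,
# commits, and pass/fail predicate are stand-ins, not the CXL tree.
set -e
tmp=$(mktemp -d) && cd "$tmp" && git init -q
for i in 1 2 3 4 5; do
    echo "$i" > state
    git add state
    git -c user.email=you@example.com -c user.name=you commit -qm "commit $i"
done
git bisect start HEAD HEAD~4        # bad = HEAD, good = first commit
# exit 0 = "good" (no oops); here the stand-in predicate is state < 4
git bisect run sh -c 'test "$(cat state)" -lt 4' >/dev/null
git bisect log | grep 'first bad commit'
```

In the real tree the predicate would be a wrapper script (hypothetical here) that builds the kernel, boots the VM, and exits non-zero when the oops appears, so `git bisect run` can drive the whole search.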

DJ




* Re: Internal error: Oops: 0000000096000044 [#11]  SMP
  2025-05-21  8:39 Internal error: Oops: 0000000096000044 [#11] SMP Itaru Kitayama
  2025-05-21 15:31 ` Dave Jiang
@ 2025-05-21 15:33 ` Alison Schofield
  2025-05-21 15:36 ` Jonathan Cameron
  2025-05-21 15:41 ` Alison Schofield
  3 siblings, 0 replies; 15+ messages in thread
From: Alison Schofield @ 2025-05-21 15:33 UTC (permalink / raw)
  To: Itaru Kitayama; +Cc: linux-cxl

On Wed, May 21, 2025 at 05:39:05PM +0900, Itaru Kitayama wrote:
> Hi,
> On arm64/virt QEMU, the cxl/next (as of today) kernel prints out Internal errors:

How do you repro this?  Is it repeating on each run of cxl-events.sh for
you?  Any repro info is useful. I've seen it once but have not been able
to repro it myself. Since I saw it back in March, I'm thinking it's not new
to cxl/next.



* Re: Internal error: Oops: 0000000096000044 [#11]  SMP
  2025-05-21  8:39 Internal error: Oops: 0000000096000044 [#11] SMP Itaru Kitayama
  2025-05-21 15:31 ` Dave Jiang
  2025-05-21 15:33 ` Alison Schofield
@ 2025-05-21 15:36 ` Jonathan Cameron
  2025-05-21 15:41 ` Alison Schofield
  3 siblings, 0 replies; 15+ messages in thread
From: Jonathan Cameron @ 2025-05-21 15:36 UTC (permalink / raw)
  To: Itaru Kitayama; +Cc: linux-cxl

On Wed, 21 May 2025 17:39:05 +0900
Itaru Kitayama <itaru.kitayama@linux.dev> wrote:

> Hi,
> On arm64/virt QEMU, the cxl/next (as of today) kernel prints out Internal errors:
> 
> [snip]
> 
> How serious is this?
> 

Definitely not encouraging.  I'll see if I can replicate.
I had a suitable x86 config ready and I'm not seeing it there.

Jonathan

> Thanks,
> Itaru.
> 
> 



* Re: Internal error: Oops: 0000000096000044 [#11]  SMP
  2025-05-21  8:39 Internal error: Oops: 0000000096000044 [#11] SMP Itaru Kitayama
                   ` (2 preceding siblings ...)
  2025-05-21 15:36 ` Jonathan Cameron
@ 2025-05-21 15:41 ` Alison Schofield
  3 siblings, 0 replies; 15+ messages in thread
From: Alison Schofield @ 2025-05-21 15:41 UTC (permalink / raw)
  To: Itaru Kitayama; +Cc: linux-cxl

On Wed, May 21, 2025 at 05:39:05PM +0900, Itaru Kitayama wrote:
> Hi,
> On arm64/virt QEMU, the cxl/next (as of today) kernel prints out Internal errors:

Now I'm connecting more dots. You posted this previously here:

https://lore.kernel.org/linux-cxl/49A4B521-AB66-4037-A23D-1D0D7AF0F42F@linux.dev/






* Re: Internal error: Oops: 0000000096000044 [#11] SMP
  2025-05-21 15:31 ` Dave Jiang
@ 2025-05-21 20:38   ` Itaru Kitayama
  2025-05-21 20:46     ` Dave Jiang
  0 siblings, 1 reply; 15+ messages in thread
From: Itaru Kitayama @ 2025-05-21 20:38 UTC (permalink / raw)
  To: Dave Jiang; +Cc: linux-cxl

Dave

> On May 22, 2025, at 0:31, Dave Jiang <dave.jiang@intel.com> wrote:
> 
> 
> 
> On 5/21/25 1:39 AM, Itaru Kitayama wrote:
>> Hi,
>> On arm64/virt QEMU, the cxl/next (as of today) kernel prints out Internal errors:
>> 
>> [snip]
>> [   80.995329] [     T48] pc : cxl_mock_mbox_send+0xec/0x12c0 [cxl_mock_mem]
> 
> Can you do this in your kernel tree?
> ./scripts/faddr2line tools/testing/cxl/test/cxl_mock_mem.ko cxl_mock_mbox_send+0xec/0x12c0

realm@machine-1:~/projects/cxl$ ./scripts/faddr2line tools/testing/cxl/test/cxl_mock_mem.ko cxl_mock_mbox_send+0xec/0x12c0
cxl_mock_mbox_send+0xec/0x12c0:
mock_get_event at /home/realm/projects/cxl/tools/testing/cxl/test/mem.c:277
(inlined by) cxl_mock_mbox_send at /home/realm/projects/cxl/tools/testing/cxl/test/mem.c:1571

> 
> I've not see this issue on x86 running cxl/next. How consistently can you reproduce this? If it's every time, is it possible for you to do a git bisect on the kernel and see which commit causes this please? Thanks!

Fairly reliably (100% of boots, and cxl/fixes did not change this BTW; which branch is seen as stable for you folks?), yes, I should try git bisect.

Itaru.





* Re: Internal error: Oops: 0000000096000044 [#11] SMP
  2025-05-21 20:38   ` Itaru Kitayama
@ 2025-05-21 20:46     ` Dave Jiang
  2025-05-21 23:28       ` Itaru Kitayama
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Jiang @ 2025-05-21 20:46 UTC (permalink / raw)
  To: Itaru Kitayama; +Cc: linux-cxl



On 5/21/25 1:38 PM, Itaru Kitayama wrote:
> Dave
> 
>> On May 22, 2025, at 0:31, Dave Jiang <dave.jiang@intel.com> wrote:
>>
>>
>>
>> On 5/21/25 1:39 AM, Itaru Kitayama wrote:
>>> Hi,
>>> On arm64/virt QEMU, the cxl/next (as of today) kernel prints out Internal errors:
>>>
>>> [snip]
>>> [   80.995329] [     T48] pc : cxl_mock_mbox_send+0xec/0x12c0 [cxl_mock_mem]
>>
>> Can you do this in your kernel tree?
>> ./scripts/faddr2line tools/testing/cxl/test/cxl_mock_mem.ko cxl_mock_mbox_send+0xec/0x12c0
> 
> realm@machine-1:~/projects/cxl$ ./scripts/faddr2line tools/testing/cxl/test/cxl_mock_mem.ko cxl_mock_mbox_send+0xec/0x12c0
> cxl_mock_mbox_send+0xec/0x12c0:
> mock_get_event at /home/realm/projects/cxl/tools/testing/cxl/test/mem.c:277
> (inlined by) cxl_mock_mbox_send at /home/realm/projects/cxl/tools/testing/cxl/test/mem.c:1571
> 
>>
>> I've not seen this issue on x86 running cxl/next. How consistently can you reproduce this? If it's every time, is it possible for you to do a git bisect on the kernel and see which commit causes it, please? Thanks!
> 
>> Fairly reliably (100% of boots, and cxl/fixes did not change this BTW; which branch is seen as stable for you folks?), yes, I should try git bisect.

The current cxl/next is based on 6.15-rc4, which should have everything that was in cxl/fixes. And you do not see this with 6.14-final? git bisect would be very helpful. Thank you!




* Re: Internal error: Oops: 0000000096000044 [#11] SMP
  2025-05-21 20:46     ` Dave Jiang
@ 2025-05-21 23:28       ` Itaru Kitayama
  2025-05-21 23:34         ` Dan Williams
  0 siblings, 1 reply; 15+ messages in thread
From: Itaru Kitayama @ 2025-05-21 23:28 UTC (permalink / raw)
  To: Dave Jiang; +Cc: linux-cxl

Dave et al.,

> On May 22, 2025, at 5:46, Dave Jiang <dave.jiang@intel.com> wrote:
> 
> 
> 
> On 5/21/25 1:38 PM, Itaru Kitayama wrote:
>> Dave
>> 
>>> On May 22, 2025, at 0:31, Dave Jiang <dave.jiang@intel.com> wrote:
>>> 
>>> 
>>> 
>>> On 5/21/25 1:39 AM, Itaru Kitayama wrote:
>>>> Hi,
>>>> On arm64/virt QEMU, the cxl/next (as of today) kernel prints out Internal errors:
>>>> 
>>>> [   80.968299] [     T48] Internal error: Oops: 0000000096000044 [#11]
>>>> SMP
>>>> [   80.989250] [     T48] Modules linked in: cxl_mock_mem(O) cfg80211
>>>> rfkill cxl_test(O) cxl_mem(O) cxl_pmem(O) cxl_acpi(O) cxl_port(O)
>>>> cxl_mock(O) libnvdimm encrypted_keys trusted caam_jr caam asn1_encoder
>>>> caamhash_desc caamalg_desc error crypto_engine authenc libdes fuse drm
>>>> backlight ip_tables x_tables sm3_ce sm3 sha3_ce sha512_ce sha512_arm64
>>>> cxl_core(O) fwctl btrfs blake2b_generic xor xor_neon raid6_pq
>>>> zstd_compress ipv6
>>>> [   80.992210] [     T48] CPU: 1 UID: 0 PID: 48 Comm: kworker/u8:2
>>>> Tainted: G      D    O        6.15.0-rc4-00040-g128ad8fa385b #40 PREEMPT
>>>> [   80.992791] [     T48] Tainted: [D]=DIE, [O]=OOT_MODULE
>>>> [   80.993039] [     T48] Hardware name: QEMU QEMU Virtual Machine, BIOS
>>>> 2025.02-3ubuntu2 04/04/2025
>>>> [   80.993400] [     T48] Workqueue: async async_run_entry_fn
>>>> [   80.994718] [     T48] pstate: 61402005 (nZCv daif +PAN -UAO -TCO
>>>> +DIT -SSBS BTYPE=--)
>>>> [   80.995329] [     T48] pc : cxl_mock_mbox_send+0xec/0x12c0
>>>> [cxl_mock_mem]
>>> 
>>> Can you do this in your kernel tree?
>>> ./scripts/faddr2line tools/testing/cxl/test/cxl_mock_mem.ko cxl_mock_mbox_send+0xec/0x12c0
>> 
>> realm@machine-1:~/projects/cxl$ ./scripts/faddr2line tools/testing/cxl/test/cxl_mock_mem.ko cxl_mock_mbox_send+0xec/0x12c0
>> cxl_mock_mbox_send+0xec/0x12c0:
>> mock_get_event at /home/realm/projects/cxl/tools/testing/cxl/test/mem.c:277
>> (inlined by) cxl_mock_mbox_send at /home/realm/projects/cxl/tools/testing/cxl/test/mem.c:1571
>> 
>>> 
>>> I've not seen this issue on x86 running cxl/next. How consistently can you reproduce this? If it's every time, is it possible for you to do a git bisect on the kernel and see which commit causes this, please? Thanks!
>> 
>> Fairly reliably (100% of boots, and cxl/fixes did not change this; BTW, which branch do you folks consider stable?), yes, I should try git bisect.
> 
> The current cxl/next is based on 6.15-rc4, which should have everything that was in cxl/fixes. And you do not see this with 6.14-final? git bisect would be very helpful. Thank you!
> 
>> 
>> Itaru.
>> 
>>> 
>>> DJ
>>> 
>>>> [   80.995691] [     T48] lr : cxl_internal_send_cmd+0x40/0x104
>>>> [cxl_core]
>>>> [   80.996189] [     T48] sp : ffff800080d0b9f0
>>>> [   80.996380] [     T48] x29: ffff800080d0ba70 x28: fff0000008dd2410
>>>> x27: fff00000088fb390
>>>> [   80.996714] [     T48] x26: ffff800080d0bb07 x25: 0000000000000100
>>>> x24: 0000000000000003
>>>> [   80.997135] [     T48] x23: 0000000000000020 x22: fff0000008dd2410
>>>> x21: 0000000000000002
>>>> [   80.998119] [     T48] x20: fff00000088fb080 x19: ffff800080d0bb08
>>>> x18: 00000000ffffffff
>>>> [   80.998419] [     T48] x17: 0000000000000000 x16: ffffa8d169128748
>>>> x15: ffff800080d0b5ad
>>>> [   80.999243] [     T48] x14: ffff800080d0b400 x13: ffff800080d0b5b8
>>>> x12: fff000006f7a0000
>>>> [   81.000519] [     T48] x11: 0000000000000058 x10: 0000000000000018 x9
>>>> : fff000006f7a0000
>>>> [   81.001337] [     T48] x8 : ffff800080d0bb48 x7 : fff0000074fa0000 x6
>>>> : fff0000074fa0000
>>>> [   81.002497] [     T48] x5 : fff000007f937508 x4 : 0000000000000001 x3
>>>> : 0000000000001000
>>>> [   81.003993] [     T48] x2 : 0000000000001000 x1 : 0000000000000000 x0
>>>> : 0000000000000088
>>>> [   81.004223] [     T48] Call trace:
>>>> [   81.004795] [     T48]  cxl_mock_mbox_send+0xec/0x12c0 [cxl_mock_mem]
>>>> (P)
>>>> [   81.005136] [     T48]  cxl_internal_send_cmd+0x40/0x104 [cxl_core]
>>>> [   81.005520] [     T48]  cxl_mem_get_records_log+0xbc/0x198 [cxl_core]
>>>> [   81.006042] [     T48]  cxl_mem_get_event_records+0xb0/0xc0
>>>> [cxl_core]
>>>> [   81.006246] [     T48]  cxl_mock_mem_probe+0x568/0x6f0 [cxl_mock_mem]
>>>> [   81.006417] [     T48]  platform_probe+0x68/0xd8
>>>> [   81.008340] [     T48]  really_probe+0xc0/0x39c
>>>> [   81.008885] [     T48]  __driver_probe_device+0xd0/0x14c
>>>> [   81.009539] [     T48]  driver_probe_device+0x3c/0x120
>>>> [   81.010239] [     T48]  __driver_attach_async_helper+0x50/0xec
>>>> [   81.011130] [     T48]  async_run_entry_fn+0x34/0x14c
>>>> [   81.011276] [     T48]  process_one_work+0x148/0x284
>>>> [   81.011420] [     T48]  worker_thread+0x2c4/0x3e0
>>>> [   81.011552] [     T48]  kthread+0x12c/0x204
>>>> [   81.011693] [     T48]  ret_from_fork+0x10/0x20
>>>> [   81.011840] [     T48] Code: 54001b28 a90c6bf9 52801100 f9400a61
>>>> (a9007c3f)
>>>> [   81.013772] [     T48] ---[ end trace 0000000000000000 ]---
>>>> 
>>>> How serious is this?
>>>> 
>>>> Thanks,
>>>> Itaru.

Rebuilt the rootfs image and tried today's cxl/next (6.15.0-rc4-00046-g6eed708a5693) again; on boot I no longer see the splats, so it was something I had broken in my dev environment. Sorry about that.

CXL utility commands work reasonably now and I can execute meson test --suite cxl, though most of the tests still fail due to an HPA allocation error, which puzzles me since the resource requests are quite modest.

Itaru.
 




* Re: Internal error: Oops: 0000000096000044 [#11] SMP
  2025-05-21 23:28       ` Itaru Kitayama
@ 2025-05-21 23:34         ` Dan Williams
  2025-05-22 13:56           ` Jonathan Cameron
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Williams @ 2025-05-21 23:34 UTC (permalink / raw)
  To: Itaru Kitayama, Dave Jiang; +Cc: linux-cxl

Itaru Kitayama wrote:
> Dave et al.,
[..]
> Rebuilt the rootfs image and tried today’s cx/next
> (6.15.0-rc4-00046-g6eed708a5693) again to boot now I don’t see the
> splats, so something I was messing my dev environment sorry about
> that.
> 
> CXL utility commands work reasonably now and I can execute meson test
> —suite cxl, while most of them still fails due to the HPA allocation
> error which makes me wonder as the resource requests are quite modest. 

So cxl_test_init() just "hopes" that the top of the system physical
address space is free to use to emulate CXL windows. That might be an
assumption that only works for x86_64, not ARM64. I would double check
that this code in cxl_test_init()

        rc = gen_pool_add(cxl_mock_pool, iomem_resource.end + 1 - SZ_64G,
                          SZ_64G, NUMA_NO_NODE);
        if (rc)
                goto err_gen_pool_add;

...is not setting up CXL Windows that overlap with existing resources in
that range.


* Re: Internal error: Oops: 0000000096000044 [#11] SMP
  2025-05-21 23:34         ` Dan Williams
@ 2025-05-22 13:56           ` Jonathan Cameron
  2025-05-22 18:19             ` Dan Williams
                               ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Jonathan Cameron @ 2025-05-22 13:56 UTC (permalink / raw)
  To: Dan Williams; +Cc: Itaru Kitayama, Dave Jiang, linux-cxl

On Wed, 21 May 2025 16:34:16 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Itaru Kitayama wrote:
> > Dave et al.,  
> [..]
> > Rebuilt the rootfs image and tried today’s cx/next
> > (6.15.0-rc4-00046-g6eed708a5693) again to boot now I don’t see the
> > splats, so something I was messing my dev environment sorry about
> > that.
> > 
> > CXL utility commands work reasonably now and I can execute meson test
> > —suite cxl, while most of them still fails due to the HPA allocation
> > error which makes me wonder as the resource requests are quite modest.   
> 
> So cxl_test_init() just "hopes" that the top of the system physical
> address space is free to use to emulate CXL windows. That might be an
> assumption that only works for x86_64, not ARM64. I would double check
> that this code in cxl_test_init()
> 
>         rc = gen_pool_add(cxl_mock_pool, iomem_resource.end + 1 - SZ_64G,
>                           SZ_64G, NUMA_NO_NODE);
>         if (rc)
>                 goto err_gen_pool_add;
> 
> ...is not setting up CXL Windows that overlap with existing resources in
> that range.
> 

I think there are checks that block use of ranges up there.

The print I'm seeing is:
Hotplug memory [0xfffffff010000000-0xfffffff030000000] exceeds maximum addressable range [0x40000000-0xf80003fffffff]

I think the right answer is to use mhp_get_pluggable_range(true) to check
for limits on the range we can use.

On architectures that don't define arch_get_mappable_range(),
that ends up as (unsigned long)-1, which I think would work,
though there may be other stuff up there.  Maybe min(iomem_resource.end + 1 - SZ_64G,
						     mappable_range.end + 1 - SZ_64G)
or something like that, adapted to avoid wraparound.

I haven't yet sanity-checked that this doesn't break x86, but I think it
should make no difference to the pool location there.


With the patch below, all 11 tests in the ndctl cxl test suite pass for me.

From b287ff2c5ee7fbe507ef8cb61df3e4e156a9773f Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Date: Thu, 22 May 2025 14:20:42 +0100
Subject: [PATCH] cxl_test: Limit location for fake CFMWS to mappable range

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 tools/testing/cxl/test/cxl.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
index 8a5815ca870d..b4e6c7659ac4 100644
--- a/tools/testing/cxl/test/cxl.c
+++ b/tools/testing/cxl/test/cxl.c
@@ -1328,6 +1328,7 @@ static int cxl_mem_init(void)
 static __init int cxl_test_init(void)
 {
 	int rc, i;
+	struct range mappable;
 
 	cxl_acpi_test();
 	cxl_core_test();
@@ -1342,8 +1343,11 @@ static __init int cxl_test_init(void)
 		rc = -ENOMEM;
 		goto err_gen_pool_create;
 	}
+	mappable = mhp_get_pluggable_range(true);
 
-	rc = gen_pool_add(cxl_mock_pool, iomem_resource.end + 1 - SZ_64G,
+	rc = gen_pool_add(cxl_mock_pool,
+			  min(iomem_resource.end + 1 - SZ_64G,
+			      mappable.end + 1 - SZ_64G),
 			  SZ_64G, NUMA_NO_NODE);
 	if (rc)
 		goto err_gen_pool_add;
-- 
2.43.0



* Re: Internal error: Oops: 0000000096000044 [#11] SMP
  2025-05-22 13:56           ` Jonathan Cameron
@ 2025-05-22 18:19             ` Dan Williams
  2025-05-22 21:46             ` Itaru Kitayama
  2025-05-23  5:52             ` Marc Herbert
  2 siblings, 0 replies; 15+ messages in thread
From: Dan Williams @ 2025-05-22 18:19 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams; +Cc: Itaru Kitayama, Dave Jiang, linux-cxl

Jonathan Cameron wrote:
> On Wed, 21 May 2025 16:34:16 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Itaru Kitayama wrote:
> > > Dave et al.,  
> > [..]
> > > Rebuilt the rootfs image and tried today’s cx/next
> > > (6.15.0-rc4-00046-g6eed708a5693) again to boot now I don’t see the
> > > splats, so something I was messing my dev environment sorry about
> > > that.
> > > 
> > > CXL utility commands work reasonably now and I can execute meson test
> > > —suite cxl, while most of them still fails due to the HPA allocation
> > > error which makes me wonder as the resource requests are quite modest.   
> > 
> > So cxl_test_init() just "hopes" that the top of the system physical
> > address space is free to use to emulate CXL windows. That might be an
> > assumption that only works for x86_64, not ARM64. I would double check
> > that this code in cxl_test_init()
> > 
> >         rc = gen_pool_add(cxl_mock_pool, iomem_resource.end + 1 - SZ_64G,
> >                           SZ_64G, NUMA_NO_NODE);
> >         if (rc)
> >                 goto err_gen_pool_add;
> > 
> > ...is not setting up CXL Windows that overlap with existing resources in
> > that range.
> > 
> 
> I think there are checks that block use of ranges up there.
> 
> Print I'm seeing is
> Hotplug memory [0xfffffff010000000-0xfffffff030000000] exceeds maximum addressable range [0x40000000-0xf80003fffffff]
> 
> I think right answer is to use mhp_get_pluggable_range(true); to check
> for limits on the range we can use.
> 
> On architectures that don't define arch_get_mappable_range()
> that ends up the as (unsigned long)-1 which I think would work
> though there may be other stuff up there.  Maybe min(iomem_resource.end + 1 - SZ_64G,
> 						     mappable_range.end + 1 - SZ_64G)
> or something like that adapted to avoid wrap around.
> 
> I haven't yet sanity checked this doesn't break x86 but I think it should
> end up making no difference to the locations on x86.
> 
> 
> With the below - all 11 tests in ndctl cxl test suite pass for me.

The patch arrived whitespace-clobbered, but after fixing that up and adding
#include <linux/memory_hotplug.h>, it works here, thanks!

Reviewed-by: Dan Williams <dan.j.williams@intel.com>


* Re: Internal error: Oops: 0000000096000044 [#11] SMP
  2025-05-22 13:56           ` Jonathan Cameron
  2025-05-22 18:19             ` Dan Williams
@ 2025-05-22 21:46             ` Itaru Kitayama
  2025-05-23  3:28               ` Alison Schofield
  2025-05-23  5:52             ` Marc Herbert
  2 siblings, 1 reply; 15+ messages in thread
From: Itaru Kitayama @ 2025-05-22 21:46 UTC (permalink / raw)
  To: Jonathan Cameron; +Cc: Dan Williams, Dave Jiang, linux-cxl

Hi Jonathan,

> On May 22, 2025, at 22:56, Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> On Wed, 21 May 2025 16:34:16 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
>> Itaru Kitayama wrote:
>>> Dave et al.,  
>> [..]
>>> Rebuilt the rootfs image and tried today’s cx/next
>>> (6.15.0-rc4-00046-g6eed708a5693) again to boot now I don’t see the
>>> splats, so something I was messing my dev environment sorry about
>>> that.
>>> 
>>> CXL utility commands work reasonably now and I can execute meson test
>>> —suite cxl, while most of them still fails due to the HPA allocation
>>> error which makes me wonder as the resource requests are quite modest.   
>> 
>> So cxl_test_init() just "hopes" that the top of the system physical
>> address space is free to use to emulate CXL windows. That might be an
>> assumption that only works for x86_64, not ARM64. I would double check
>> that this code in cxl_test_init()
>> 
>>        rc = gen_pool_add(cxl_mock_pool, iomem_resource.end + 1 - SZ_64G,
>>                          SZ_64G, NUMA_NO_NODE);
>>        if (rc)
>>                goto err_gen_pool_add;
>> 
>> ...is not setting up CXL Windows that overlap with existing resources in
>> that range.
>> 
> 
> I think there are checks that block use of ranges up there.
> 
> Print I'm seeing is
> Hotplug memory [0xfffffff010000000-0xfffffff030000000] exceeds maximum addressable range [0x40000000-0xf80003fffffff]
> 
> I think right answer is to use mhp_get_pluggable_range(true); to check
> for limits on the range we can use.
> 
> On architectures that don't define arch_get_mappable_range()
> that ends up the as (unsigned long)-1 which I think would work
> though there may be other stuff up there.  Maybe min(iomem_resource.end + 1 - SZ_64G,
>     mappable_range.end + 1 - SZ_64G)
> or something like that adapted to avoid wrap around.
> 
> I haven't yet sanity checked this doesn't break x86 but I think it should
> end up making no difference to the locations on x86.
> 
> 
> With the below - all 11 tests in ndctl cxl test suite pass for me.
> 
> From b287ff2c5ee7fbe507ef8cb61df3e4e156a9773f Mon Sep 17 00:00:00 2001
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Date: Thu, 22 May 2025 14:20:42 +0100
> Subject: [PATCH] cxl_test: Limit location for fake CFMWS to mappable range
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
> tools/testing/cxl/test/cxl.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
> index 8a5815ca870d..b4e6c7659ac4 100644
> --- a/tools/testing/cxl/test/cxl.c
> +++ b/tools/testing/cxl/test/cxl.c
> @@ -1328,6 +1328,7 @@ static int cxl_mem_init(void)
> static __init int cxl_test_init(void)
> {
> int rc, i;
> + struct range mappable;
> 
> cxl_acpi_test();
> cxl_core_test();
> @@ -1342,8 +1343,11 @@ static __init int cxl_test_init(void)
> rc = -ENOMEM;
> goto err_gen_pool_create;
> }
> + mappable = mhp_get_pluggable_range(true);
> 
> - rc = gen_pool_add(cxl_mock_pool, iomem_resource.end + 1 - SZ_64G,
> + rc = gen_pool_add(cxl_mock_pool,
> +  min(iomem_resource.end + 1 - SZ_64G,
> +      mappable.end + 1 - SZ_64G),
>  SZ_64G, NUMA_NO_NODE);
> if (rc)
> goto err_gen_pool_add;
> -- 
> 2.43.0
> 

Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com <mailto:itaru.kitayama@fujitsu.com>>

# meson test --suite cxl
ninja: Entering directory `/root/ndctl/build'
[1/82] Generating version.h with a custom command
 1/12 ndctl:cxl / cxl-topology.sh                OK              33.96s
 2/12 ndctl:cxl / cxl-region-sysfs.sh            OK              18.00s
 3/12 ndctl:cxl / cxl-labels.sh                  OK              23.78s
 4/12 ndctl:cxl / cxl-create-region.sh           OK              43.03s
 5/12 ndctl:cxl / cxl-xor-region.sh              OK              19.30s
 6/12 ndctl:cxl / cxl-events.sh                  FAIL             6.40s   exit status 1
>>> LD_LIBRARY_PATH=/root/ndctl/build/daxctl/lib:/root/ndctl/build/cxl/lib:/root/ndctl/build/ndctl/lib MALLOC_PERTURB_=45 TEST_PATH=/root/ndctl/build/test UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MESON_TEST_ITERATION=1 DAXCTL=/root/ndctl/build/daxctl/daxctl NDCTL=/root/ndctl/build/ndctl/ndctl ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 DATA_PATH=/root/ndctl/test /bin/bash /root/ndctl/test/cxl-events.sh

 7/12 ndctl:cxl / cxl-sanitize.sh                OK              14.77s
 8/12 ndctl:cxl / cxl-destroy-region.sh          OK              13.69s
 9/12 ndctl:cxl / cxl-qos-class.sh               OK              14.31s
10/12 ndctl:cxl / cxl-poison.sh                  FAIL             3.46s   exit status 1
>>> LD_LIBRARY_PATH=/root/ndctl/build/daxctl/lib:/root/ndctl/build/cxl/lib:/root/ndctl/build/ndctl/lib MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MALLOC_PERTURB_=80 UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 TEST_PATH=/root/ndctl/build/test MESON_TEST_ITERATION=1 DAXCTL=/root/ndctl/build/daxctl/daxctl NDCTL=/root/ndctl/build/ndctl/ndctl ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 DATA_PATH=/root/ndctl/test /bin/bash /root/ndctl/test/cxl-poison.sh

11/12 ndctl:cxl / cxl-update-firmware.sh         OK              66.23s
12/12 ndctl:cxl / cxl-security.sh                SKIP             0.34s   exit status 77

Ok:                 9
Expected Fail:      0
Fail:               2
Unexpected Pass:    0
Skipped:            1
Timeout:            0

My understanding is that these CXL tests use mock CFMWS, not actual physical memory regions at fixed locations. So I wonder whether it matters to run this set of tests on the "sane" CXL emulation setup (the one run_qemu.sh creates) that the Intel folks are using.

Itaru.


* Re: Internal error: Oops: 0000000096000044 [#11] SMP
  2025-05-22 21:46             ` Itaru Kitayama
@ 2025-05-23  3:28               ` Alison Schofield
  2025-05-23  4:56                 ` Itaru Kitayama
  0 siblings, 1 reply; 15+ messages in thread
From: Alison Schofield @ 2025-05-23  3:28 UTC (permalink / raw)
  To: Itaru Kitayama; +Cc: Jonathan Cameron, Dan Williams, Dave Jiang, linux-cxl

On Fri, May 23, 2025 at 06:46:53AM +0900, Itaru Kitayama wrote:
> Hi Jonathan,
> 
> > On May 22, 2025, at 22:56, Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > 
> > On Wed, 21 May 2025 16:34:16 -0700
> > Dan Williams <dan.j.williams@intel.com> wrote:
> > 
> >> Itaru Kitayama wrote:
> >>> Dave et al.,  
> >> [..]
> >>> Rebuilt the rootfs image and tried today’s cx/next
> >>> (6.15.0-rc4-00046-g6eed708a5693) again to boot now I don’t see the
> >>> splats, so something I was messing my dev environment sorry about
> >>> that.
> >>> 
> >>> CXL utility commands work reasonably now and I can execute meson test
> >>> —suite cxl, while most of them still fails due to the HPA allocation
> >>> error which makes me wonder as the resource requests are quite modest.   
> >> 
> >> So cxl_test_init() just "hopes" that the top of the system physical
> >> address space is free to use to emulate CXL windows. That might be an
> >> assumption that only works for x86_64, not ARM64. I would double check
> >> that this code in cxl_test_init()
> >> 
> >>        rc = gen_pool_add(cxl_mock_pool, iomem_resource.end + 1 - SZ_64G,
> >>                          SZ_64G, NUMA_NO_NODE);
> >>        if (rc)
> >>                goto err_gen_pool_add;
> >> 
> >> ...is not setting up CXL Windows that overlap with existing resources in
> >> that range.
> >> 
> > 
> > I think there are checks that block use of ranges up there.
> > 
> > Print I'm seeing is
> > Hotplug memory [0xfffffff010000000-0xfffffff030000000] exceeds maximum addressable range [0x40000000-0xf80003fffffff]
> > 
> > I think right answer is to use mhp_get_pluggable_range(true); to check
> > for limits on the range we can use.
> > 
> > On architectures that don't define arch_get_mappable_range()
> > that ends up the as (unsigned long)-1 which I think would work
> > though there may be other stuff up there.  Maybe min(iomem_resource.end + 1 - SZ_64G,
> >     mappable_range.end + 1 - SZ_64G)
> > or something like that adapted to avoid wrap around.
> > 
> > I haven't yet sanity checked this doesn't break x86 but I think it should
> > end up making no difference to the locations on x86.
> > 
> > 
> > With the below - all 11 tests in ndctl cxl test suite pass for me.
> > 
> > From b287ff2c5ee7fbe507ef8cb61df3e4e156a9773f Mon Sep 17 00:00:00 2001
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Date: Thu, 22 May 2025 14:20:42 +0100
> > Subject: [PATCH] cxl_test: Limit location for fake CFMWS to mappable range
> > 
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > ---
> > tools/testing/cxl/test/cxl.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
> > index 8a5815ca870d..b4e6c7659ac4 100644
> > --- a/tools/testing/cxl/test/cxl.c
> > +++ b/tools/testing/cxl/test/cxl.c
> > @@ -1328,6 +1328,7 @@ static int cxl_mem_init(void)
> > static __init int cxl_test_init(void)
> > {
> > int rc, i;
> > + struct range mappable;
> > 
> > cxl_acpi_test();
> > cxl_core_test();
> > @@ -1342,8 +1343,11 @@ static __init int cxl_test_init(void)
> > rc = -ENOMEM;
> > goto err_gen_pool_create;
> > }
> > + mappable = mhp_get_pluggable_range(true);
> > 
> > - rc = gen_pool_add(cxl_mock_pool, iomem_resource.end + 1 - SZ_64G,
> > + rc = gen_pool_add(cxl_mock_pool,
> > +  min(iomem_resource.end + 1 - SZ_64G,
> > +      mappable.end + 1 - SZ_64G),
> >  SZ_64G, NUMA_NO_NODE);
> > if (rc)
> > goto err_gen_pool_add;
> > -- 
> > 2.43.0
> > 
> 
> Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com <mailto:itaru.kitayama@fujitsu.com>>
> 
> # meson test --suite cxl
> ninja: Entering directory `/root/ndctl/build'
> [1/82] Generating version.h with a custom command
>  1/12 ndctl:cxl / cxl-topology.sh                OK              33.96s
>  2/12 ndctl:cxl / cxl-region-sysfs.sh            OK              18.00s
>  3/12 ndctl:cxl / cxl-labels.sh                  OK              23.78s
>  4/12 ndctl:cxl / cxl-create-region.sh           OK              43.03s
>  5/12 ndctl:cxl / cxl-xor-region.sh              OK              19.30s
>  6/12 ndctl:cxl / cxl-events.sh                  FAIL             6.40s   exit status 1
> >>> LD_LIBRARY_PATH=/root/ndctl/build/daxctl/lib:/root/ndctl/build/cxl/lib:/root/ndctl/build/ndctl/lib MALLOC_PERTURB_=45 TEST_PATH=/root/ndctl/build/test UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MESON_TEST_ITERATION=1 DAXCTL=/root/ndctl/build/daxctl/daxctl NDCTL=/root/ndctl/build/ndctl/ndctl ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 DATA_PATH=/root/ndctl/test /bin/bash /root/ndctl/test/cxl-events.sh
> 
>  7/12 ndctl:cxl / cxl-sanitize.sh                OK              14.77s
>  8/12 ndctl:cxl / cxl-destroy-region.sh          OK              13.69s
>  9/12 ndctl:cxl / cxl-qos-class.sh               OK              14.31s
> 10/12 ndctl:cxl / cxl-poison.sh                  FAIL             3.46s   exit status 1
> >>> LD_LIBRARY_PATH=/root/ndctl/build/daxctl/lib:/root/ndctl/build/cxl/lib:/root/ndctl/build/ndctl/lib MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MALLOC_PERTURB_=80 UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 TEST_PATH=/root/ndctl/build/test MESON_TEST_ITERATION=1 DAXCTL=/root/ndctl/build/daxctl/daxctl NDCTL=/root/ndctl/build/ndctl/ndctl ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 DATA_PATH=/root/ndctl/test /bin/bash /root/ndctl/test/cxl-poison.sh
> 
> 11/12 ndctl:cxl / cxl-update-firmware.sh         OK              66.23s
> 12/12 ndctl:cxl / cxl-security.sh                SKIP             0.34s   exit status 77
> 
> Ok:                 9
> Expected Fail:      0
> Fail:               2
> Unexpected Pass:    0
> Skipped:            1
> Timeout:            0
> 
> My understanding is that these CXL tests are using mock CFMWs, not the actual physical memory regions at their fixed locations. So I wonder executing these set of test on a “sane" CXL emulation setup (run_qemu.sh creates) that the Intel folk is using does matter or not.

Right - these tests run on the mock CFMWS that the cxl_test module
creates. As for running on a 'sane' CXL emulation setup, like the one
run_qemu.sh builds: I may not be understanding the question, but I'll
take a shot. The QEMU-defined CXL devices do not matter at all for the
cxl unit test run; the unit tests only use the mock environment
provided by the cxl_test module.

Let me know if I missed the point you were making.

I noticed the FAIL cases in your test output, probably due to CONFIG_TRACING
not being enabled, and posted a patch to turn those into SKIPs.

--Alison

> 
> Itaru.


* Re: Internal error: Oops: 0000000096000044 [#11] SMP
  2025-05-23  3:28               ` Alison Schofield
@ 2025-05-23  4:56                 ` Itaru Kitayama
  0 siblings, 0 replies; 15+ messages in thread
From: Itaru Kitayama @ 2025-05-23  4:56 UTC (permalink / raw)
  To: Alison Schofield; +Cc: Jonathan Cameron, Dan Williams, Dave Jiang, linux-cxl

Hi Alison,

> On May 23, 2025, at 12:28, Alison Schofield <alison.schofield@intel.com> wrote:
> 
> On Fri, May 23, 2025 at 06:46:53AM +0900, Itaru Kitayama wrote:
>> Hi Jonathan,
>> 
>>> On May 22, 2025, at 22:56, Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>>> 
>>> On Wed, 21 May 2025 16:34:16 -0700
>>> Dan Williams <dan.j.williams@intel.com> wrote:
>>> 
>>>> Itaru Kitayama wrote:
>>>>> Dave et al.,  
>>>> [..]
>>>>> Rebuilt the rootfs image and tried today’s cx/next
>>>>> (6.15.0-rc4-00046-g6eed708a5693) again to boot now I don’t see the
>>>>> splats, so something I was messing my dev environment sorry about
>>>>> that.
>>>>> 
>>>>> CXL utility commands work reasonably now and I can execute meson test
>>>>> —suite cxl, while most of them still fails due to the HPA allocation
>>>>> error which makes me wonder as the resource requests are quite modest.   
>>>> 
>>>> So cxl_test_init() just "hopes" that the top of the system physical
>>>> address space is free to use to emulate CXL windows. That might be an
>>>> assumption that only works for x86_64, not ARM64. I would double check
>>>> that this code in cxl_test_init()
>>>> 
>>>>       rc = gen_pool_add(cxl_mock_pool, iomem_resource.end + 1 - SZ_64G,
>>>>                         SZ_64G, NUMA_NO_NODE);
>>>>       if (rc)
>>>>               goto err_gen_pool_add;
>>>> 
>>>> ...is not setting up CXL Windows that overlap with existing resources in
>>>> that range.
>>>> 
>>> 
>>> I think there are checks that block use of ranges up there.
>>> 
>>> Print I'm seeing is
>>> Hotplug memory [0xfffffff010000000-0xfffffff030000000] exceeds maximum addressable range [0x40000000-0xf80003fffffff]
>>> 
>>> I think right answer is to use mhp_get_pluggable_range(true); to check
>>> for limits on the range we can use.
>>> 
>>> On architectures that don't define arch_get_mappable_range()
>>> that ends up the as (unsigned long)-1 which I think would work
>>> though there may be other stuff up there.  Maybe min(iomem_resource.end + 1 - SZ_64G,
>>>    mappable_range.end + 1 - SZ_64G)
>>> or something like that adapted to avoid wrap around.
>>> 
>>> I haven't yet sanity checked this doesn't break x86 but I think it should
>>> end up making no difference to the locations on x86.
>>> 
>>> 
>>> With the below - all 11 tests in ndctl cxl test suite pass for me.
>>> 
>>> From b287ff2c5ee7fbe507ef8cb61df3e4e156a9773f Mon Sep 17 00:00:00 2001
>>> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> Date: Thu, 22 May 2025 14:20:42 +0100
>>> Subject: [PATCH] cxl_test: Limit location for fake CFMWS to mappable range
>>> 
>>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> ---
>>> tools/testing/cxl/test/cxl.c | 6 +++++-
>>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
>>> index 8a5815ca870d..b4e6c7659ac4 100644
>>> --- a/tools/testing/cxl/test/cxl.c
>>> +++ b/tools/testing/cxl/test/cxl.c
>>> @@ -1328,6 +1328,7 @@ static int cxl_mem_init(void)
>>> static __init int cxl_test_init(void)
>>> {
>>> int rc, i;
>>> + struct range mappable;
>>> 
>>> cxl_acpi_test();
>>> cxl_core_test();
>>> @@ -1342,8 +1343,11 @@ static __init int cxl_test_init(void)
>>> rc = -ENOMEM;
>>> goto err_gen_pool_create;
>>> }
>>> + mappable = mhp_get_pluggable_range(true);
>>> 
>>> - rc = gen_pool_add(cxl_mock_pool, iomem_resource.end + 1 - SZ_64G,
>>> + rc = gen_pool_add(cxl_mock_pool,
>>> +  min(iomem_resource.end + 1 - SZ_64G,
>>> +      mappable.end + 1 - SZ_64G),
>>> SZ_64G, NUMA_NO_NODE);
>>> if (rc)
>>> goto err_gen_pool_add;
>>> -- 
>>> 2.43.0
>>> 
>> 
>> Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com <mailto:itaru.kitayama@fujitsu.com>>
>> 
>> # meson test --suite cxl
>> ninja: Entering directory `/root/ndctl/build'
>> [1/82] Generating version.h with a custom command
>> 1/12 ndctl:cxl / cxl-topology.sh                OK              33.96s
>> 2/12 ndctl:cxl / cxl-region-sysfs.sh            OK              18.00s
>> 3/12 ndctl:cxl / cxl-labels.sh                  OK              23.78s
>> 4/12 ndctl:cxl / cxl-create-region.sh           OK              43.03s
>> 5/12 ndctl:cxl / cxl-xor-region.sh              OK              19.30s
>> 6/12 ndctl:cxl / cxl-events.sh                  FAIL             6.40s   exit status 1
>>>>> LD_LIBRARY_PATH=/root/ndctl/build/daxctl/lib:/root/ndctl/build/cxl/lib:/root/ndctl/build/ndctl/lib MALLOC_PERTURB_=45 TEST_PATH=/root/ndctl/build/test UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MESON_TEST_ITERATION=1 DAXCTL=/root/ndctl/build/daxctl/daxctl NDCTL=/root/ndctl/build/ndctl/ndctl ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 DATA_PATH=/root/ndctl/test /bin/bash /root/ndctl/test/cxl-events.sh
>> 
>> 7/12 ndctl:cxl / cxl-sanitize.sh                OK              14.77s
>> 8/12 ndctl:cxl / cxl-destroy-region.sh          OK              13.69s
>> 9/12 ndctl:cxl / cxl-qos-class.sh               OK              14.31s
>> 10/12 ndctl:cxl / cxl-poison.sh                  FAIL             3.46s   exit status 1
>>>>> LD_LIBRARY_PATH=/root/ndctl/build/daxctl/lib:/root/ndctl/build/cxl/lib:/root/ndctl/build/ndctl/lib MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MALLOC_PERTURB_=80 UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 TEST_PATH=/root/ndctl/build/test MESON_TEST_ITERATION=1 DAXCTL=/root/ndctl/build/daxctl/daxctl NDCTL=/root/ndctl/build/ndctl/ndctl ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 DATA_PATH=/root/ndctl/test /bin/bash /root/ndctl/test/cxl-poison.sh
>> 
>> 11/12 ndctl:cxl / cxl-update-firmware.sh         OK              66.23s
>> 12/12 ndctl:cxl / cxl-security.sh                SKIP             0.34s   exit status 77
>> 
>> Ok:                 9
>> Expected Fail:      0
>> Fail:               2
>> Unexpected Pass:    0
>> Skipped:            1
>> Timeout:            0
>> 
>> My understanding is that these CXL tests use mock CFMWs, not the actual physical memory regions at their fixed locations. So I wonder whether executing this set of tests on a "sane" CXL emulation setup (the one run_qemu.sh creates) that the Intel folks are using matters or not.
> 
> Right - these tests run on the mock CFMWs that the cxl-test module
> creates. As far as running on a 'sane' CXL emulation setup, like
> run_qemu.sh, I may not be understanding the question, but I'll take
> a shot. The QEMU-defined CXL devices do not matter at all for the cxl
> unit test run. The unit tests only use the mock cxl/test environment
> provided by the cxl-test module. The QEMU CXL devices are irrelevant.

Ah, I see; thanks for the clarification. That's what I needed to know.

> 
> Let me know if I missed the point you were making.
> 
> I noticed the FAIL cases in your test output, probably due to
> CONFIG_TRACING not being enabled, and posted a patch to turn those into SKIPs.

Indeed, I figured that out by looking at the test logs. Now, as Jonathan confirmed, I have just seen the same results:

 1/12 ndctl:cxl / cxl-topology.sh                OK             106.48s
 2/12 ndctl:cxl / cxl-region-sysfs.sh            OK              55.90s
 3/12 ndctl:cxl / cxl-labels.sh                  OK              54.95s
 4/12 ndctl:cxl / cxl-create-region.sh           OK             141.98s
 5/12 ndctl:cxl / cxl-xor-region.sh              OK              66.00s
 6/12 ndctl:cxl / cxl-events.sh                  OK              33.82s
 7/12 ndctl:cxl / cxl-sanitize.sh                OK              34.92s
 8/12 ndctl:cxl / cxl-destroy-region.sh          OK              41.08s
 9/12 ndctl:cxl / cxl-qos-class.sh               OK              40.55s
10/12 ndctl:cxl / cxl-poison.sh                  OK              82.08s
11/12 ndctl:cxl / cxl-update-firmware.sh         OK              99.39s
12/12 ndctl:cxl / cxl-security.sh                SKIP             1.03s   exit status 77

Thanks again for your comments.

Itaru.

> 
> --Alison
> 
>> 
>> Itaru.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Internal error: Oops: 0000000096000044 [#11] SMP
  2025-05-22 13:56           ` Jonathan Cameron
  2025-05-22 18:19             ` Dan Williams
  2025-05-22 21:46             ` Itaru Kitayama
@ 2025-05-23  5:52             ` Marc Herbert
  2 siblings, 0 replies; 15+ messages in thread
From: Marc Herbert @ 2025-05-23  5:52 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams; +Cc: Itaru Kitayama, Dave Jiang, linux-cxl

On 2025-05-22 06:56, Jonathan Cameron wrote:

> 
> With the below - all 11 tests in the ndctl cxl test suite pass for me.
> 
> From b287ff2c5ee7fbe507ef8cb61df3e4e156a9773f Mon Sep 17 00:00:00 2001
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Date: Thu, 22 May 2025 14:20:42 +0100
> Subject: [PATCH] cxl_test: Limit location for fake CFMWS to mappable range
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  tools/testing/cxl/test/cxl.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
> index 8a5815ca870d..b4e6c7659ac4 100644
> --- a/tools/testing/cxl/test/cxl.c
> +++ b/tools/testing/cxl/test/cxl.c
> @@ -1328,6 +1328,7 @@ static int cxl_mem_init(void)
>  static __init int cxl_test_init(void)
>  {
>  	int rc, i;
> +	struct range mappable;
>  
>  	cxl_acpi_test();
>  	cxl_core_test();
> @@ -1342,8 +1343,11 @@ static __init int cxl_test_init(void)
>  		rc = -ENOMEM;
>  		goto err_gen_pool_create;
>  	}
> +	mappable = mhp_get_pluggable_range(true);
>  
> -	rc = gen_pool_add(cxl_mock_pool, iomem_resource.end + 1 - SZ_64G,
> +	rc = gen_pool_add(cxl_mock_pool,
> +			  min(iomem_resource.end + 1 - SZ_64G,
> +			      mappable.end + 1 - SZ_64G),
>  			  SZ_64G, NUMA_NO_NODE);
>  	if (rc)
>  		goto err_gen_pool_add;

Tested-by: Marc Herbert <marc.herbert@linux.intel.com>

cxl-security.sh aside, this patch turns all CXL test results from red to
green and finally fixes the three-month-old
https://github.com/pmem/ndctl/issues/278, which has a ton of relevant
context.

Without Jonathan's fix, Itaru's and everyone else's runs should be all
red too. But they are not, because the current test suite does not check
for kernel errors (!), which causes a lot of false negatives
("green failures").

These serious false negatives are addressed by v2 of my patch:
https://lore.kernel.org/linux-cxl/20250515021730.1201996-3-marc.herbert@linux.intel.com/T/#u

Please help review and test it, thanks! Now is the perfect time to test it.


Thread overview: 15+ messages
-- links below jump to the message on this page --
2025-05-21  8:39 Internal error: Oops: 0000000096000044 [#11] SMP Itaru Kitayama
2025-05-21 15:31 ` Dave Jiang
2025-05-21 20:38   ` Itaru Kitayama
2025-05-21 20:46     ` Dave Jiang
2025-05-21 23:28       ` Itaru Kitayama
2025-05-21 23:34         ` Dan Williams
2025-05-22 13:56           ` Jonathan Cameron
2025-05-22 18:19             ` Dan Williams
2025-05-22 21:46             ` Itaru Kitayama
2025-05-23  3:28               ` Alison Schofield
2025-05-23  4:56                 ` Itaru Kitayama
2025-05-23  5:52             ` Marc Herbert
2025-05-21 15:33 ` Alison Schofield
2025-05-21 15:36 ` Jonathan Cameron
2025-05-21 15:41 ` Alison Schofield
