From: Alison Schofield <alison.schofield@intel.com>
To: <shiju.jose@huawei.com>
Cc: <linux-cxl@vger.kernel.org>, <dan.j.williams@intel.com>,
<jonathan.cameron@huawei.com>, <dave.jiang@intel.com>,
<dave@stgolabs.net>, <vishal.l.verma@intel.com>,
<ira.weiny@intel.com>, <linux-edac@vger.kernel.org>,
<linux-doc@vger.kernel.org>, <bp@alien8.de>,
<tony.luck@intel.com>, <lenb@kernel.org>, <Yazen.Ghannam@amd.com>,
<mchehab@kernel.org>, <nifan.cxl@gmail.com>,
<linuxarm@huawei.com>, <tanxiaofei@huawei.com>,
<prime.zeng@hisilicon.com>, <roberto.sassu@huawei.com>,
<kangkang.shen@futurewei.com>, <wanghuiqiang@huawei.com>
Subject: Re: [PATCH v5 6/8] cxl/edac: Support for finding memory operation attributes from the current boot
Date: Mon, 19 May 2025 21:31:00 -0700 [thread overview]
Message-ID: <aCwFhNeeX0YC41A3@aschofie-mobl2.lan> (raw)
In-Reply-To: <20250515115927.772-7-shiju.jose@huawei.com>
On Thu, May 15, 2025 at 12:59:22PM +0100, shiju.jose@huawei.com wrote:
> From: Shiju Jose <shiju.jose@huawei.com>
>
> Certain operations on memory, such as memory repair, are permitted
> only when the address and other attributes for the operation are
> from the current boot. This is determined by checking whether the
> memory attributes for the operation match those in the CXL gen_media
> or CXL DRAM memory event records reported during the current boot.
>
> The CXL event records must be backed up because they are cleared
> in the hardware after being processed by the kernel.
>
> Support is added for storing CXL gen_media or CXL DRAM memory event
> records in xarrays. Old records are deleted when they expire or when
> there is an overflow and which depends on platform correctly report
> Event Record Timestamp field of CXL spec Table 8-55 Common Event
> Record Format.
>
> Additionally, helper functions are implemented to find a matching
> record in the xarray storage based on the memory attributes and
> repair type.
>
> Add validity check, when matching attributes for sparing, using
> the validity flag in the DRAM event record, to ensure that all
> required attributes for a requested repair operation are valid and
> set.
Using cxl-test I'm getting the below stack trace when trying to
modprobe -r cxl-test.
These cxl-test memdevs have num_ras_features == 0 so upon adding the memdevs
edac_dev_register() returns -EINVAL,
as does
devm_cxl_memdev_edac_register()
Later on when I try to modprobe -r cxl-test, it seems odd that the
devm_cxl_memdev_edac_release() is being called for this device, but
it is, and that where the kfree() oops happens. Are those register
and release functions intended to be paired/reciprocal ?
Here's the code segment (from this patch) where it is failing,
and the stack trace follows.
> +
> +void devm_cxl_memdev_edac_release(struct cxl_memdev *cxlmd)
> +{
> +#ifdef CONFIG_CXL_EDAC_MEM_REPAIR
> + struct cxl_mem_err_rec *array_rec = cxlmd->array_err_rec;
> + struct cxl_event_gen_media *rec_gen_media;
> + struct cxl_event_dram *rec_dram;
> + unsigned long index;
> +
> + if (!array_rec)
> + return;
> +
> + xa_for_each(&array_rec->rec_dram, index, rec_dram)
> + kfree(rec_dram);
It fails above. That's as far as I've gotten. Hoping that you recognize
something with this case where num_ras_features == 0 that can lead to this.
Here's the stack trace, and I'll hop to the point in the diff below
where it's failing:
[ 108.114491] cxl mem9: ALISON try_free_rec_dram
[ 108.115654] Oops: general protection fault, probably for non-canonical address 0x1ad998badadad88: 0000 [#1] SMP NOPTI
[ 108.118125] CPU: 0 UID: 0 PID: 1187 Comm: modprobe Tainted: G O 6.15.0-rc4+ #1 PREEMPT(voluntary)
[ 108.120507] Tainted: [O]=OOT_MODULE
[ 108.121202] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 108.122466] RIP: 0010:kfree+0x6e/0x300
[ 108.123099] Code: 80 48 01 d8 0f 82 98 02 00 00 48 c7 c2 00 00 00 80 48 2b 15 94 70 61 01 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 72 70 61 01 <48> 8b 50 08 49 89 c5 f6 c2 01 0f 85 4c 01 00 00 0f 1f 44 00 00 41
[ 108.126050] RSP: 0018:ffffc90002c03be0 EFLAGS: 00010203
[ 108.126905] RAX: 01ad998badadad80 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000002
[ 108.128035] RDX: 0000777f80000000 RSI: ffffffff82a44991 RDI: ffffffff82a5ac86
[ 108.129194] RBP: ffffc90002c03c30 R08: 0000000000000000 R09: 0000000000000000
[ 108.130346] R10: 0000000000000001 R11: ffff888276ffe000 R12: ffff8880075a5000
[ 108.131414] R13: ffff88800680a578 R14: 0000000000000000 R15: 000000000000000b
[ 108.132234] FS: 00007fce1dd38740(0000) GS:ffff8880fa52b000(0000) knlGS:0000000000000000
[ 108.133256] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 108.133952] CR2: 00007ffce474e6e8 CR3: 0000000143296006 CR4: 0000000000370ef0
[ 108.134787] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 108.135560] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 108.136375] Call Trace:
[ 108.136694] <TASK>
[ 108.136939] devm_cxl_memdev_edac_release+0x6d/0x130 [cxl_core]
[ 108.137611] cxl_memdev_release+0x22/0x40 [cxl_core]
[ 108.138268] device_release+0x30/0x90
[ 108.138720] kobject_put+0x85/0x1c0
[ 108.139111] put_device+0xe/0x20
[ 108.139478] cxl_memdev_unregister+0x42/0x50 [cxl_core]
[ 108.140092] devm_action_release+0xd/0x20
[ 108.140531] devres_release_all+0xa5/0xe0
[ 108.141059] device_unbind_cleanup+0xd/0x70
[ 108.141416] device_release_driver_internal+0x1d3/0x220
[ 108.141890] device_release_driver+0xd/0x20
[ 108.142236] bus_remove_device+0xd7/0x140
[ 108.142578] device_del+0x15a/0x3a0
[ 108.142932] platform_device_del.part.0+0x13/0x80
[ 108.143330] platform_device_unregister+0x1b/0x40
[ 108.143786] cxl_test_exit+0x1a/0xd80 [cxl_test]
[ 108.144171] __x64_sys_delete_module+0x17d/0x260
[ 108.144579] ? debug_smp_processor_id+0x17/0x20
[ 108.145004] x64_sys_call+0x19bd/0x1f90
[ 108.145321] do_syscall_64+0x64/0x140
[ 108.145628] entry_SYSCALL_64_after_hwframe+0x71/0x79
[ 108.146284] RIP: 0033:0x7fce1d5128cb
[ 108.146775] Code: 73 01 c3 48 8b 0d 55 55 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 25 55 0e 00 f7 d8 64 89 01 48
[ 108.148672] RSP: 002b:00007ffce4752768 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 108.149490] RAX: ffffffffffffffda RBX: 0000556b12b5dd60 RCX: 00007fce1d5128cb
[ 108.150284] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000556b12b5ddc8
[ 108.151169] RBP: 0000556b12b5ddc8 R08: 1999999999999999 R09: 0000000000000000
[ 108.151982] R10: 00007fce1d59dac0 R11: 0000000000000206 R12: 0000000000000001
[ 108.152756] R13: 0000000000000000 R14: 00007ffce4754aa8 R15: 0000556b12b5def0
[ 108.153523] </TASK>
[ 108.153925] Modules linked in: dax_pmem nd_pmem kmem nd_btt device_dax dax_cxl nd_e820 nfit cxl_mock_mem(O) cxl_test(O-) cxl_mem(O) cxl_pmem(O) cxl_acpi(O) cxl_port(O) cxl_mock(O) cxl_core(O) libnvdimm
[ 108.155784] ---[ end trace 0000000000000000 ]---
next prev parent reply other threads:[~2025-05-20 4:31 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-15 11:59 [PATCH v5 0/8] cxl: support CXL memory RAS features shiju.jose
2025-05-15 11:59 ` [PATCH v5 1/8] EDAC: Update documentation for the CXL memory patrol scrub control feature shiju.jose
2025-05-15 17:18 ` Randy Dunlap
2025-05-15 11:59 ` [PATCH v5 2/8] cxl: Update prototype of function get_support_feature_info() shiju.jose
2025-05-15 11:59 ` [PATCH v5 3/8] cxl/edac: Add CXL memory device patrol scrub control feature shiju.jose
2025-05-20 2:02 ` Alison Schofield
2025-05-20 10:21 ` Jonathan Cameron
2025-05-20 23:44 ` Shiju Jose
2025-05-15 11:59 ` [PATCH v5 4/8] cxl/edac: Add CXL memory device ECS " shiju.jose
2025-05-15 11:59 ` [PATCH v5 5/8] cxl/edac: Add support for PERFORM_MAINTENANCE command shiju.jose
2025-05-15 11:59 ` [PATCH v5 6/8] cxl/edac: Support for finding memory operation attributes from the current boot shiju.jose
2025-05-20 4:31 ` Alison Schofield [this message]
2025-05-20 23:45 ` Shiju Jose
2025-05-15 11:59 ` [PATCH v5 7/8] cxl/edac: Add CXL memory device memory sparing control feature shiju.jose
2025-05-19 21:01 ` Alison Schofield
2025-05-19 21:15 ` Dave Jiang
2025-05-19 21:34 ` Shiju Jose
2025-05-15 11:59 ` [PATCH v5 8/8] cxl/edac: Add CXL memory device soft PPR " shiju.jose
2025-05-15 17:23 ` Randy Dunlap
2025-05-20 1:36 ` [PATCH v5 0/8] cxl: support CXL memory RAS features Alison Schofield
2025-05-20 23:44 ` Shiju Jose
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aCwFhNeeX0YC41A3@aschofie-mobl2.lan \
--to=alison.schofield@intel.com \
--cc=Yazen.Ghannam@amd.com \
--cc=bp@alien8.de \
--cc=dan.j.williams@intel.com \
--cc=dave.jiang@intel.com \
--cc=dave@stgolabs.net \
--cc=ira.weiny@intel.com \
--cc=jonathan.cameron@huawei.com \
--cc=kangkang.shen@futurewei.com \
--cc=lenb@kernel.org \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-edac@vger.kernel.org \
--cc=linuxarm@huawei.com \
--cc=mchehab@kernel.org \
--cc=nifan.cxl@gmail.com \
--cc=prime.zeng@hisilicon.com \
--cc=roberto.sassu@huawei.com \
--cc=shiju.jose@huawei.com \
--cc=tanxiaofei@huawei.com \
--cc=tony.luck@intel.com \
--cc=vishal.l.verma@intel.com \
--cc=wanghuiqiang@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).