Linux CXL
 help / color / mirror / Atom feed
From: Shiju Jose <shiju.jose@huawei.com>
To: Alison Schofield <alison.schofield@intel.com>
Cc: "linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	"dave.jiang@intel.com" <dave.jiang@intel.com>,
	"dave@stgolabs.net" <dave@stgolabs.net>,
	"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
	"ira.weiny@intel.com" <ira.weiny@intel.com>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"bp@alien8.de" <bp@alien8.de>,
	"tony.luck@intel.com" <tony.luck@intel.com>,
	"lenb@kernel.org" <lenb@kernel.org>,
	"Yazen.Ghannam@amd.com" <Yazen.Ghannam@amd.com>,
	"mchehab@kernel.org" <mchehab@kernel.org>,
	"nifan.cxl@gmail.com" <nifan.cxl@gmail.com>,
	Linuxarm <linuxarm@huawei.com>,
	tanxiaofei <tanxiaofei@huawei.com>,
	"Zengtao (B)" <prime.zeng@hisilicon.com>,
	Roberto Sassu <roberto.sassu@huawei.com>,
	"kangkang.shen@futurewei.com" <kangkang.shen@futurewei.com>,
	wanghuiqiang <wanghuiqiang@huawei.com>
Subject: RE: [PATCH v5 6/8] cxl/edac: Support for finding memory operation attributes from the current boot
Date: Tue, 20 May 2025 23:45:12 +0000	[thread overview]
Message-ID: <879667a6c73a475ab8ef06b0448258a3@huawei.com> (raw)
In-Reply-To: <aCwFhNeeX0YC41A3@aschofie-mobl2.lan>

>-----Original Message-----
>From: Alison Schofield <alison.schofield@intel.com>
>Sent: 20 May 2025 05:31
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-cxl@vger.kernel.org; dan.j.williams@intel.com; Jonathan Cameron
><jonathan.cameron@huawei.com>; dave.jiang@intel.com; dave@stgolabs.net;
>vishal.l.verma@intel.com; ira.weiny@intel.com; linux-edac@vger.kernel.org;
>linux-doc@vger.kernel.org; bp@alien8.de; tony.luck@intel.com;
>lenb@kernel.org; Yazen.Ghannam@amd.com; mchehab@kernel.org;
>nifan.cxl@gmail.com; Linuxarm <linuxarm@huawei.com>; tanxiaofei
><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto
>Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com;
>wanghuiqiang <wanghuiqiang@huawei.com>
>Subject: Re: [PATCH v5 6/8] cxl/edac: Support for finding memory operation
>attributes from the current boot
>
>On Thu, May 15, 2025 at 12:59:22PM +0100, shiju.jose@huawei.com wrote:
>> From: Shiju Jose <shiju.jose@huawei.com>
>>
>> Certain operations on memory, such as memory repair, are permitted
>> only when the address and other attributes for the operation are from
>> the current boot. This is determined by checking whether the memory
>> attributes for the operation match those in the CXL gen_media or CXL
>> DRAM memory event records reported during the current boot.
>>
>> The CXL event records must be backed up because they are cleared in
>> the hardware after being processed by the kernel.
>>
>> Support is added for storing CXL gen_media or CXL DRAM memory event
>> records in xarrays. Old records are deleted when they expire or when
>> there is an overflow and which depends on platform correctly report
>> Event Record Timestamp field of CXL spec Table 8-55 Common Event
>> Record Format.
>>
>> Additionally, helper functions are implemented to find a matching
>> record in the xarray storage based on the memory attributes and repair
>> type.
>>
>> Add validity check, when matching attributes for sparing, using the
>> validity flag in the DRAM event record, to ensure that all required
>> attributes for a requested repair operation are valid and set.
>
>Using cxl-test I'm getting the below stack trace when trying to modprobe -r cxl-
>test.
>
>These cxl-test memdevs have num_ras_features == 0 so upon adding the
>memdevs
>	edac_dev_register() returns -EINVAL,
>as does
>	devm_cxl_memdev_edac_register()
>
>Later on when I try to modprobe -r cxl-test, it seems odd that the
>devm_cxl_memdev_edac_release() is being called for this device, but it is, and
>that where the kfree() oops happens.  Are those register and release functions
>intended to be paired/reciprocal ?
>
>Here's the code segment (from this patch) where it is failing, and the stack trace
>follows.
>
>> +
>> +void devm_cxl_memdev_edac_release(struct cxl_memdev *cxlmd) { #ifdef
>> +CONFIG_CXL_EDAC_MEM_REPAIR
>> +	struct cxl_mem_err_rec *array_rec = cxlmd->array_err_rec;
>> +	struct cxl_event_gen_media *rec_gen_media;
>> +	struct cxl_event_dram *rec_dram;
>> +	unsigned long index;
>> +
>> +	if (!array_rec)
>> +		return;
>> +
>> +	xa_for_each(&array_rec->rec_dram, index, rec_dram)
>> +		kfree(rec_dram);
>
>It fails above. That's as far as I've gotten. Hoping that you recognize something
>with this case where num_ras_features == 0 that can lead to this.
>
>Here's the stack trace, and I'll hop to the point in the diff below where it's
>failing:

Thanks Alison for reporting and sharing the debug info.

I think following  changes should fix the issue.  
Dave verified fix in his test setup. Thanks Dave.

Thanks,
Shiju
===============================
diff --git a/drivers/cxl/core/edac.c b/drivers/cxl/core/edac.c index 87b347c232b2..0f001f969228 100644
--- a/drivers/cxl/core/edac.c
+++ b/drivers/cxl/core/edac.c
@@ -1992,15 +1992,6 @@ int devm_cxl_memdev_edac_register(struct cxl_memdev *cxlmd)  #endif
 
 #ifdef CONFIG_CXL_EDAC_MEM_REPAIR
-       struct cxl_mem_err_rec *array_rec =
-               devm_kzalloc(&cxlmd->dev, sizeof(*array_rec), GFP_KERNEL);
-       if (!array_rec)
-               return -ENOMEM;
-
-       cxlmd->array_err_rec = array_rec;
-       xa_init(&array_rec->rec_gen_media);
-       xa_init(&array_rec->rec_dram);
-
        for (int i = 0; i < CXL_MEM_SPARING_MAX; i++) {
                rc = cxl_memdev_sparing_init(cxlmd,
                                             &ras_features[num_ras_features], @@ -2023,6 +2014,16 @@ int devm_cxl_memdev_edac_register(struct cxl_memdev *cxlmd)
                repair_inst++;
                num_ras_features++;
        }
+       if (repair_inst) {
+               struct cxl_mem_err_rec *array_rec =
+                       devm_kzalloc(&cxlmd->dev, sizeof(*array_rec), GFP_KERNEL);
+               if (!array_rec)
+                       return -ENOMEM;
+
+               cxlmd->array_err_rec = array_rec;
+               xa_init(&array_rec->rec_gen_media);
+               xa_init(&array_rec->rec_dram);
+       }
 #endif
        char *cxl_dev_name __free(kfree) =
                kasprintf(GFP_KERNEL, "cxl_%s", dev_name(&cxlmd->dev));

==========================
>
>
>[  108.114491] cxl mem9: ALISON try_free_rec_dram [  108.115654] Oops:
>general protection fault, probably for non-canonical address
>0x1ad998badadad88: 0000 [#1] SMP NOPTI
>[  108.118125] CPU: 0 UID: 0 PID: 1187 Comm: modprobe Tainted: G           O
>6.15.0-rc4+ #1 PREEMPT(voluntary)
>[  108.120507] Tainted: [O]=OOT_MODULE
>[  108.121202] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>0.0.0 02/06/2015 [  108.122466] RIP: 0010:kfree+0x6e/0x300 [  108.123099]
>Code: 80 48 01 d8 0f 82 98 02 00 00 48 c7 c2 00 00 00 80 48 2b 15 94 70 61 01
>48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 72 70 61 01 <48> 8b 50 08 49 89 c5 f6
>c2 01 0f 85 4c 01 00 00 0f 1f 44 00 00 41 [  108.126050] RSP:
>0018:ffffc90002c03be0 EFLAGS: 00010203 [  108.126905] RAX:
>01ad998badadad80 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000002 [
>108.128035] RDX: 0000777f80000000 RSI: ffffffff82a44991 RDI: ffffffff82a5ac86
>[  108.129194] RBP: ffffc90002c03c30 R08: 0000000000000000 R09:
>0000000000000000 [  108.130346] R10: 0000000000000001 R11:
>ffff888276ffe000 R12: ffff8880075a5000 [  108.131414] R13: ffff88800680a578
>R14: 0000000000000000 R15: 000000000000000b [  108.132234] FS:
>00007fce1dd38740(0000) GS:ffff8880fa52b000(0000) knlGS:0000000000000000
>[  108.133256] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [
>108.133952] CR2: 00007ffce474e6e8 CR3: 0000000143296006 CR4:
>0000000000370ef0 [  108.134787] DR0: 0000000000000000 DR1:
>0000000000000000 DR2: 0000000000000000 [  108.135560] DR3:
>0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [
>108.136375] Call Trace:
>[  108.136694]  <TASK>
>[  108.136939]  devm_cxl_memdev_edac_release+0x6d/0x130 [cxl_core] [
>108.137611]  cxl_memdev_release+0x22/0x40 [cxl_core] [  108.138268]
>device_release+0x30/0x90 [  108.138720]  kobject_put+0x85/0x1c0 [
>108.139111]  put_device+0xe/0x20 [  108.139478]
>cxl_memdev_unregister+0x42/0x50 [cxl_core] [  108.140092]
>devm_action_release+0xd/0x20 [  108.140531]  devres_release_all+0xa5/0xe0 [
>108.141059]  device_unbind_cleanup+0xd/0x70 [  108.141416]
>device_release_driver_internal+0x1d3/0x220
>[  108.141890]  device_release_driver+0xd/0x20 [  108.142236]
>bus_remove_device+0xd7/0x140 [  108.142578]  device_del+0x15a/0x3a0 [
>108.142932]  platform_device_del.part.0+0x13/0x80
>[  108.143330]  platform_device_unregister+0x1b/0x40
>[  108.143786]  cxl_test_exit+0x1a/0xd80 [cxl_test] [  108.144171]
>__x64_sys_delete_module+0x17d/0x260
>[  108.144579]  ? debug_smp_processor_id+0x17/0x20 [  108.145004]
>x64_sys_call+0x19bd/0x1f90 [  108.145321]  do_syscall_64+0x64/0x140 [
>108.145628]  entry_SYSCALL_64_after_hwframe+0x71/0x79
>[  108.146284] RIP: 0033:0x7fce1d5128cb
>[  108.146775] Code: 73 01 c3 48 8b 0d 55 55 0e 00 f7 d8 64 89 01 48 83 c8 ff c3
>66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff
>ff 73 01 c3 48 8b 0d 25 55 0e 00 f7 d8 64 89 01 48 [  108.148672] RSP:
>002b:00007ffce4752768 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [
>108.149490] RAX: ffffffffffffffda RBX: 0000556b12b5dd60 RCX:
>00007fce1d5128cb [  108.150284] RDX: 0000000000000000 RSI:
>0000000000000800 RDI: 0000556b12b5ddc8 [  108.151169] RBP:
>0000556b12b5ddc8 R08: 1999999999999999 R09: 0000000000000000 [
>108.151982] R10: 00007fce1d59dac0 R11: 0000000000000206 R12:
>0000000000000001 [  108.152756] R13: 0000000000000000 R14:
>00007ffce4754aa8 R15: 0000556b12b5def0 [  108.153523]  </TASK> [
>108.153925] Modules linked in: dax_pmem nd_pmem kmem nd_btt device_dax
>dax_cxl nd_e820 nfit cxl_mock_mem(O) cxl_test(O-) cxl_mem(O) cxl_pmem(O)
>cxl_acpi(O) cxl_port(O) cxl_mock(O) cxl_core(O) libnvdimm [  108.155784] ---[
>end trace 0000000000000000 ]---
>


  reply	other threads:[~2025-05-20 23:45 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-15 11:59 [PATCH v5 0/8] cxl: support CXL memory RAS features shiju.jose
2025-05-15 11:59 ` [PATCH v5 1/8] EDAC: Update documentation for the CXL memory patrol scrub control feature shiju.jose
2025-05-15 17:18   ` Randy Dunlap
2025-05-15 11:59 ` [PATCH v5 2/8] cxl: Update prototype of function get_support_feature_info() shiju.jose
2025-05-15 11:59 ` [PATCH v5 3/8] cxl/edac: Add CXL memory device patrol scrub control feature shiju.jose
2025-05-20  2:02   ` Alison Schofield
2025-05-20 10:21     ` Jonathan Cameron
2025-05-20 23:44     ` Shiju Jose
2025-05-15 11:59 ` [PATCH v5 4/8] cxl/edac: Add CXL memory device ECS " shiju.jose
2025-05-15 11:59 ` [PATCH v5 5/8] cxl/edac: Add support for PERFORM_MAINTENANCE command shiju.jose
2025-05-15 11:59 ` [PATCH v5 6/8] cxl/edac: Support for finding memory operation attributes from the current boot shiju.jose
2025-05-20  4:31   ` Alison Schofield
2025-05-20 23:45     ` Shiju Jose [this message]
2025-05-15 11:59 ` [PATCH v5 7/8] cxl/edac: Add CXL memory device memory sparing control feature shiju.jose
2025-05-19 21:01   ` Alison Schofield
2025-05-19 21:15     ` Dave Jiang
2025-05-19 21:34       ` Shiju Jose
2025-05-15 11:59 ` [PATCH v5 8/8] cxl/edac: Add CXL memory device soft PPR " shiju.jose
2025-05-15 17:23   ` Randy Dunlap
2025-05-20  1:36 ` [PATCH v5 0/8] cxl: support CXL memory RAS features Alison Schofield
2025-05-20 23:44   ` Shiju Jose

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=879667a6c73a475ab8ef06b0448258a3@huawei.com \
    --to=shiju.jose@huawei.com \
    --cc=Yazen.Ghannam@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=bp@alien8.de \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=ira.weiny@intel.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=kangkang.shen@futurewei.com \
    --cc=lenb@kernel.org \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linuxarm@huawei.com \
    --cc=mchehab@kernel.org \
    --cc=nifan.cxl@gmail.com \
    --cc=prime.zeng@hisilicon.com \
    --cc=roberto.sassu@huawei.com \
    --cc=tanxiaofei@huawei.com \
    --cc=tony.luck@intel.com \
    --cc=vishal.l.verma@intel.com \
    --cc=wanghuiqiang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox