From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: <shiju.jose@huawei.com>
Cc: <linux-cxl@vger.kernel.org>, <dan.j.williams@intel.com>,
<dave@stgolabs.net>, <dave.jiang@intel.com>,
<alison.schofield@intel.com>, <vishal.l.verma@intel.com>,
<ira.weiny@intel.com>, <david@redhat.com>,
<Vilas.Sridharan@amd.com>, <linux-edac@vger.kernel.org>,
<linux-acpi@vger.kernel.org>, <linux-mm@kvack.org>,
<linux-kernel@vger.kernel.org>, <bp@alien8.de>,
<tony.luck@intel.com>, <rafael@kernel.org>, <lenb@kernel.org>,
<mchehab@kernel.org>, <leo.duran@amd.com>,
<Yazen.Ghannam@amd.com>, <rientjes@google.com>,
<jiaqiyan@google.com>, <Jon.Grimm@amd.com>,
<dave.hansen@linux.intel.com>, <naoya.horiguchi@nec.com>,
<james.morse@arm.com>, <jthoughton@google.com>,
<somasundaram.a@hpe.com>, <erdemaktas@google.com>,
<pgonda@google.com>, <duenwen@google.com>, <gthelen@google.com>,
<wschwartz@amperecomputing.com>, <dferguson@amperecomputing.com>,
<wbs@os.amperecomputing.com>, <nifan.cxl@gmail.com>,
<tanxiaofei@huawei.com>, <prime.zeng@hisilicon.com>,
<roberto.sassu@huawei.com>, <kangkang.shen@futurewei.com>,
<wanghuiqiang@huawei.com>, <linuxarm@huawei.com>
Subject: Re: [PATCH 8/8] cxl/memfeature: Add CXL memory device memory sparing control feature
Date: Fri, 7 Mar 2025 09:11:37 +0800 [thread overview]
Message-ID: <20250307091137.00006a0a@huawei.com> (raw)
In-Reply-To: <20250227223816.2036-9-shiju.jose@huawei.com>
On Thu, 27 Feb 2025 22:38:15 +0000
<shiju.jose@huawei.com> wrote:
> From: Shiju Jose <shiju.jose@huawei.com>
>
> Memory sparing is defined as a repair function that replaces a portion of
> memory with a portion of functional memory at that same DPA. The subclasses
> for this operation vary in terms of the scope of the sparing being
> performed. The cacheline sparing subclass refers to a sparing action that
> can replace a full cacheline. Row sparing is provided as an alternative to
> PPR sparing functions and its scope is that of a single DDR row.
> As per CXL r3.2 Table 8-125 foot note 1. Memory sparing is preferred over
> PPR when possible.
> Bank sparing allows an entire bank to be replaced. Rank sparing is defined
> as an operation in which an entire DDR rank is replaced.
>
> Memory sparing maintenance operations may be supported by CXL devices
> that implement CXL.mem protocol. A sparing maintenance operation requests
> the CXL device to perform a repair operation on its media.
> For example, a CXL device with DRAM components that support memory sparing
> features may implement sparing maintenance operations.
>
> The host may issue a query command by setting query resources flag in the
> input payload (CXL spec 3.2 Table 8-120) to determine availability of
> sparing resources for a given address. In response to a query request,
> the device shall report the resource availability by producing the memory
> sparing event record (CXL spec 3.2 Table 8-60) in which the Channel, Rank,
> Nibble Mask, Bank Group, Bank, Row, Column, Sub-Channel fields are a copy
> of the values specified in the request.
>
> During the execution of a sparing maintenance operation, a CXL memory
> device:
> - may not retain data
> - may not be able to process CXL.mem requests correctly.
> These CXL memory device capabilities are specified by restriction flags
> in the memory sparing feature readable attributes.
>
> When a CXL device identifies error on a memory component, the device
> may inform the host about the need for a memory sparing maintenance
> operation by using DRAM event record, where the 'maintenance needed' flag
> may set. The event record contains some of the DPA, Channel, Rank,
> Nibble Mask, Bank Group, Bank, Row, Column, Sub-Channel fields that
> should be repaired. The userspace tool requests for maintenance operation
> if the 'maintenance needed' flag set in the CXL DRAM error record.
>
> CXL spec 3.2 section 8.2.10.7.1.4 describes the device's memory sparing
> maintenance operation feature.
>
> CXL spec 3.2 section 8.2.10.7.2.3 describes the memory sparing feature
> discovery and configuration.
>
> Add support for controlling CXL memory device memory sparing feature.
> Register with EDAC driver, which gets the memory repair attr descriptors
> from the EDAC memory repair driver and exposes sysfs repair control
> attributes for memory sparing to the userspace. For example CXL memory
> sparing control for the CXL mem0 device is exposed in
> /sys/bus/edac/devices/cxl_mem0/mem_repairX/
>
> Use case
> ========
> 1. CXL device identifies a failure in a memory component, report to
> userspace in a CXL DRAM trace event with DPA and other attributes of
> memory to repair such as channel, rank, nibble mask, bank Group,
> bank, row, column, sub-channel.
>
> 2. Rasdaemon process the trace event and may issue query request in sysfs
> check resources available for memory sparing if either of the following
> conditions met.
> - 'maintenance needed' flag set in the event record.
> - 'threshold event' flag set for CVME threshold feature.
> - If the previous case is not enough, may be when the number of corrected
> error reported on a CXL.mem media to the user space exceeds an error
> threshold set in the userspace policy.
>
> 3. Rasdaemon process the memory sparing trace event and issue repair
> request for memory sparing.
>
> Kernel CXL driver shall report memory sparing event record to the userspace
> with the resource availability in order rasdaemon to process the event
> record and issue a repair request in sysfs for the memory sparing operation
> in the CXL device.
>
> Note: Based on the feedbacks from the community 'query' sysfs attribute is
> removed and reporting memory sparing error record to the userspace are not
> supported. Instead userspace issues sparing operation and kernel does the
> same to the CXL memory device, when 'maintenance needed' flag set in the
> DRAM event record.
>
> Add checks to ensure the memory to be repaired is offline and if online,
> then originates from a CXL DRAM error record reported in the current boot
> before requesting a memory sparing operation on the device.
>
> Tested for memory sparing control feature with
> "hw/cxl: Add memory sparing control feature"
> Repository: "https://gitlab.com/shiju.jose/qemu.git"
> Branch: cxl-ras-features-2024-10-24
>
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Similar comment to earlier on maybe using single line comments
in more places rather than multiline. Perhaps worth doing
that if you are respinning for other reasons.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> +static int cxl_mem_do_sparing_op(struct device *dev,
> + struct cxl_mem_sparing_context *cxl_sparing_ctx,
> + struct cxl_memdev_sparing_params *rd_params)
> +{
> + struct cxl_memdev *cxlmd = cxl_sparing_ctx->cxlmd;
> + struct cxl_memdev_sparing_in_payload sparing_pi;
> + struct cxl_event_dram *rec = NULL;
> + u16 validity_flags = 0;
> +
> + if (!rd_params->cap_safe_when_in_use) {
> + /*
> + * Memory to repair must be offline
> + */
> + if (cxl_are_decoders_committed(cxlmd))
> + return -EBUSY;
> + /*
> + * offline, so good for repair
> + */
More places as below where a single line comment would be fine
and make a reader scroll a bit less.
> +static int cxl_memdev_sparing_init(struct cxl_memdev *cxlmd,
> + struct edac_dev_feature *ras_feature,
> + const struct cxl_mem_sparing_desc *desc,
> + u8 repair_inst)
> +{
> + struct cxl_mem_sparing_context *cxl_sparing_ctx;
> + struct cxl_memdev_sparing_params rd_params;
> + struct cxl_feat_entry *feat_entry;
> + int ret;
> +
> + feat_entry = cxl_get_feature_entry(cxlmd->cxlds, &desc->repair_uuid);
> + if (IS_ERR(feat_entry))
> + return -EOPNOTSUPP;
> +
> + if (!(le32_to_cpu(feat_entry->flags) & CXL_FEATURE_F_CHANGEABLE))
> + return -EOPNOTSUPP;
> +
> + cxl_sparing_ctx = devm_kzalloc(&cxlmd->dev, sizeof(*cxl_sparing_ctx),
> + GFP_KERNEL);
> + if (!cxl_sparing_ctx)
> + return -ENOMEM;
> +
> + *cxl_sparing_ctx = (struct cxl_mem_sparing_context) {
> + .get_feat_size = le16_to_cpu(feat_entry->get_feat_size),
> + .set_feat_size = le16_to_cpu(feat_entry->set_feat_size),
> + .get_version = feat_entry->get_feat_ver,
> + .set_version = feat_entry->set_feat_ver,
> + .effects = le16_to_cpu(feat_entry->effects),
> + .cxlmd = cxlmd,
> + .repair_type = desc->repair_type,
> + .granularity = desc->granularity,
> + .instance = repair_inst++,
> + };
> + uuid_copy(&cxl_sparing_ctx->repair_uuid, &desc->repair_uuid);
> +
> + /*
> + * Read CXL device's sparing capabilities.
a below.
> + */
> + ret = cxl_mem_sparing_get_attrs(cxl_sparing_ctx, &rd_params);
> + if (ret)
> + return ret;
> +
> + /*
> + * Set default value for persist_mode.
> + */
If respining some of these comments don't need to be multiline.
next prev parent reply other threads:[~2025-03-07 1:11 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-27 22:38 [PATCH 0/8] cxl: support CXL memory RAS features shiju.jose
2025-02-27 22:38 ` [PATCH 1/8] cxl: Add helper function to retrieve a feature entry shiju.jose
2025-03-07 0:55 ` Jonathan Cameron
2025-03-07 1:58 ` Fan Ni
2025-03-07 19:19 ` Alison Schofield
2025-03-10 18:15 ` Shiju Jose
2025-03-10 20:28 ` Alison Schofield
2025-03-11 9:51 ` Shiju Jose
2025-02-27 22:38 ` [PATCH 2/8] cxl/memfeature: Add CXL memory device patrol scrub control feature shiju.jose
2025-03-07 1:39 ` Dan Williams
2025-03-07 19:53 ` Alison Schofield
2025-02-27 22:38 ` [PATCH 3/8] cxl/memfeature: Add CXL memory device ECS " shiju.jose
2025-03-07 1:01 ` Jonathan Cameron
2025-03-07 2:46 ` Fan Ni
2025-03-08 1:48 ` Alison Schofield
2025-02-27 22:38 ` [PATCH 4/8] cxl/mbox: Add support for PERFORM_MAINTENANCE mailbox command shiju.jose
2025-02-27 22:38 ` [PATCH 5/8] cxl/region: Add helper function to determine memory is online shiju.jose
2025-03-07 22:01 ` Alison Schofield
2025-02-27 22:38 ` [PATCH 6/8] cxl: Support for finding memory operation attributes from the current boot shiju.jose
2025-03-08 2:09 ` Alison Schofield
2025-02-27 22:38 ` [PATCH 7/8] cxl/memfeature: Add CXL memory device soft PPR control feature shiju.jose
2025-03-07 1:04 ` Jonathan Cameron
2025-02-27 22:38 ` [PATCH 8/8] cxl/memfeature: Add CXL memory device memory sparing " shiju.jose
2025-03-07 1:11 ` Jonathan Cameron [this message]
2025-03-07 23:32 ` Alison Schofield
2025-03-08 2:35 ` Alison Schofield
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250307091137.00006a0a@huawei.com \
--to=jonathan.cameron@huawei.com \
--cc=Jon.Grimm@amd.com \
--cc=Vilas.Sridharan@amd.com \
--cc=Yazen.Ghannam@amd.com \
--cc=alison.schofield@intel.com \
--cc=bp@alien8.de \
--cc=dan.j.williams@intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=dave.jiang@intel.com \
--cc=dave@stgolabs.net \
--cc=david@redhat.com \
--cc=dferguson@amperecomputing.com \
--cc=duenwen@google.com \
--cc=erdemaktas@google.com \
--cc=gthelen@google.com \
--cc=ira.weiny@intel.com \
--cc=james.morse@arm.com \
--cc=jiaqiyan@google.com \
--cc=jthoughton@google.com \
--cc=kangkang.shen@futurewei.com \
--cc=lenb@kernel.org \
--cc=leo.duran@amd.com \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-edac@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linuxarm@huawei.com \
--cc=mchehab@kernel.org \
--cc=naoya.horiguchi@nec.com \
--cc=nifan.cxl@gmail.com \
--cc=pgonda@google.com \
--cc=prime.zeng@hisilicon.com \
--cc=rafael@kernel.org \
--cc=rientjes@google.com \
--cc=roberto.sassu@huawei.com \
--cc=shiju.jose@huawei.com \
--cc=somasundaram.a@hpe.com \
--cc=tanxiaofei@huawei.com \
--cc=tony.luck@intel.com \
--cc=vishal.l.verma@intel.com \
--cc=wanghuiqiang@huawei.com \
--cc=wbs@os.amperecomputing.com \
--cc=wschwartz@amperecomputing.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.