linux-kernel.vger.kernel.org archive mirror
* [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers
@ 2025-01-06 12:09 shiju.jose
  2025-01-06 12:09 ` [PATCH v18 01/19] EDAC: Add support for EDAC device features control shiju.jose
                   ` (20 more replies)
  0 siblings, 21 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:09 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Previously known as "ras: scrub: introduce subsystem + CXL/ACPI-RAS2 drivers".

Augmenting EDAC for controlling RAS features
============================================
This series proposes expanding EDAC to control RAS features and to
expose feature control attributes to userspace via sysfs.
Some examples:
 - Scrub control
 - Error Check Scrub (ECS) control
 - ACPI RAS2 features
 - Post Package Repair (PPR) control
 - Memory Sparing Repair control etc.

The high-level design is illustrated in the following diagram.
 
         _______________________________________________
        |   Userspace - Rasdaemon                       |
        |  ____________                                 |
        | | RAS CXL    |       _______________          | 
        | | Err Handler|----->|               |         |
        | |____________|      | RAS Dynamic   |         |
        |  ____________       | Scrub, Memory |         |
        | | RAS Memory |----->| Repair Control|         |
        | | Err Handler|      |_______________|         |
        | |____________|           |                    |
        |__________________________|____________________|                              
                                   |
                                   |
    _______________________________|______________________________
   |   Kernel EDAC based SubSystem | for RAS Features Control     |
   | ______________________________|____________________________  |
   || EDAC Core          Sysfs EDAC| Bus                        | |
   ||    __________________________|________ _    _____________ | |
   ||   |/sys/bus/edac/devices/<dev>/scrubX/ |   | EDAC Device || |
   ||   |/sys/bus/edac/devices/<dev>/ecsX/   |<->| EDAC MC     || |
   ||   |/sys/bus/edac/devices/<dev>/repairX |   | EDAC Sysfs  || |
   ||   |____________________________________|   |_____________|| |
   ||                               | EDAC Bus                  | |
   ||               Get             |       Get                 | |
   ||    __________ Features       |   Features __________    | |
   ||   |          |Descs  _________|______ Descs|          |   | |
   ||   |EDAC Scrub|<-----| EDAC Device    |     | EDAC Mem |   | |
   ||   |__________|      | Driver- RAS    |---->| Repair   |   | |
   ||    __________       | Feature Control|     |__________|   | |
   ||   |          |<-----|________________|                    | |
   ||   |EDAC ECS  |   Register RAS | Features                  | |
   ||   |__________|                |                           | |
   ||         ______________________|_________                  | |
   ||_________|_____________|________________|__________________| |
   |   _______|____    _____|_________   ____|_________           |
   |  |            |  | CXL Mem Driver| | Client Driver|          |
   |  | ACPI RAS2  |  | Sparing, PPR, | | Mem Repair   |          |
   |  | Driver     |  | Scrub, ECS    | | Features     |          |
   |  |____________|  |_______________| |______________|          |
   |        |              |              |                       |
   |________|______________|______________|_______________________|
            |              |              |                     
     _______|______________|______________|_______________________
    |     __|______________|_ ____________|____________ ____      |
    |    |                                                  |     |
    |    |            Platform HW and Firmware              |     |
    |    |__________________________________________________|     |
    |_____________________________________________________________|                             

1. EDAC RAS feature components - create feature-specific descriptors,
   for example EDAC scrub, EDAC ECS and EDAC memory repair in the above
   diagram.
2. EDAC device driver for controlling RAS features - gets a feature's
   attribute descriptors from the EDAC RAS feature components, registers
   the device's RAS features with the EDAC bus and exposes the features'
   sysfs attributes under the EDAC bus.
3. RAS dynamic scrub controller - a sample userspace module added to
   rasdaemon for scrub control, to issue scrubbing when an excessive
   number of memory errors is reported in a short span of time.

The added EDAC feature specific components (e.g. EDAC scrub, EDAC ECS,
EDAC memory repair etc.) call back into the parent driver (e.g. CXL
driver, ACPI RAS driver etc.) for the controls, rather than letting the
caller deal with it, for the following reasons.
1. It enforces a common API across multiple implementations. Review
   alone can also do that, but that has generally not gone well in the
   long run for subsystems which tried it (several later moved to
   callback and feature-list based approaches).
2. It gives a path for 'intercepting' in the EDAC feature driver.
   An example for this is that we could intercept PPR repair calls
   and sanity check that the memory in question is offline before
   passing back to the underlying code.  Sure we could rely on doing
   that via some additional calls from the parent driver, but the
   ABI will get messier.
3. (Speculative) we may get in-kernel users of some features in the
   long run.

More details of the common RAS features are described in the following
sections.

Memory Scrubbing
================
Increasing DRAM size and density has made memory subsystem reliability
an important concern. Memory devices are used where potentially
corrupted data could cause expensive or fatal issues. Memory errors are
among the top hardware failures that cause server and workload crashes.

Memory scrub is a feature where an ECC engine reads data from
each memory media location, corrects with an ECC if necessary and
writes the corrected data back to the same memory media location.

Memory DIMMs can be scrubbed at a configurable rate to detect
uncorrected memory errors and attempt to recover from them, providing
the following benefits.
- Proactively scrubbing memory DIMMs reduces the chance of a correctable
  error becoming uncorrectable.
- Once detected, uncorrected errors caught in unallocated memory pages are
  isolated and prevented from being allocated to an application or the OS.
- The probability of software/hardware products encountering memory
  errors is reduced.
Some details of background can be found in Reference [5].

There are two types of memory scrubbing:
1. Background (patrol) scrubbing of the RAM whilst the RAM is otherwise
   idle.
2. On-demand scrubbing for a specific address range/region of memory.

Several types of interfaces to HW memory scrubbers have been
identified, such as ACPI NVDIMM ARS (Address Range Scrub), CXL memory
device patrol scrub, CXL DDR5 ECS and ACPI RAS2 memory scrubbing.

The scrub control varies between different memory scrubbers. To allow
for standard userspace tooling there is a need to present these controls
with a standard ABI.

Introduce a generic EDAC scrub control which allows the user to control
underlying scrubbers in the system via a generic sysfs scrub control
interface. The common sysfs scrub control interface abstracts the
control of arbitrary scrubbing functionality into a common set of
functions.

Use case of common scrub control feature
========================================
1. Several types of interfaces to HW memory scrubbers have been
   identified, such as ACPI NVDIMM ARS (Address Range Scrub), CXL memory
   device patrol scrub, CXL DDR5 ECS, ACPI RAS2 memory scrubbing and a
   software based memory scrubber (discussed in the community,
   Reference [5]). Some scrubbers support controlling (background)
   patrol scrubbing (ACPI RAS2, CXL) and/or on-demand scrubbing
   (ACPI RAS2, ACPI ARS). However, the scrub controls vary between
   memory scrubbers. Thus there is a requirement for standard generic
   sysfs scrub controls exposed to userspace for seamless control of
   the HW/SW scrubbers in the system by admins/scripts/tools etc.
2. Scrub controls in user space allow the user to disable scrubbing
   when disabling background patrol scrubbing or changing the scrub
   rate is needed for other purposes, such as performance-aware
   operations which require the background operations to be turned off
   or reduced.
3. Allows on-demand scrubbing of a specific address range, if supported
   by the scrubber.
4. User space tools can scrub the memory DIMMs regularly at a
   configurable scrub rate using the sysfs scrub controls discussed
   above, which helps
   - to detect uncorrectable memory errors early, before the user
     accesses the memory, so that detected errors can be recovered, and
   - to reduce the chance of a correctable error becoming uncorrectable.
5. Policy control for hotplugged memory. There is not necessarily a
   system wide BIOS or similar in the loop to control the scrub settings
   on a CXL device that wasn't there at boot. What that setting should
   be is a policy decision, as we are trading off reliability vs
   performance - hence it should be in the control of userspace. As
   such, 'an' interface is needed. It seems more sensible to try to
   unify it with other similar interfaces than to spin yet another one.

A draft version of userspace code for dynamic scrub control, based on
the frequency of memory errors reported to userspace, has been added to
rasdaemon and tested with the CXL device patrol scrubbing feature and
the ACPI RAS2 scrubbing feature.

https://github.com/shijujose4/rasdaemon/tree/ras_feature_control

ToDo: For memory repair features, such as PPR and memory sparing,
rasdaemon collates records and decides to replace a row if there are
lots of corrected errors, a single uncorrected error, or an error record
received with the maintenance request flag set, as in some CXL event
records.

Comparison of scrubbing features
================================
 ................................................................
 .              .   ACPI    . CXL patrol.  CXL ECS  .  ARS      .
 .  Name        .   RAS2    . scrub     .           .           .
 ................................................................
 .              .           .           .           .           .
 . On-demand    . Supported . No        . No        . Supported .
 . Scrubbing    .           .           .           .           .
 .              .           .           .           .           .  
 ................................................................
 .              .           .           .           .           .
 . Background   . Supported . Supported . Supported . No        .
 . scrubbing    .           .           .           .           .
 .              .           .           .           .           .
 ................................................................
 .              .           .           .           .           .
 . Mode of      . Scrub ctrl. per device. per memory.  Unknown  .
 . scrubbing    . per NUMA  .           . media     .           .
 .              . domain.   .           .           .           .
 ................................................................
 .              .           .           .           .           . 
 . Query scrub  . Supported . Supported . Supported . Supported .       
 . capabilities .           .           .           .           .
 .              .           .           .           .           .
 ................................................................
 .              .           .           .           .           . 
 . Setting      . Supported . No        . No        . Supported .       
 . address range.           .           .           .           .
 .              .           .           .           .           .
 ................................................................
 .              .           .           .           .           . 
 . Setting      . Supported . Supported . No        . No        .       
 . scrub rate   .           .           .           .           .
 .              .           .           .           .           .
 ................................................................
 .              .           .           .           .           . 
 . Unit for     . Not       . in hours  . No        . No        .       
 . scrub rate   . Defined   .           .           .           .
 .              .           .           .           .           .
 ................................................................
 .              . Supported .           .           .           .
 . Scrub        . on-demand . No        . No        . Supported .
 . status/      . scrubbing .           .           .           .
 . Completion   . only      .           .           .           .
 ................................................................
 . UC error     .           .CXL general.CXL general. ACPI UCE  .
 . reporting    . Exception .media/DRAM .media/DRAM . notify and.
 .              .           .event/media.event/media. query     .
 .              .           .scan?      .scan?      . ARS status.
 ................................................................
 .              .           .           .           .           .      
 . Clear UC     .  No       . No        .  No       . Supported .
 . error        .           .           .           .           .
 .              .           .           .           .           .  
 ................................................................
 .              .           .           .           .           .
 . Translate    . No        . No        . No        . Supported .
 . *(1)SPA to   .           .           .           .           .
 . *(2)DPA      .           .           .           .           .  
 ................................................................

*(1) - SPA - System Physical Address. See section 9.19.7.8
       Function Index 5 - Translate SPA of ACPI spec r6.5.  
*(2) - DPA - Device Physical Address. See section 9.19.7.8
       Function Index 5 - Translate SPA of ACPI spec r6.5.  

CXL Memory Scrubbing features
=============================
CXL spec r3.1 section 8.2.9.9.11.1 describes the memory device patrol
scrub control feature. The device patrol scrub proactively locates and
makes corrections to errors in a regular cycle. The patrol scrub control
allows the requester to configure the patrol scrubber's input
configurations.

The patrol scrub control allows the requester to specify the number of
hours in which the patrol scrub cycles must be completed, provided that
the requested number is not less than the minimum number of hours for the
patrol scrub cycle that the device is capable of. In addition, the patrol
scrub controls allow the host to disable and enable the feature in case
disabling of the feature is needed for other purposes such as
performance-aware operations which require the background operations to be
turned off.

The Error Check Scrub (ECS) is a feature defined in JEDEC DDR5 SDRAM
Specification (JESD79-5) and allows the DRAM to internally read, correct
single-bit errors, and write back corrected data bits to the DRAM array
while providing transparency to error counts.

A DDR5 device contains a number of memory media FRUs. The DDR5 ECS
feature, and thus the ECS control driver, supports configuring the ECS
parameters per FRU.

ACPI RAS2 Hardware-based Memory Scrubbing
=========================================
ACPI spec r6.5 section 5.2.21 describes the ACPI RAS2 table, which
provides interfaces for platform RAS features and supports independent
RAS controls and capabilities for a given RAS feature for multiple
instances of the same component in a given system.
Memory RAS features apply to RAS capabilities, controls and operations
that are specific to memory. RAS2 PCC sub-spaces for memory-specific RAS
features have a Feature Type of 0x00 (Memory).

The platform can use the hardware-based memory scrubbing feature to expose
controls and capabilities associated with hardware-based memory scrub
engines. The RAS2 memory scrubbing feature supports the following, as
per the spec:
 - Independent memory scrubbing controls for each NUMA domain,
   identified using its proximity domain.
   Note: However, Ampere Computing has a single entry repeated, as they
         have centralized controls.
 - Provision for background (patrol) scrubbing of the entire memory system,
   as well as on-demand scrubbing for a specific region of memory.

ACPI Address Range Scrubbing (ARS)
==================================
ARS allows the platform to communicate memory errors to system software.
This capability allows system software to prevent accesses to addresses
with uncorrectable errors in memory. ARS functions manage all NVDIMMs
present in the system. Only one scrub can be in progress system wide
at any given time.
The following functions are supported, as per the specification:
1. Query ARS Capabilities for a given address range; also indicates
   whether the platform supports the ACPI NVDIMM Root Device Unconsumed
   Error Notification.
2. Start ARS triggers an Address Range Scrub for the given memory range.
   Address scrubbing can be done for volatile memory, persistent memory,
   or both.
3. Query ARS Status command allows software to get the status of ARS,  
   including the progress of ARS and ARS error record.
4. Clear Uncorrectable Error.
5. Translate SPA
6. ARS Error Inject etc.
Note: Support for ARS is not added in this series, to reduce the amount
of code for review; it could be added after the initial code is merged.
We'd like feedback on whether this is of interest to the ARS community.

Post Package Repair(PPR)
========================
PPR (Post Package Repair) maintenance operation requests the memory device
to perform a repair operation on its media if supported. A memory device
may support two types of PPR: Hard PPR (hPPR), for a permanent row repair,
and Soft PPR (sPPR), for a temporary row repair. sPPR is much faster than
hPPR, but the repair is lost with a power cycle. During the execution of a
PPR maintenance operation, a memory device may or may not retain data
and may or may not be able to process memory requests correctly. An sPPR
maintenance operation may be executed at runtime, if data is retained
and memory requests are correctly processed. An hPPR maintenance
operation may be executed only at boot because data would not be
retained.

Use cases of common PPR control feature
=======================================
1. Soft PPR (sPPR) and Hard PPR (hPPR) share similar control interfaces,
thus there is a requirement for standard generic sysfs PPR controls
exposed to userspace for seamless control of the PPR features in the
system by the admin/scripts/tools etc.
2. When a CXL device identifies a failure on a memory component, the
device may inform the host of the need for a PPR maintenance operation
by using an event record with the maintenance needed flag set. The event
record specifies the DPA that should be repaired. The kernel reports the
corresponding CXL general media or DRAM trace event to userspace. A
userspace tool, e.g. rasdaemon, initiates a PPR maintenance operation in
response to the device request using the sysfs PPR control.
3. User space tools, e.g. rasdaemon, request PPR on a memory region when
an uncorrected memory error or excess corrected memory errors are
reported on that memory.
4. Multiple instances of PPR are likely present per memory device.

Memory Sparing
==============
Memory sparing is defined as a repair function that replaces a portion of
memory with a portion of functional memory at that same DPA. User space
tool, e.g. rasdaemon, may request the sparing operation for a given
address for which the uncorrectable error is reported. In CXL,
(CXL spec 3.1 section 8.2.9.7.1.4) subclasses for sparing operation vary
in terms of the scope of the sparing being performed. The cacheline sparing
subclass refers to a sparing action that can replace a full cacheline.
Row sparing is provided as an alternative to PPR sparing functions and its
scope is that of a single DDR row. Bank sparing allows an entire bank to
be replaced. Rank sparing is defined as an operation in which an entire
DDR rank is replaced.

The series adds:
1. The EDAC device driver extended for controlling RAS features; the
   EDAC scrub driver, EDAC ECS driver and EDAC memory repair driver
   support memory scrub control, ECS control and memory repair (PPR,
   sparing) control respectively.
2. Several common patches from Dave's cxl/fwctl series.
3. Support for CXL feature mailbox commands, which are used by the CXL
   device scrubbing and memory repair features.
4. A CXL features driver supporting patrol scrub control (device and
   region based).
5. CXL features driver supporting ECS control feature.
6. An ACPI RAS2 driver which adds an OS interface for RAS2
   communication through the PCC mailbox, extracts the ACPI RAS2
   feature table (RAS2) and creates a platform device for the RAS
   memory features, which binds to the memory ACPI RAS2 driver.
7. A memory ACPI RAS2 driver which gets the PCC subspace for
   communicating with an ACPI compliant platform that supports ACPI
   RAS2, adds callback functions and registers with the EDAC device to
   allow the user to control the HW patrol scrubbers exposed to the
   kernel via the ACPI RAS2 table.
8. Support for CXL maintenance mailbox command, which is used by
   CXL device memory repair feature.   
9. CXL features driver supporting PPR control feature.
10. CXL features driver supporting memory sparing control feature.
    Note: There are other PPR, memory sparing drivers to come.

Open questions based on feedback from the community:
1. Leo: Standardize the unit for the scrub rate; for example, ACPI RAS2
   does not define a unit for the scrub rate. RAS2 clarification needed.
2. Jonathan:
   - Any need for discoverability of the capability to scan different
     regions, such as global PA space, to userspace? Left as a future
     extension.
   - For EDAC memory repair, is a control attribute for granularity
     (cacheline/row/bank/rank) needed?

3. Jiaqi:
   - STOP_PATROL_SCRUBBER from RAS2 must be blocked and must not be
     exposed to the OS/userspace. Stopping the patrol scrubber is
     unacceptable for platforms where the OEM has enabled the patrol
     scrubber, because the patrol scrubber is a key part of logging and
     is repurposed for other RAS actions.
   If the OEM does not want to expose this control, they should lock it
   down so the interface is not exposed to the OS. These features are
   optional after all.
   - "Requested Address Range"/"Actual Address Range" (region to scrub)
     is a similarly bad thing to expose in RAS2.
   If the OEM does not want to expose this, they should lock it down so
   the interface is not exposed to the OS. These features are optional
   after all.
   As per the LPC discussion, support for stop, and attributes for the
   address range, are to be exposed to userspace.
4. Borislav:
   - How will the scrub control exposed to userspace be used?
     A POC was added in rasdaemon with dynamic scrub control for CXL
     memory media errors and memory errors reported to userspace.
     https://github.com/shijujose4/rasdaemon/tree/scrub_control_6_june_2024
   - Is the scrub interface sufficient for the use cases?
   - Who is going to use the scrub controls: tools/admin/scripts?
     1) Rasdaemon for dynamic control.
     2) A udev script for more static 'defaults' on hotplug etc.
5. PPR
   - For PPR, rasdaemon collates records and decides to replace a row
     if there are lots of corrected errors, a single uncorrected error,
     or an error record received with the maintenance request flag set,
     as in the CXL DRAM error record.
   - Is sPPR more or less startup only (so faking hPPR), or actually
     useful in a running system (if not the safe version that keeps
     everything running whilst replacement is ongoing)?
   - Is future proofing for multiple PPR units useful, given we've
     mashed together hPPR and sPPR for CXL?

Implementation
==============
1. Linux kernel
Version 18 of the kernel implementation of RAS features control is available in,
https://github.com/shijujose4/linux.git
Branch: edac-enhancement-ras-features_v18

Note: Took updated patches for the CXL feature infrastructure and
   feature commands from Dave's cxl/features branch.
   https://git.kernel.org/pub/scm/linux/kernel/git/djiang/linux.git/log/?h=cxl/features

   Apologies to Dave for not waiting long enough for permission to send
   out his patches in this series, due to the rush.

2. QEMU emulation
The QEMU emulation for the CXL RAS features is available in,
https://gitlab.com/shiju.jose/qemu.git
Branch: cxl-ras-features-2024-10-24

3. Userspace rasdaemon
A draft version of userspace sample code for dynamic scrub control,
based on the frequency of memory errors reported to userspace, has been
added to rasdaemon, and tested with the CXL device patrol scrubbing
feature and the ACPI RAS2 scrubbing feature. This required updating for
the latest sysfs scrub interface.
https://github.com/shijujose4/rasdaemon/tree/ras_feature_control

ToDo: For PPR, rasdaemon collates records and decides to replace a row
if there are lots of corrected errors, a single uncorrected error, or an
error record received with the maintenance request flag set, as in the
CXL DRAM error record.
  
References:
1. ACPI spec r6.5 section 5.2.21 ACPI RAS2.
2. ACPI spec r6.5 section 9.19.7.2 ARS.
3. CXL spec r3.1 section 8.2.9.9.11.1 Device patrol scrub control
   feature.
4. CXL spec r3.1 section 8.2.9.9.11.2 DDR5 ECS feature.
5. CXL spec r3.1 section 8.2.9.7.1.1 PPR Maintenance Operations.
6. CXL spec r3.1 section 8.2.9.7.2.1 sPPR Feature Discovery and
   Configuration.
7. CXL spec r3.1 section 8.2.9.7.2.2 hPPR Feature Discovery and
   Configuration.
8. Background information about kernel support for memory scan, memory
   error detection and ACPI RASF.
   https://lore.kernel.org/all/20221103155029.2451105-1-jiaqiyan@google.com/
9. Discussions on RASF:
   https://lore.kernel.org/lkml/20230915172818.761-1-shiju.jose@huawei.com/#r 

Changes
=======
v17 -> v18:
1. Rebased to kernel version 6.13-rc5.
2. Reordered patches per feedback from Jonathan on v17.
3.
3.1 Took updated patches for the CXL feature infrastructure and feature
   commands from Dave's cxl/features branch.
   https://git.kernel.org/pub/scm/linux/kernel/git/djiang/linux.git/log/?h=cxl/features
   Updated, debugged and tested the CXL RAS features.

   Apologies to Dave for not waiting long enough for permission to send
   out his patches in this series, due to the rush.

3.2 RAS features in cxl/core/memfeature.c updated for interface
    changes in the CXL feature commands.
4. Modified the ACPI RAS2 code for the recent interface changes in the
   PCC mbox code.

v16 -> v17:
1.
1.1 Took several patches for CXL feature commands from Dave's
    fwctl/cxl series and added fixes pointed out by Jonathan in those
    patches.
1.2 cxl/core/memfeature.c updated for interface changes in the
    Get Supported Features, Get Feature and Set Feature functions.
1.3 Used the UUIDs for RAS features in the CXL features code from
    include/cxl/features.h
2. Changes based on feedback from Boris:
 - Added attributes in EDAC memory repair to return the range for DPA
   and other control attributes, and added callback functions for the
   DPA range in the CXL PPR and memory sparing code, which is the only
   one supported in CXL.
 - Removed the 'query' attribute for the memory repair feature.

v15 -> v16:
1. Changes and fixes for feedback from Boris:
 - Modified the documentation and interface file for EDAC memory repair
   to add more details and use cases.
 - Merged documentation into the corresponding patches, instead of a
   common patch with the full documentation, for better readability.
 - Removed the 'persist_mode_avail' attribute for the memory repair
   feature.
2. Changes for feedback from Dave Jiang:
 - Dave suggested a helper function for ECS initialization in
   cxl/core/memfeature.c, which was added for all CXL RAS features:
   scrub, ECS, PPR and memory sparing.
 - Fixed the endian conversion pointed out by Dave in CXL memory
   sparing. Also fixed similar issues in the CXL scrub, ECS and PPR
   features.
3. Changes for feedback from Ni Fan:
 - Fixed a memory leak in edac_dev_register() for the memory repair
   feature and addressed a few suggestions from Ni Fan.

v14 -> v15:
1. Changes and fixes for feedback from Boris:
  - Added documentation for EDAC features, scrub, memory_repair etc.
    and placed it in a separate patch.
  - Deleted two extra attributes for EDAC ECS: log_entry_type_per_* and
    mode_counts_*.
  - Resolved issues reported in Documentation/ABI/testing/sysfs-edac-ecs.
  - Deleted unused pr_fmt() from a few files.
  - Fixed some formatting issues in the EDAC ECS code and similar issues
    in other files, etc.
2. Change for feedback from Dave Jiang:
  - In the CXL code for patrol scrub control, Dave suggested replacing
    void *drv_data with a union of parameters in cxl_ps_get_attrs() and
    similar functions.
    This was fixed by replacing void *drv_data with the corresponding
    context structure (struct cxl_patrol_scrub_context) in the CXL local
    functions, as struct cxl_patrol_scrub_context can't be visible in
    the generic EDAC control interface. Similar changes were made for
    the CXL ECS, CXL PPR and CXL memory sparing local functions.

v13 -> v14:
1. Changes and Fixes for feedback from Boris
  - Check grammar of patch description.
  - Changed scrub control attributes for memory scrub range to "addr" and "size".
  - Fixed unreached code in edac_dev_register(). 
  - Removed enable_on_demand attribute from EDAC scrub control and modified
    RAS2 driver for the same.
  - Updated ABI documentation for EDAC scrub control.
    etc.

2. Changes for feedback from Greg/Rafael/Jonathan for ACPI RAS2
  - Replaced platform device creation and binding with
    auxiliary device creation and binding with ACPI RAS2
    memory auxiliary driver.

3. Changes and Fixes for feedback from Jonathan
  - Fixed unreached code in edac_dev_register(). 
  - Optimize callback functions in CXL ECS using macros.
  - Add readback attributes for the EDAC memory repair feature
    and add support in the CXL driver for PPR and memory sparing.
  - Add refactoring in the CXL driver for PPR and memory sparing
    for query/repair maintenance commands.
  - Add cxl_dpa_to_region_locked() function.  
  - Some more cleanups in the ACPI RAS2 and RAS2 memory drivers.
    etc.

4. Changes and fixes for feedback from Ni Fan:
   - Fixed a compilation error: cxl_mem_ras_features_init() refined for
     when the CXL components are built as modules.

5. Optimize callback functions in CXL memory sparing using macros.
   etc.
   
v12 -> v13:
1. Changes and Fixes for feedback from Boris
  - Function edac_dev_feat_init() merge with edac_dev_register()
  - Add macros in EDAC feature specific code for repeated code.
  - Correct spelling mistakes.
  - Removed feature specific code from the patch "EDAC: Add support
    for EDAC device features control"
2. Changes for feedbacks from Dave Jiang
   - Move fields num_features and entries to struct cxl_mailbox,
     in "cxl: Add Get Supported Features command for kernel usage"
   - Use series from 
     https://lore.kernel.org/linux-cxl/20240905223711.1990186-1-dave.jiang@intel.com/   
3. Changes and Fixes for feedback from Ni Fan
   - In documentation, renamed scrub* to scrubX and ecs_fru* to ecs_fruX.
   - Corrected some grammar mistakes in the patch headers.
   - Fixed an error print for min_scrub_cycle_hrs in the CXL patrol scrub
     code.
   - Improved an error print in the CXL ECS code.
   - bool -> tristate for config CXL_RAS_FEAT
4. Add support for CXL memory sparing feature.
5. Add common EDAC memory repair driver for controlling memory repair
   features, PPR, memory sparing etc.

v11 -> v12:
1. Changes and Fixes for feedback from Boris mainly for
    patch "EDAC: Add support for EDAC device features control"
    and other generic comments.

2. Took CXL patches from Dave Jiang for "Add Get Supported Features
   command for kernel usage" and other related patches. Merged helper
   functions from this series to the above patch. Modifications of
   CXL code in this series due to refactoring of CXL mailbox in Dave's
   patches.

3. Modified EDAC scrub control code to support multiple scrub instances
   per device.

v10 -> v11:
1. Feedback from Borislav:
   - Add generic EDAC code for controlling device features to
     drivers/edac/edac_device.c.
   - Add common structure in edac for device feature's data.
   
2. Some more optimizations in generic EDAC code for control
   device features.

3. Changes for feedback from Fan for ACPI RAS2 memory driver.

4. Add support for controlling memory PPR (Post Package Repair) features
   in EDAC.
   
5. Add support for the maintenance command in the CXL mailbox code,
   which is needed to support PPR features in the CXL driver.

6. Add support for controlling memory PPR (Post Package Repair) features
   and performing the PPR maintenance operation in the CXL driver.

7. Rename drivers/cxl/core/memscrub.c to drivers/cxl/core/memfeature.c

v9 -> v10:
1. Feedback from Mauro Carvalho Chehab:
   - Changes suggested in EDAC RAS feature driver.
     use uppercase for enums, change if...else to switch-case, add
     documentation for the static scrub and ECS init functions, etc.
   - Changes suggested in EDAC scrub.
     unit of scrub cycle hour to seconds.
     renamed attribute node cycle_in_hours_available to min_cycle_duration
     and max_cycle_duration.
     renamed attribute node cycle_in_hours to current_cycle_duration.
     Use base 0 for kstrtou64() and kstrtol() functions.
     etc.
   - Changes suggested in EDAC ECS.
     use uppercase for enums,
     add ABI documentation, etc.
        
2. Feedback from Fan:
   - Changes suggested in EDAC RAS feature driver.
     use uppercase for enums, change if...else to switch-case. 
     some optimization in edac_ras_dev_register() function
     add missing goto free_ctx
   - Changes suggested in the code for feature commands.  
   - CXL driver scrub and ECS code
     use uppercase for enums, fix typo, use enum type for mode
     fix long lines, etc.
       
v8 -> v9:
1. Feedback from Borislav:
   - Added the scrub control driver to EDAC.
   - Made the DEVICE_ATTR_..() declarations static.
   - Changed the write permissions for scrub control sysfs files to
     root-only.
2. Feedback from Fan:
   - Optimized cxl_get_feature() function by using min() and removed
     feat_out_min_size.
   - Removed unreached return from cxl_set_feature() function.
   - Changed the term  "rate" to "cycle_in_hours" in all the
     scrub control code.
   - Allow cxl_mem_probe() to continue if cxl_mem_patrol_scrub_init() fails,
     with just a debug warning.
      
3. Feedback from Jonathan:
   - Removed the patch adding a __free()-based cleanup function for
     acpi_put_table() and added the fix in the ACPI RAS2 driver.

4. Feedback from Dan Williams:
   - Allow cxl_mem_probe() to continue if cxl_mem_patrol_scrub_init() fails,
     with just a debug warning.
   - Add support for CXL region based scrub control.

5. Feedback from Daniel Ferguson on RAS2 drivers:
    In the ACPI RAS2 driver,
  - Incorporated the changes given for clearing reported errors.
  - Incorporated the changes given for checking the Set RAS Capability
    status and returning an appropriate error.
    In the RAS2 memory driver,
  - Added more checks for start/stop background and on-demand scrubbing
    so that the cached address range does not get cleared, and restricted
    the permitted operations during scrubbing.

History for v1 to v8 is available here.
https://lore.kernel.org/lkml/20240726160556.2079-1-shiju.jose@huawei.com/



Dave Jiang (6):
  cxl: Refactor user ioctl command path from mds to mailbox
  cxl: Add skeletal features driver
  cxl: Enumerate feature commands
  cxl: Add Get Supported Features command for kernel usage
  cxl: Add features driver attribute to emit number of features
    supported
  cxl: Setup exclusive CXL features that are reserved for the kernel

Shiju Jose (13):
  EDAC: Add support for EDAC device features control
  EDAC: Add scrub control feature
  EDAC: Add ECS control feature
  EDAC: Add memory repair control feature
  ACPI:RAS2: Add ACPI RAS2 driver
  ras: mem: Add memory ACPI RAS2 driver
  cxl/mbox: Add GET_FEATURE mailbox command
  cxl/mbox: Add SET_FEATURE mailbox command
  cxl/memfeature: Add CXL memory device patrol scrub control feature
  cxl/memfeature: Add CXL memory device ECS control feature
  cxl/mbox: Add support for PERFORM_MAINTENANCE mailbox command
  cxl/memfeature: Add CXL memory device soft PPR control feature
  cxl/memfeature: Add CXL memory device memory sparing control feature

 Documentation/ABI/testing/sysfs-edac-ecs      |   63 +
 .../ABI/testing/sysfs-edac-memory-repair      |  244 +++
 Documentation/ABI/testing/sysfs-edac-scrub    |   74 +
 Documentation/edac/features.rst               |  102 ++
 Documentation/edac/index.rst                  |   12 +
 Documentation/edac/memory_repair.rst          |  249 +++
 Documentation/edac/scrub.rst                  |  393 ++++
 drivers/acpi/Kconfig                          |   11 +
 drivers/acpi/Makefile                         |    1 +
 drivers/acpi/ras2.c                           |  407 ++++
 drivers/cxl/Kconfig                           |   25 +
 drivers/cxl/Makefile                          |    3 +
 drivers/cxl/core/Makefile                     |    2 +
 drivers/cxl/core/core.h                       |    7 +-
 drivers/cxl/core/features.c                   |  287 +++
 drivers/cxl/core/mbox.c                       |  167 +-
 drivers/cxl/core/memdev.c                     |   22 +-
 drivers/cxl/core/memfeature.c                 | 1631 +++++++++++++++++
 drivers/cxl/core/port.c                       |    3 +
 drivers/cxl/core/region.c                     |    6 +
 drivers/cxl/cxl.h                             |    3 +
 drivers/cxl/cxlmem.h                          |   67 +-
 drivers/cxl/features.c                        |  215 +++
 drivers/cxl/mem.c                             |    5 +
 drivers/cxl/pci.c                             |   19 +
 drivers/edac/Makefile                         |    1 +
 drivers/edac/ecs.c                            |  207 +++
 drivers/edac/edac_device.c                    |  183 ++
 drivers/edac/mem_repair.c                     |  492 +++++
 drivers/edac/scrub.c                          |  209 +++
 drivers/ras/Kconfig                           |   10 +
 drivers/ras/Makefile                          |    1 +
 drivers/ras/acpi_ras2.c                       |  385 ++++
 include/acpi/ras2_acpi.h                      |   45 +
 include/cxl/features.h                        |  171 ++
 include/cxl/mailbox.h                         |   45 +-
 include/linux/edac.h                          |  238 +++
 tools/testing/cxl/Kbuild                      |    1 +
 38 files changed, 5909 insertions(+), 97 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-edac-ecs
 create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
 create mode 100644 Documentation/ABI/testing/sysfs-edac-scrub
 create mode 100644 Documentation/edac/features.rst
 create mode 100644 Documentation/edac/index.rst
 create mode 100644 Documentation/edac/memory_repair.rst
 create mode 100644 Documentation/edac/scrub.rst
 create mode 100755 drivers/acpi/ras2.c
 create mode 100644 drivers/cxl/core/features.c
 create mode 100644 drivers/cxl/core/memfeature.c
 create mode 100644 drivers/cxl/features.c
 create mode 100755 drivers/edac/ecs.c
 create mode 100755 drivers/edac/mem_repair.c
 create mode 100755 drivers/edac/scrub.c
 create mode 100644 drivers/ras/acpi_ras2.c
 create mode 100644 include/acpi/ras2_acpi.h
 create mode 100644 include/cxl/features.h

-- 
2.43.0


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v18 01/19] EDAC: Add support for EDAC device features control
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
@ 2025-01-06 12:09 ` shiju.jose
  2025-01-06 13:37   ` Borislav Petkov
                     ` (2 more replies)
  2025-01-06 12:09 ` [PATCH v18 02/19] EDAC: Add scrub control feature shiju.jose
                   ` (19 subsequent siblings)
  20 siblings, 3 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:09 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Add generic EDAC device feature controls supporting the registration
of RAS features available in the system. The driver exposes control
attributes for these features to userspace in
/sys/bus/edac/devices/<dev-name>/<ras-feature>/
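As a rough illustration of the intended calling convention (not part of this
patch), a client driver would hand edac_dev_register() an array of feature
descriptors from its probe path. The sketch below stubs out the kernel types
so it compiles standalone, and mirrors only the argument validation visible
in this patch; the names here other than edac_dev_register() and the
structures above are stand-ins:

```c
#include <stddef.h>

#define EINVAL 22	/* stand-in for <linux/errno.h> */

/* Stubs mirroring the structures added by this patch (see <linux/edac.h>). */
enum edac_dev_feat { RAS_FEAT_MAX };	/* no concrete features yet */

struct edac_dev_feature {
	enum edac_dev_feat ft_type;
	unsigned char instance;
	void *ctx;
};

/*
 * Stub of edac_dev_register() reproducing only the parameter validation;
 * the real function also allocates the context, builds the attribute
 * group array and registers a device on the EDAC sysfs bus.
 */
static int edac_dev_register(void *parent, const char *name, void *private,
			     int num_features,
			     const struct edac_dev_feature *ras_features)
{
	if (!parent || !name || !num_features || !ras_features)
		return -EINVAL;
	return 0;
}

/* Hypothetical caller, e.g. a memory controller driver's probe path. */
int demo_register(void)
{
	static int fake_parent;	/* stand-in for a struct device */
	struct edac_dev_feature features[] = {
		{ .ft_type = RAS_FEAT_MAX, .instance = 0, .ctx = NULL },
	};

	return edac_dev_register(&fake_parent, "cxl_mem0", NULL, 1, features);
}
```

Note that with the skeleton as posted, every feature type falls through to
the -EINVAL default in the switch until later patches add feature-specific
cases; the stub above returns 0 once the arguments pass validation.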

Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 Documentation/edac/features.rst |  94 ++++++++++++++++++++++++++++++
 Documentation/edac/index.rst    |  10 ++++
 drivers/edac/edac_device.c      | 100 ++++++++++++++++++++++++++++++++
 include/linux/edac.h            |  28 +++++++++
 4 files changed, 232 insertions(+)
 create mode 100644 Documentation/edac/features.rst
 create mode 100644 Documentation/edac/index.rst

diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
new file mode 100644
index 000000000000..f32f259ce04d
--- /dev/null
+++ b/Documentation/edac/features.rst
@@ -0,0 +1,94 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================
+Augmenting EDAC for controlling RAS features
+============================================
+
+Copyright (c) 2024 HiSilicon Limited.
+
+:Author:   Shiju Jose <shiju.jose@huawei.com>
+:License:  The GNU Free Documentation License, Version 1.2
+          (dual licensed under the GPL v2)
+:Original Reviewers:
+
+- Written for: 6.14
+
+Introduction
+------------
+This document describes the expansion of EDAC for controlling RAS features
+and exposing feature control attributes to userspace via sysfs. Some examples:
+
+* Scrub control
+
+* Error Check Scrub (ECS) control
+
+* ACPI RAS2 features
+
+* Post Package Repair (PPR) control
+
+* Memory Sparing Repair control etc.
+
+The high-level design is illustrated in the following diagram::
+
+         _______________________________________________
+        |   Userspace - Rasdaemon                       |
+        |  _____________                                |
+        | | RAS CXL mem |      _______________          |
+        | |error handler|---->|               |         |
+        | |_____________|     | RAS dynamic   |         |
+        |  _____________      | scrub, memory |         |
+        | | RAS memory  |---->| repair control|         |
+        | |error handler|     |_______________|         |
+        | |_____________|          |                    |
+        |__________________________|____________________|
+                                   |
+                                   |
+    _______________________________|______________________________
+   |     Kernel EDAC extension for | controlling RAS Features     |
+   | ______________________________|____________________________  |
+   || EDAC Core          Sysfs EDAC| Bus                        | |
+   ||    __________________________|_________     _____________ | |
+   ||   |/sys/bus/edac/devices/<dev>/scrubX/ |   | EDAC device || |
+   ||   |/sys/bus/edac/devices/<dev>/ecsX/   |<->| EDAC MC     || |
+   ||   |/sys/bus/edac/devices/<dev>/repairX |   | EDAC sysfs  || |
+   ||   |____________________________________|   |_____________|| |
+   ||                           EDAC|Bus                        | |
+   ||                               |                           | |
+   ||    __________ Get feature     |      Get feature          | |
+   ||   |          |desc   _________|______ desc  __________    | |
+   ||   |EDAC scrub|<-----| EDAC device    |     |          |   | |
+   ||   |__________|      | driver- RAS    |---->| EDAC mem |   | |
+   ||    __________       | feature control|     | repair   |   | |
+   ||   |          |<-----|________________|     |__________|   | |
+   ||   |EDAC ECS  |    Register RAS|features                   | |
+   ||   |__________|                |                           | |
+   ||         ______________________|_____________              | |
+   ||_________|_______________|__________________|______________| |
+   |   _______|____    _______|_______       ____|__________      |
+   |  |            |  | CXL mem driver|     | Client driver |     |
+   |  | ACPI RAS2  |  | scrub, ECS,   |     | memory repair |     |
+   |  | driver     |  | sparing, PPR  |     | features      |     |
+   |  |____________|  |_______________|     |_______________|     |
+   |        |                 |                    |              |
+   |________|_________________|____________________|______________|
+            |                 |                    |
+    ________|_________________|____________________|______________
+   |     ___|_________________|____________________|_______       |
+   |    |                                                  |      |
+   |    |            Platform HW and Firmware              |      |
+   |    |__________________________________________________|      |
+   |______________________________________________________________|
+
+
+1. EDAC feature components - create feature-specific descriptors;
+for example, EDAC scrub, EDAC ECS and EDAC memory repair in the above
+diagram.
+
+2. EDAC device driver for controlling RAS features - gets the feature's
+attribute descriptors from the EDAC RAS feature component, registers the
+device's RAS features with the EDAC bus and exposes the feature control
+attributes via sysfs. For example, /sys/bus/edac/devices/<dev-name>/<feature>X/
+
+3. RAS dynamic feature controller - userspace sample modules in rasdaemon for
+dynamic scrub/repair control, issuing scrubbing/repair when an excessive
+number of corrected memory errors is reported in a short span of time.
diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
new file mode 100644
index 000000000000..b6c265a4cffb
--- /dev/null
+++ b/Documentation/edac/index.rst
@@ -0,0 +1,10 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+EDAC Subsystem
+==============
+
+.. toctree::
+   :maxdepth: 1
+
+   features
diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
index 621dc2a5d034..9fce46dd7405 100644
--- a/drivers/edac/edac_device.c
+++ b/drivers/edac/edac_device.c
@@ -570,3 +570,103 @@ void edac_device_handle_ue_count(struct edac_device_ctl_info *edac_dev,
 		      block ? block->name : "N/A", count, msg);
 }
 EXPORT_SYMBOL_GPL(edac_device_handle_ue_count);
+
+static void edac_dev_release(struct device *dev)
+{
+	struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev);
+
+	kfree(ctx->dev.groups);
+	kfree(ctx);
+}
+
+const struct device_type edac_dev_type = {
+	.name = "edac_dev",
+	.release = edac_dev_release,
+};
+
+static void edac_dev_unreg(void *data)
+{
+	device_unregister(data);
+}
+
+/**
+ * edac_dev_register - register device for RAS features with EDAC
+ * @parent: parent device.
+ * @name: parent device's name.
+ * @private: parent driver's data to store in the context if any.
+ * @num_features: number of RAS features to register.
+ * @ras_features: list of RAS features to register.
+ *
+ * Return:
+ *  * %0       - Success.
+ *  * %-EINVAL - Invalid parameters passed.
+ *  * %-ENOMEM - Dynamic memory allocation failed.
+ *
+ */
+int edac_dev_register(struct device *parent, char *name,
+		      void *private, int num_features,
+		      const struct edac_dev_feature *ras_features)
+{
+	const struct attribute_group **ras_attr_groups;
+	struct edac_dev_feat_ctx *ctx;
+	int attr_gcnt = 0;
+	int ret, feat;
+
+	if (!parent || !name || !num_features || !ras_features)
+		return -EINVAL;
+
+	/* Double parse to make space for attributes */
+	for (feat = 0; feat < num_features; feat++) {
+		switch (ras_features[feat].ft_type) {
+		/* Add feature specific code */
+		default:
+			return -EINVAL;
+		}
+	}
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ras_attr_groups = kcalloc(attr_gcnt + 1, sizeof(*ras_attr_groups), GFP_KERNEL);
+	if (!ras_attr_groups) {
+		ret = -ENOMEM;
+		goto ctx_free;
+	}
+
+	attr_gcnt = 0;
+	for (feat = 0; feat < num_features; feat++, ras_features++) {
+		switch (ras_features->ft_type) {
+		/* Add feature specific code */
+		default:
+			ret = -EINVAL;
+			goto groups_free;
+		}
+	}
+
+	ctx->dev.parent = parent;
+	ctx->dev.bus = edac_get_sysfs_subsys();
+	ctx->dev.type = &edac_dev_type;
+	ctx->dev.groups = ras_attr_groups;
+	ctx->private = private;
+	dev_set_drvdata(&ctx->dev, ctx);
+
+	ret = dev_set_name(&ctx->dev, name);
+	if (ret)
+		goto groups_free;
+
+	ret = device_register(&ctx->dev);
+	if (ret) {
+		put_device(&ctx->dev);
+		return ret;
+	}
+
+	return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev);
+
+groups_free:
+	kfree(ras_attr_groups);
+ctx_free:
+	kfree(ctx);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(edac_dev_register);
diff --git a/include/linux/edac.h b/include/linux/edac.h
index b4ee8961e623..521b17113d4d 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -661,4 +661,32 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
 
 	return mci->dimms[index];
 }
+
+#define EDAC_FEAT_NAME_LEN	128
+
+/* RAS feature type */
+enum edac_dev_feat {
+	RAS_FEAT_MAX
+};
+
+/* EDAC device feature information structure */
+struct edac_dev_data {
+	u8 instance;
+	void *private;
+};
+
+struct edac_dev_feat_ctx {
+	struct device dev;
+	void *private;
+};
+
+struct edac_dev_feature {
+	enum edac_dev_feat ft_type;
+	u8 instance;
+	void *ctx;
+};
+
+int edac_dev_register(struct device *parent, char *dev_name,
+		      void *parent_pvt_data, int num_features,
+		      const struct edac_dev_feature *ras_features);
 #endif /* _LINUX_EDAC_H_ */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v18 02/19] EDAC: Add scrub control feature
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
  2025-01-06 12:09 ` [PATCH v18 01/19] EDAC: Add support for EDAC device features control shiju.jose
@ 2025-01-06 12:09 ` shiju.jose
  2025-01-06 15:57   ` Borislav Petkov
                     ` (2 more replies)
  2025-01-06 12:09 ` [PATCH v18 03/19] EDAC: Add ECS " shiju.jose
                   ` (18 subsequent siblings)
  20 siblings, 3 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:09 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Add a generic EDAC scrub control to manage memory scrubbers in the system.
Devices with a scrub feature register with the EDAC device driver, which
retrieves the scrub descriptor from the EDAC scrub driver and exposes the
sysfs scrub control attributes for a scrub instance to userspace at
/sys/bus/edac/devices/<dev-name>/scrubX/.

The common sysfs scrub control interface abstracts the control of
arbitrary scrubbing functionality into a common set of functions. The
sysfs scrub attribute nodes are only present if the client driver has
implemented the corresponding attribute callback function and passed the
operations (ops) to the EDAC device driver during registration.
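For reference, the on-demand flow documented in the new sysfs ABI (write
"size", then "addr"; the write to "addr" starts the scrub) could be driven
from userspace roughly as sketched below. build_attr_path() and
scrub_sequence are local helpers for this sketch, not part of the ABI; only
the path layout and the write ordering come from the ABI document:

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the path of one scrub control attribute, e.g.
 * /sys/bus/edac/devices/cxl_mem0/scrub0/size
 */
static int build_attr_path(char *buf, size_t len, const char *dev,
			   int instance, const char *attr)
{
	int n = snprintf(buf, len, "/sys/bus/edac/devices/%s/scrub%d/%s",
			 dev, instance, attr);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/*
 * Per the ABI, the order matters for on-demand scrubbing: write "size"
 * first, then "addr"; writing "addr" kicks off the scrub.
 */
static const char *const scrub_sequence[] = { "size", "addr" };
```

A tool would open each path in scrub_sequence order, write the region size
and then the base address, and afterwards poll "addr" (non-zero while the
scrub is in progress, zero once done).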

Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 Documentation/ABI/testing/sysfs-edac-scrub |  74 +++++++
 Documentation/edac/features.rst            |   5 +
 Documentation/edac/index.rst               |   1 +
 Documentation/edac/scrub.rst               | 244 +++++++++++++++++++++
 drivers/edac/Makefile                      |   1 +
 drivers/edac/edac_device.c                 |  41 +++-
 drivers/edac/scrub.c                       | 209 ++++++++++++++++++
 include/linux/edac.h                       |  34 +++
 8 files changed, 605 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-edac-scrub
 create mode 100644 Documentation/edac/scrub.rst
 create mode 100755 drivers/edac/scrub.c

diff --git a/Documentation/ABI/testing/sysfs-edac-scrub b/Documentation/ABI/testing/sysfs-edac-scrub
new file mode 100644
index 000000000000..af14a68ee5a9
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-edac-scrub
@@ -0,0 +1,74 @@
+What:		/sys/bus/edac/devices/<dev-name>/scrubX
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		The sysfs EDAC bus devices /<dev-name>/scrubX subdirectory
+		belongs to an instance of memory scrub control feature,
+		where <dev-name> directory corresponds to a device/memory
+		region registered with the EDAC device driver for the
+		scrub control feature.
+		The sysfs scrub attr nodes are only present if the parent
+		driver has implemented the corresponding attr callback
+		function and provided the necessary operations to the EDAC
+		device driver during registration.
+
+What:		/sys/bus/edac/devices/<dev-name>/scrubX/addr
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) The base address of the memory region to be scrubbed
+		on demand. Writing the address starts scrubbing; the size
+		must be set beforehand. The addr value reads back non-zero
+		while the requested on-demand scrubbing is in progress and
+		zero otherwise.
+
+What:		/sys/bus/edac/devices/<dev-name>/scrubX/size
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) The size of the memory region to be scrubbed
+		(on-demand scrubbing).
+
+What:		/sys/bus/edac/devices/<dev-name>/scrubX/enable_background
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Start/Stop background(patrol) scrubbing if supported.
+
+What:		/sys/bus/edac/devices/<dev-name>/scrubX/enable_on_demand
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Start/Stop on-demand scrubbing the memory region
+		if supported.
+
+What:		/sys/bus/edac/devices/<dev-name>/scrubX/min_cycle_duration
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RO) Supported minimum scrub cycle duration in seconds
+		by the memory scrubber.
+
+What:		/sys/bus/edac/devices/<dev-name>/scrubX/max_cycle_duration
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RO) Supported maximum scrub cycle duration in seconds
+		by the memory scrubber.
+
+What:		/sys/bus/edac/devices/<dev-name>/scrubX/current_cycle_duration
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) The current scrub cycle duration in seconds. It must be
+		within the range supported by the memory scrubber.
+		Scrubbing has a runtime overhead, which can be reduced by
+		selecting a longer cycle duration.
diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
index f32f259ce04d..ba3ab993ee4f 100644
--- a/Documentation/edac/features.rst
+++ b/Documentation/edac/features.rst
@@ -92,3 +92,8 @@ the sysfs EDAC bus. For example, /sys/bus/edac/devices/<dev-name>/<feature>X/
 3. RAS dynamic feature controller - Userspace sample modules in rasdaemon for
 dynamic scrub/repair control to issue scrubbing/repair when excess number
 of corrected memory errors are reported in a short span of time.
+
+RAS features
+------------
+1. Memory Scrub
+Memory scrub features are documented in `Documentation/edac/scrub.rst`.
diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
index b6c265a4cffb..dfb0c9fb9ab1 100644
--- a/Documentation/edac/index.rst
+++ b/Documentation/edac/index.rst
@@ -8,3 +8,4 @@ EDAC Subsystem
    :maxdepth: 1
 
    features
+   scrub
diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
new file mode 100644
index 000000000000..5a5108b744a4
--- /dev/null
+++ b/Documentation/edac/scrub.rst
@@ -0,0 +1,244 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+EDAC Scrub Control
+===================
+
+Copyright (c) 2024 HiSilicon Limited.
+
+:Author:   Shiju Jose <shiju.jose@huawei.com>
+:License:  The GNU Free Documentation License, Version 1.2
+          (dual licensed under the GPL v2)
+:Original Reviewers:
+
+- Written for: 6.14
+
+Introduction
+------------
+Increasing DRAM size and cost have made memory subsystem reliability an
+important concern. DRAM modules are used in systems where potentially
+corrupted data could cause expensive or fatal issues. Memory errors are
+among the top hardware failures that cause server and workload crashes.
+
+Memory scrubbing is a feature where an ECC (Error-Correcting Code) engine
+reads data from each memory media location, corrects with an ECC if
+necessary and writes the corrected data back to the same memory media
+location.
+
+The memory DIMMs can be scrubbed at a configurable rate to detect
+uncorrected memory errors and attempt recovery from detected errors,
+providing the following benefits.
+
+* Proactively scrubbing memory DIMMs reduces the chance of a correctable error becoming uncorrectable.
+
+* When detected, uncorrected errors caught in unallocated memory pages are isolated and prevented from being allocated to an application or the OS.
+
+* This reduces the likelihood of software or hardware products encountering memory errors.
+
+There are 2 types of memory scrubbing:
+
+1. Background (patrol) scrubbing of the RAM while the RAM is otherwise
+idle.
+
+2. On-demand scrubbing for a specific address range or region of memory.
+
+Several types of interfaces to hardware memory scrubbers have been
+identified, such as CXL memory device patrol scrub, CXL DDR5 ECS, ACPI
+RAS2 memory scrubbing, and ACPI NVDIMM ARS (Address Range Scrub).
+
+The control mechanisms vary across different memory scrubbers. To enable
+standardized userspace tooling, there is a need to present these controls
+through a standardized ABI.
+
+Introduce a generic memory EDAC scrub control that allows users to manage
+underlying scrubbers in the system through a standardized sysfs scrub
+control interface. This common sysfs scrub control interface abstracts the
+management of various scrubbing functionalities into a unified set of
+functions.
+
+Use cases of common scrub control feature
+-----------------------------------------
+1. Several types of interfaces for hardware (HW) memory scrubbers have
+been identified, including the CXL memory device patrol scrub, CXL DDR5
+ECS, ACPI RAS2 memory scrubbing features, ACPI NVDIMM ARS (Address Range
+Scrub), and software-based memory scrubbers. Some of these scrubbers
+support control over patrol (background) scrubbing (e.g., ACPI RAS2, CXL)
+and/or on-demand scrubbing (e.g., ACPI RAS2, ACPI ARS). However, the scrub
+control interfaces vary between memory scrubbers, highlighting the need for
+a standardized, generic sysfs scrub control interface that is accessible to
+userspace for administration and use by scripts/tools.
+
+2. User-space scrub controls allow users to disable scrubbing if necessary,
+for example, to disable background patrol scrubbing or adjust the scrub
+rate for performance-aware operations where background activities need to
+be minimized or disabled.
+
+3. User-space tools enable on-demand scrubbing for specific address ranges,
+provided that the scrubber supports this functionality.
+
+4. User-space tools can also control memory DIMM scrubbing at a configurable
+scrub rate via sysfs scrub controls. This approach offers several benefits:
+
+* Detects uncorrectable memory errors early, before user access to affected memory, helping facilitate recovery.
+
+* Reduces the likelihood of correctable errors developing into uncorrectable errors.
+
+5. Policy control for hotplugged memory is necessary because there may not
+be a system-wide BIOS or similar control to manage scrub settings for a CXL
+device added after boot. Determining these settings is a policy decision,
+balancing reliability against performance, so userspace should control it.
+Therefore, a unified interface is recommended for handling this function in
+a way that aligns with other similar interfaces, rather than creating a
+separate one.
+
+Scrubbing features
+------------------
+Comparison of various scrubbing features::
+
+ ................................................................
+ .              .   ACPI    . CXL patrol.  CXL ECS  .  ARS      .
+ .  Name        .   RAS2    . scrub     .           .           .
+ ................................................................
+ .              .           .           .           .           .
+ . On-demand    . Supported . No        . No        . Supported .
+ . Scrubbing    .           .           .           .           .
+ .              .           .           .           .           .
+ ................................................................
+ .              .           .           .           .           .
+ . Background   . Supported . Supported . Supported . No        .
+ . scrubbing    .           .           .           .           .
+ .              .           .           .           .           .
+ ................................................................
+ .              .           .           .           .           .
+ . Mode of      . Scrub ctrl. per device. per memory.  Unknown  .
+ . scrubbing    . per NUMA  .           . media     .           .
+ .              . domain.   .           .           .           .
+ ................................................................
+ .              .           .           .           .           .
+ . Query scrub  . Supported . Supported . Supported . Supported .
+ . capabilities .           .           .           .           .
+ .              .           .           .           .           .
+ ................................................................
+ .              .           .           .           .           .
+ . Setting      . Supported . No        . No        . Supported .
+ . address range.           .           .           .           .
+ .              .           .           .           .           .
+ ................................................................
+ .              .           .           .           .           .
+ . Setting      . Supported . Supported . No        . No        .
+ . scrub rate   .           .           .           .           .
+ .              .           .           .           .           .
+ ................................................................
+ .              .           .           .           .           .
+ . Unit for     . Not       . in hours  . No        . No        .
+ . scrub rate   . Defined   .           .           .           .
+ .              .           .           .           .           .
+ ................................................................
+ .              . Supported .           .           .           .
+ . Scrub        . on-demand . No        . No        . Supported .
+ . status/      . scrubbing .           .           .           .
+ . Completion   . only      .           .           .           .
+ ................................................................
+ . UC error     .           .CXL general.CXL general. ACPI UCE  .
+ . reporting    . Exception .media/DRAM .media/DRAM . notify and.
+ .              .           .event/media.event/media. query     .
+ .              .           .scan?      .scan?      . ARS status.
+ ................................................................
+ .              .           .           .           .           .
+ . Support for  . Supported . Supported . Supported . No        .
+ . EDAC control .           .           .           .           .
+ .              .           .           .           .           .
+ ................................................................
+
+CXL Memory Scrubbing features
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+CXL spec r3.1 section 8.2.9.9.11.1 describes the memory device patrol scrub
+control feature. The device patrol scrub proactively locates and makes
+corrections to errors in a regular cycle. The patrol scrub control allows
+the requester to configure the patrol scrubber's input configurations.
+
+The patrol scrub control allows the requester to specify the number of
+hours in which the patrol scrub cycle must be completed, provided that
+the requested number is not less than the minimum number of hours that
+the device supports for its patrol scrub cycle. In addition, the patrol
+scrub control allows the host to enable and disable the feature, for
+example when background operations must be turned off for
+performance-sensitive workloads.
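The cycle-duration contract described above can be sketched in plain C. This is a userspace model, not driver code: `fake_scrub`, the stubbed `struct device`, and the function names are hypothetical; only the callback signatures mirror `struct edac_scrub_ops` introduced in this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* 'struct device' is opaque here; the callbacks never dereference it. */
struct device;

struct fake_scrub {			/* hypothetical driver context */
	uint32_t min_cycle;		/* minimum supported cycle, seconds */
	uint32_t max_cycle;		/* maximum supported cycle, seconds */
	uint32_t cur_cycle;		/* current cycle, seconds */
};

static int fake_get_min_cycle(struct device *dev, void *drv_data, uint32_t *min)
{
	*min = ((struct fake_scrub *)drv_data)->min_cycle;
	return 0;
}

static int fake_get_cycle_duration(struct device *dev, void *drv_data, uint32_t *cycle)
{
	*cycle = ((struct fake_scrub *)drv_data)->cur_cycle;
	return 0;
}

static int fake_set_cycle_duration(struct device *dev, void *drv_data, uint32_t cycle)
{
	struct fake_scrub *fs = drv_data;

	/* Model of the CXL rule: a requested cycle shorter than the
	 * device minimum (or longer than the maximum) is rejected. */
	if (cycle < fs->min_cycle || cycle > fs->max_cycle)
		return -EINVAL;
	fs->cur_cycle = cycle;
	return 0;
}

/* Exercise the callbacks; returns 0 when every step behaves as expected. */
static int scrub_cycle_demo(void)
{
	struct fake_scrub fs = { 12, 48, 24 };
	uint32_t v;

	if (fake_get_min_cycle(0, &fs, &v) || v != 12)
		return 1;
	if (fake_set_cycle_duration(0, &fs, 6) != -EINVAL)	/* below min */
		return 2;
	if (fake_set_cycle_duration(0, &fs, 36))		/* in range */
		return 3;
	if (fake_get_cycle_duration(0, &fs, &v) || v != 36)
		return 4;
	return 0;
}
```

The real CXL and RAS2 drivers in this series implement the same callbacks against device mailbox commands and PCC messages respectively.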
+
+Error Check Scrub (ECS)
+~~~~~~~~~~~~~~~~~~~~~~~
+CXL spec r3.1 section 8.2.9.9.11.2 describes the Error Check Scrub (ECS)
+feature. ECS is defined in the JEDEC DDR5 SDRAM Specification (JESD79-5)
+and allows the DRAM to internally read, correct single-bit errors, and
+write back corrected data bits to the DRAM array while providing
+transparency into error counts.
+
+A DDR5 device contains a number of memory media FRUs (Field Replaceable
+Units). The DDR5 ECS feature, and thus the ECS control driver, supports
+configuring the ECS parameters per FRU.
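One of those per-FRU parameters is the error threshold; the ABI documented later in this series allows only the counts 256, 1024, and 4096 per Gbit, with all other values reserved. A driver's `set_threshold` callback would reject reserved values; a minimal sketch (the helper name is hypothetical, not part of the posted driver):

```c
#include <assert.h>
#include <stdint.h>

/* Accept only the DDR5 ECS threshold counts per Gbit documented in
 * sysfs-edac-ecs: 256, 1024 and 4096. Everything else is reserved. */
static int ecs_threshold_valid(uint32_t threshold)
{
	return threshold == 256 || threshold == 1024 || threshold == 4096;
}
```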
+
+ACPI RAS2 Hardware-based Memory Scrubbing
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ACPI spec 6.5 section 5.2.21 describes the ACPI RAS2 table, which
+provides interfaces for platform RAS features and supports independent
+RAS controls and capabilities for a given RAS feature for multiple
+instances of the same component in a given system.
+Memory RAS features apply to RAS capabilities, controls and operations
+that are specific to memory. RAS2 PCC sub-spaces for memory-specific RAS
+features have a Feature Type of 0x00 (Memory).
+
+The platform can use the hardware-based memory scrubbing feature to expose
+controls and capabilities associated with hardware-based memory scrub
+engines. The RAS2 memory scrubbing feature supports the following, as per
+the specification:
+
+* Independent memory scrubbing controls for each NUMA domain, identified using its proximity domain.
+
+* Provision for background (patrol) scrubbing of the entire memory system, as well as on-demand scrubbing for a specific region of memory.
+
+ACPI Address Range Scrubbing (ARS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ACPI spec 6.5 section 9.19.7.2 describes Address Range Scrubbing (ARS).
+ARS allows the platform to communicate memory errors to system software.
+This capability allows system software to prevent accesses to addresses
+with uncorrectable errors in memory. ARS functions manage all NVDIMMs
+present in the system. Only one scrub can be in progress system wide
+at any given time.
+The following functions are supported, as per the specification:
+
+1. Query ARS Capabilities for a given address range; indicates whether the
+platform supports the ACPI NVDIMM Root Device Unconsumed Error Notification.
+
+2. Start ARS triggers an Address Range Scrub for the given memory range.
+Address scrubbing can be done for volatile memory, persistent memory, or both.
+
+3. Query ARS Status command allows software to get the status of ARS,
+including the progress of ARS and ARS error record.
+
+4. Clear Uncorrectable Error.
+
+5. Translate SPA.
+
+6. ARS Error Inject, etc.
+
+The kernel already supports a control interface for ARS; ARS is
+currently not supported in EDAC.
+
+The File System
+---------------
+
+The control attributes of a registered scrubber instance can be
+accessed under
+
+/sys/bus/edac/devices/<dev-name>/scrubX/
+
+sysfs
+-----
+
+Sysfs files are documented in
+
+`Documentation/ABI/testing/sysfs-edac-scrub`.
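Which of those sysfs attributes actually appear, and with what mode, is decided by the `scrub_attr_visible()` callback in this series: an attribute is hidden when the driver supplied no getter, read-only when only a getter exists, and keeps its declared read-write mode when both getter and setter are present. A userspace model of that rule (0644 stands in for the attribute's declared `a->mode`):

```c
#include <assert.h>

/* Mode an attribute ends up with, given which ops the driver provided. */
static unsigned int attr_mode(int has_get, int has_set)
{
	if (!has_get)
		return 0;		/* attribute not created at all */
	return has_set ? 0644 : 0444;	/* RW if settable, else RO */
}
```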
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index f9cf19d8d13d..a162726cc6b9 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_EDAC)			:= edac_core.o
 
 edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
 edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
+edac_core-y	+= scrub.o
 
 edac_core-$(CONFIG_EDAC_DEBUG)		+= debugfs.o
 
diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
index 9fce46dd7405..60b20eae01e8 100644
--- a/drivers/edac/edac_device.c
+++ b/drivers/edac/edac_device.c
@@ -575,6 +575,7 @@ static void edac_dev_release(struct device *dev)
 {
 	struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev);
 
+	kfree(ctx->scrub);
 	kfree(ctx->dev.groups);
 	kfree(ctx);
 }
@@ -608,8 +609,10 @@ int edac_dev_register(struct device *parent, char *name,
 		      const struct edac_dev_feature *ras_features)
 {
 	const struct attribute_group **ras_attr_groups;
+	struct edac_dev_data *dev_data;
 	struct edac_dev_feat_ctx *ctx;
 	int attr_gcnt = 0;
+	int scrub_cnt = 0;
 	int ret, feat;
 
 	if (!parent || !name || !num_features || !ras_features)
@@ -618,7 +621,10 @@ int edac_dev_register(struct device *parent, char *name,
 	/* Double parse to make space for attributes */
 	for (feat = 0; feat < num_features; feat++) {
 		switch (ras_features[feat].ft_type) {
-		/* Add feature specific code */
+		case RAS_FEAT_SCRUB:
+			attr_gcnt++;
+			scrub_cnt++;
+			break;
 		default:
 			return -EINVAL;
 		}
@@ -634,13 +640,38 @@ int edac_dev_register(struct device *parent, char *name,
 		goto ctx_free;
 	}
 
+	if (scrub_cnt) {
+		ctx->scrub = kcalloc(scrub_cnt, sizeof(*ctx->scrub), GFP_KERNEL);
+		if (!ctx->scrub) {
+			ret = -ENOMEM;
+			goto groups_free;
+		}
+	}
+
 	attr_gcnt = 0;
+	scrub_cnt = 0;
 	for (feat = 0; feat < num_features; feat++, ras_features++) {
 		switch (ras_features->ft_type) {
-		/* Add feature specific code */
+		case RAS_FEAT_SCRUB:
+			if (!ras_features->scrub_ops ||
+			    scrub_cnt != ras_features->instance)
+				goto data_mem_free;
+
+			dev_data = &ctx->scrub[scrub_cnt];
+			dev_data->instance = scrub_cnt;
+			dev_data->scrub_ops = ras_features->scrub_ops;
+			dev_data->private = ras_features->ctx;
+			ret = edac_scrub_get_desc(parent, &ras_attr_groups[attr_gcnt],
+						  ras_features->instance);
+			if (ret)
+				goto data_mem_free;
+
+			scrub_cnt++;
+			attr_gcnt++;
+			break;
 		default:
 			ret = -EINVAL;
-			goto groups_free;
+			goto data_mem_free;
 		}
 	}
 
@@ -653,7 +684,7 @@ int edac_dev_register(struct device *parent, char *name,
 
 	ret = dev_set_name(&ctx->dev, name);
 	if (ret)
-		goto groups_free;
+		goto data_mem_free;
 
 	ret = device_register(&ctx->dev);
 	if (ret) {
@@ -663,6 +694,8 @@ int edac_dev_register(struct device *parent, char *name,
 
 	return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev);
 
+data_mem_free:
+	kfree(ctx->scrub);
 groups_free:
 	kfree(ras_attr_groups);
 ctx_free:
diff --git a/drivers/edac/scrub.c b/drivers/edac/scrub.c
new file mode 100644
index 000000000000..3978201c4bfc
--- /dev/null
+++ b/drivers/edac/scrub.c
@@ -0,0 +1,209 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * The generic EDAC scrub driver controls the memory scrubbers in the
+ * system. The common sysfs scrub interface abstracts the control of
+ * various arbitrary scrubbing functionalities into a unified set of
+ * functions.
+ *
+ * Copyright (c) 2024 HiSilicon Limited.
+ */
+
+#include <linux/edac.h>
+
+enum edac_scrub_attributes {
+	SCRUB_ADDRESS,
+	SCRUB_SIZE,
+	SCRUB_ENABLE_BACKGROUND,
+	SCRUB_MIN_CYCLE_DURATION,
+	SCRUB_MAX_CYCLE_DURATION,
+	SCRUB_CUR_CYCLE_DURATION,
+	SCRUB_MAX_ATTRS
+};
+
+struct edac_scrub_dev_attr {
+	struct device_attribute dev_attr;
+	u8 instance;
+};
+
+struct edac_scrub_context {
+	char name[EDAC_FEAT_NAME_LEN];
+	struct edac_scrub_dev_attr scrub_dev_attr[SCRUB_MAX_ATTRS];
+	struct attribute *scrub_attrs[SCRUB_MAX_ATTRS + 1];
+	struct attribute_group group;
+};
+
+#define TO_SCRUB_DEV_ATTR(_dev_attr)      \
+		container_of(_dev_attr, struct edac_scrub_dev_attr, dev_attr)
+
+#define EDAC_SCRUB_ATTR_SHOW(attrib, cb, type, format)				\
+static ssize_t attrib##_show(struct device *ras_feat_dev,			\
+			     struct device_attribute *attr, char *buf)		\
+{										\
+	u8 inst = TO_SCRUB_DEV_ATTR(attr)->instance;				\
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
+	const struct edac_scrub_ops *ops = ctx->scrub[inst].scrub_ops;		\
+	type data;								\
+	int ret;								\
+										\
+	ret = ops->cb(ras_feat_dev->parent, ctx->scrub[inst].private, &data);	\
+	if (ret)								\
+		return ret;							\
+										\
+	return sysfs_emit(buf, format, data);					\
+}
+
+EDAC_SCRUB_ATTR_SHOW(addr, read_addr, u64, "0x%llx\n")
+EDAC_SCRUB_ATTR_SHOW(size, read_size, u64, "0x%llx\n")
+EDAC_SCRUB_ATTR_SHOW(enable_background, get_enabled_bg, bool, "%u\n")
+EDAC_SCRUB_ATTR_SHOW(min_cycle_duration, get_min_cycle, u32, "%u\n")
+EDAC_SCRUB_ATTR_SHOW(max_cycle_duration, get_max_cycle, u32, "%u\n")
+EDAC_SCRUB_ATTR_SHOW(current_cycle_duration, get_cycle_duration, u32, "%u\n")
+
+#define EDAC_SCRUB_ATTR_STORE(attrib, cb, type, conv_func)			\
+static ssize_t attrib##_store(struct device *ras_feat_dev,			\
+			      struct device_attribute *attr,			\
+			      const char *buf, size_t len)			\
+{										\
+	u8 inst = TO_SCRUB_DEV_ATTR(attr)->instance;				\
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
+	const struct edac_scrub_ops *ops = ctx->scrub[inst].scrub_ops;		\
+	type data;								\
+	int ret;								\
+										\
+	ret = conv_func(buf, 0, &data);						\
+	if (ret < 0)								\
+		return ret;							\
+										\
+	ret = ops->cb(ras_feat_dev->parent, ctx->scrub[inst].private, data);	\
+	if (ret)								\
+		return ret;							\
+										\
+	return len;								\
+}
+
+EDAC_SCRUB_ATTR_STORE(addr, write_addr, u64, kstrtou64)
+EDAC_SCRUB_ATTR_STORE(size, write_size, u64, kstrtou64)
+EDAC_SCRUB_ATTR_STORE(enable_background, set_enabled_bg, unsigned long, kstrtoul)
+EDAC_SCRUB_ATTR_STORE(current_cycle_duration, set_cycle_duration, unsigned long, kstrtoul)
+
+static umode_t scrub_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
+{
+	struct device *ras_feat_dev = kobj_to_dev(kobj);
+	struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr);
+	u8 inst = TO_SCRUB_DEV_ATTR(dev_attr)->instance;
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
+	const struct edac_scrub_ops *ops = ctx->scrub[inst].scrub_ops;
+
+	switch (attr_id) {
+	case SCRUB_ADDRESS:
+		if (ops->read_addr) {
+			if (ops->write_addr)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case SCRUB_SIZE:
+		if (ops->read_size) {
+			if (ops->write_size)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case SCRUB_ENABLE_BACKGROUND:
+		if (ops->get_enabled_bg) {
+			if (ops->set_enabled_bg)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case SCRUB_MIN_CYCLE_DURATION:
+		if (ops->get_min_cycle)
+			return a->mode;
+		break;
+	case SCRUB_MAX_CYCLE_DURATION:
+		if (ops->get_max_cycle)
+			return a->mode;
+		break;
+	case SCRUB_CUR_CYCLE_DURATION:
+		if (ops->get_cycle_duration) {
+			if (ops->set_cycle_duration)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+#define EDAC_SCRUB_ATTR_RO(_name, _instance)       \
+	((struct edac_scrub_dev_attr) { .dev_attr = __ATTR_RO(_name), \
+					.instance = _instance })
+
+#define EDAC_SCRUB_ATTR_WO(_name, _instance)       \
+	((struct edac_scrub_dev_attr) { .dev_attr = __ATTR_WO(_name), \
+					.instance = _instance })
+
+#define EDAC_SCRUB_ATTR_RW(_name, _instance)       \
+	((struct edac_scrub_dev_attr) { .dev_attr = __ATTR_RW(_name), \
+					.instance = _instance })
+
+static int scrub_create_desc(struct device *scrub_dev,
+			     const struct attribute_group **attr_groups, u8 instance)
+{
+	struct edac_scrub_context *scrub_ctx;
+	struct attribute_group *group;
+	int i;
+	struct edac_scrub_dev_attr dev_attr[] = {
+		[SCRUB_ADDRESS] = EDAC_SCRUB_ATTR_RW(addr, instance),
+		[SCRUB_SIZE] = EDAC_SCRUB_ATTR_RW(size, instance),
+		[SCRUB_ENABLE_BACKGROUND] = EDAC_SCRUB_ATTR_RW(enable_background, instance),
+		[SCRUB_MIN_CYCLE_DURATION] = EDAC_SCRUB_ATTR_RO(min_cycle_duration, instance),
+		[SCRUB_MAX_CYCLE_DURATION] = EDAC_SCRUB_ATTR_RO(max_cycle_duration, instance),
+		[SCRUB_CUR_CYCLE_DURATION] = EDAC_SCRUB_ATTR_RW(current_cycle_duration, instance)
+	};
+
+	scrub_ctx = devm_kzalloc(scrub_dev, sizeof(*scrub_ctx), GFP_KERNEL);
+	if (!scrub_ctx)
+		return -ENOMEM;
+
+	group = &scrub_ctx->group;
+	for (i = 0; i < SCRUB_MAX_ATTRS; i++) {
+		memcpy(&scrub_ctx->scrub_dev_attr[i], &dev_attr[i], sizeof(dev_attr[i]));
+		scrub_ctx->scrub_attrs[i] = &scrub_ctx->scrub_dev_attr[i].dev_attr.attr;
+	}
+	sprintf(scrub_ctx->name, "%s%d", "scrub", instance);
+	group->name = scrub_ctx->name;
+	group->attrs = scrub_ctx->scrub_attrs;
+	group->is_visible  = scrub_attr_visible;
+
+	attr_groups[0] = group;
+
+	return 0;
+}
+
+/**
+ * edac_scrub_get_desc - get EDAC scrub descriptors
+ * @scrub_dev: client device, with scrub support
+ * @attr_groups: pointer to attribute group container
+ * @instance: device's scrub instance number.
+ *
+ * Return:
+ *  * %0	- Success.
+ *  * %-EINVAL	- Invalid parameters passed.
+ *  * %-ENOMEM	- Dynamic memory allocation failed.
+ */
+int edac_scrub_get_desc(struct device *scrub_dev,
+			const struct attribute_group **attr_groups, u8 instance)
+{
+	if (!scrub_dev || !attr_groups)
+		return -EINVAL;
+
+	return scrub_create_desc(scrub_dev, attr_groups, instance);
+}
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 521b17113d4d..ace8b10bb028 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -666,11 +666,43 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
 
 /* RAS feature type */
 enum edac_dev_feat {
+	RAS_FEAT_SCRUB,
 	RAS_FEAT_MAX
 };
 
+/**
+ * struct edac_scrub_ops - scrub device operations (all elements optional)
+ * @read_addr: read the base address of the scrubbing range.
+ * @read_size: read the size of the scrubbing range.
+ * @write_addr: set the base address of the scrubbing range.
+ * @write_size: set the size of the scrubbing range.
+ * @get_enabled_bg: check if currently performing background scrub.
+ * @set_enabled_bg: start or stop a bg-scrub.
+ * @get_min_cycle: get minimum supported scrub cycle duration in seconds.
+ * @get_max_cycle: get maximum supported scrub cycle duration in seconds.
+ * @get_cycle_duration: get current scrub cycle duration in seconds.
+ * @set_cycle_duration: set current scrub cycle duration in seconds.
+ */
+struct edac_scrub_ops {
+	int (*read_addr)(struct device *dev, void *drv_data, u64 *base);
+	int (*read_size)(struct device *dev, void *drv_data, u64 *size);
+	int (*write_addr)(struct device *dev, void *drv_data, u64 base);
+	int (*write_size)(struct device *dev, void *drv_data, u64 size);
+	int (*get_enabled_bg)(struct device *dev, void *drv_data, bool *enable);
+	int (*set_enabled_bg)(struct device *dev, void *drv_data, bool enable);
+	int (*get_min_cycle)(struct device *dev, void *drv_data,  u32 *min);
+	int (*get_max_cycle)(struct device *dev, void *drv_data,  u32 *max);
+	int (*get_cycle_duration)(struct device *dev, void *drv_data, u32 *cycle);
+	int (*set_cycle_duration)(struct device *dev, void *drv_data, u32 cycle);
+};
+
+int edac_scrub_get_desc(struct device *scrub_dev,
+			const struct attribute_group **attr_groups,
+			u8 instance);
+
 /* EDAC device feature information structure */
 struct edac_dev_data {
+	const struct edac_scrub_ops *scrub_ops;
 	u8 instance;
 	void *private;
 };
@@ -678,11 +710,13 @@ struct edac_dev_data {
 struct edac_dev_feat_ctx {
 	struct device dev;
 	void *private;
+	struct edac_dev_data *scrub;
 };
 
 struct edac_dev_feature {
 	enum edac_dev_feat ft_type;
 	u8 instance;
+	const struct edac_scrub_ops *scrub_ops;
 	void *ctx;
 };
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v18 03/19] EDAC: Add ECS control feature
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
  2025-01-06 12:09 ` [PATCH v18 01/19] EDAC: Add support for EDAC device features control shiju.jose
  2025-01-06 12:09 ` [PATCH v18 02/19] EDAC: Add scrub control feature shiju.jose
@ 2025-01-06 12:09 ` shiju.jose
  2025-01-13 16:09   ` Mauro Carvalho Chehab
  2025-01-06 12:10 ` [PATCH v18 04/19] EDAC: Add memory repair " shiju.jose
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:09 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Add EDAC ECS (Error Check Scrub) control to manage a memory device's
ECS feature.

Error Check Scrub (ECS) is a feature defined in the JEDEC DDR5 SDRAM
Specification (JESD79-5). It allows the DRAM to internally read, correct
single-bit errors, and write back corrected data bits to the DRAM array
while providing transparency into error counts.

A DDR5 device contains a number of memory media FRUs (Field Replaceable
Units). The DDR5 ECS feature, and thus the ECS control driver, supports
configuring the ECS parameters per FRU.

Memory devices supporting the ECS feature register with the EDAC device
driver, which retrieves the ECS descriptor from the EDAC ECS driver.
This driver exposes sysfs ECS control attributes to userspace via
/sys/bus/edac/devices/<dev-name>/ecs_fruX/.

The common sysfs ECS control interface abstracts the control of an
arbitrary ECS functionality to a common set of functions.

Support for the ECS feature is added separately because the control
attributes of the DDR5 ECS feature differ from those of the scrub
feature.

The sysfs ECS attribute nodes are only present if the client driver
has implemented the corresponding attribute callback function and
passed the necessary operations to the EDAC RAS feature driver during
registration.

Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 Documentation/ABI/testing/sysfs-edac-ecs |  63 +++++++
 Documentation/edac/scrub.rst             |   2 +
 drivers/edac/Makefile                    |   2 +-
 drivers/edac/ecs.c                       | 207 +++++++++++++++++++++++
 drivers/edac/edac_device.c               |  17 ++
 include/linux/edac.h                     |  41 ++++-
 6 files changed, 329 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-edac-ecs
 create mode 100644 drivers/edac/ecs.c

diff --git a/Documentation/ABI/testing/sysfs-edac-ecs b/Documentation/ABI/testing/sysfs-edac-ecs
new file mode 100644
index 000000000000..1160bec0603f
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-edac-ecs
@@ -0,0 +1,63 @@
+What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		The sysfs EDAC bus devices /<dev-name>/ecs_fruX subdirectory
+		pertains to the memory media ECS (Error Check Scrub) control
+		feature, where <dev-name> directory corresponds to a device
+		registered with the EDAC device driver for the ECS feature.
+		/ecs_fruX belongs to the media FRUs (Field Replaceable Unit)
+		under the memory device.
+		The sysfs ECS attr nodes are only present if the parent
+		driver has implemented the corresponding attr callback
+		function and provided the necessary operations to the EDAC
+		device driver during registration.
+
+What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/log_entry_type
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) The log entry type of how the DDR5 ECS log is reported.
+		0 - per DRAM.
+		1 - per memory media FRU.
+		All other values are reserved.
+
+What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/mode
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) The mode of how the DDR5 ECS counts the errors.
+		Error count is tracked based on two different modes
+		selected by DDR5 ECS Control Feature - Codeword mode and
+		Row Count mode. If the ECS is under Codeword mode, then
+		the error count increments each time a codeword with check
+		bit errors is detected. If the ECS is under Row Count mode,
+		then the error counter increments each time a row with
+		check bit errors is detected.
+		0 - ECS counts rows in the memory media that have ECC errors.
+		1 - ECS counts codewords with errors, specifically, it counts
+		the number of ECC-detected errors in the memory media.
+		All other values are reserved.
+
+What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/reset
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(WO) Reset the ECS ECC counter.
+		1 - reset ECC counter to the default value.
+		All other values are reserved.
+
+What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/threshold
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) DDR5 ECS threshold count per gigabit of memory cells.
+		The ECS error count is subject to the ECS Threshold count
+		per Gbit, which masks error counts less than the Threshold.
+		Supported values are 256, 1024 and 4096.
+		All other values are reserved.
diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
index 5a5108b744a4..5640f9aeee38 100644
--- a/Documentation/edac/scrub.rst
+++ b/Documentation/edac/scrub.rst
@@ -242,3 +242,5 @@ sysfs
 Sysfs files are documented in
 
 `Documentation/ABI/testing/sysfs-edac-scrub`.
+
+`Documentation/ABI/testing/sysfs-edac-ecs`.
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index a162726cc6b9..3a49304860f0 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC)			:= edac_core.o
 
 edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
 edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
-edac_core-y	+= scrub.o
+edac_core-y	+= scrub.o ecs.o
 
 edac_core-$(CONFIG_EDAC_DEBUG)		+= debugfs.o
 
diff --git a/drivers/edac/ecs.c b/drivers/edac/ecs.c
new file mode 100644
index 000000000000..dae8e5ae881b
--- /dev/null
+++ b/drivers/edac/ecs.c
@@ -0,0 +1,207 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * The generic ECS driver is designed to support control of on-die error
+ * check scrub (e.g., DDR5 ECS). The common sysfs ECS interface abstracts
+ * the control of various ECS functionalities into a unified set of functions.
+ *
+ * Copyright (c) 2024 HiSilicon Limited.
+ */
+
+#include <linux/edac.h>
+
+#define EDAC_ECS_FRU_NAME "ecs_fru"
+
+enum edac_ecs_attributes {
+	ECS_LOG_ENTRY_TYPE,
+	ECS_MODE,
+	ECS_RESET,
+	ECS_THRESHOLD,
+	ECS_MAX_ATTRS
+};
+
+struct edac_ecs_dev_attr {
+	struct device_attribute dev_attr;
+	int fru_id;
+};
+
+struct edac_ecs_fru_context {
+	char name[EDAC_FEAT_NAME_LEN];
+	struct edac_ecs_dev_attr dev_attr[ECS_MAX_ATTRS];
+	struct attribute *ecs_attrs[ECS_MAX_ATTRS + 1];
+	struct attribute_group group;
+};
+
+struct edac_ecs_context {
+	u16 num_media_frus;
+	struct edac_ecs_fru_context *fru_ctxs;
+};
+
+#define TO_ECS_DEV_ATTR(_dev_attr)	\
+	container_of(_dev_attr, struct edac_ecs_dev_attr, dev_attr)
+
+#define EDAC_ECS_ATTR_SHOW(attrib, cb, type, format)				\
+static ssize_t attrib##_show(struct device *ras_feat_dev,			\
+			     struct device_attribute *attr, char *buf)		\
+{										\
+	struct edac_ecs_dev_attr *dev_attr = TO_ECS_DEV_ATTR(attr);		\
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
+	const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops;			\
+	type data;								\
+	int ret;								\
+										\
+	ret = ops->cb(ras_feat_dev->parent, ctx->ecs.private,			\
+		      dev_attr->fru_id, &data);					\
+	if (ret)								\
+		return ret;							\
+										\
+	return sysfs_emit(buf, format, data);					\
+}
+
+EDAC_ECS_ATTR_SHOW(log_entry_type, get_log_entry_type, u32, "%u\n")
+EDAC_ECS_ATTR_SHOW(mode, get_mode, u32, "%u\n")
+EDAC_ECS_ATTR_SHOW(threshold, get_threshold, u32, "%u\n")
+
+#define EDAC_ECS_ATTR_STORE(attrib, cb, type, conv_func)			\
+static ssize_t attrib##_store(struct device *ras_feat_dev,			\
+			      struct device_attribute *attr,			\
+			      const char *buf, size_t len)			\
+{										\
+	struct edac_ecs_dev_attr *dev_attr = TO_ECS_DEV_ATTR(attr);		\
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
+	const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops;			\
+	type data;								\
+	int ret;								\
+										\
+	ret = conv_func(buf, 0, &data);						\
+	if (ret < 0)								\
+		return ret;							\
+										\
+	ret = ops->cb(ras_feat_dev->parent, ctx->ecs.private,			\
+		      dev_attr->fru_id, data);					\
+	if (ret)								\
+		return ret;							\
+										\
+	return len;								\
+}
+
+EDAC_ECS_ATTR_STORE(log_entry_type, set_log_entry_type, unsigned long, kstrtoul)
+EDAC_ECS_ATTR_STORE(mode, set_mode, unsigned long, kstrtoul)
+EDAC_ECS_ATTR_STORE(reset, reset, unsigned long, kstrtoul)
+EDAC_ECS_ATTR_STORE(threshold, set_threshold, unsigned long, kstrtoul)
+
+static umode_t ecs_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
+{
+	struct device *ras_feat_dev = kobj_to_dev(kobj);
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
+	const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops;
+
+	switch (attr_id) {
+	case ECS_LOG_ENTRY_TYPE:
+		if (ops->get_log_entry_type)  {
+			if (ops->set_log_entry_type)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case ECS_MODE:
+		if (ops->get_mode) {
+			if (ops->set_mode)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case ECS_RESET:
+		if (ops->reset)
+			return a->mode;
+		break;
+	case ECS_THRESHOLD:
+		if (ops->get_threshold) {
+			if (ops->set_threshold)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+#define EDAC_ECS_ATTR_RO(_name, _fru_id)       \
+	((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_RO(_name), \
+				     .fru_id = _fru_id })
+
+#define EDAC_ECS_ATTR_WO(_name, _fru_id)       \
+	((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_WO(_name), \
+				     .fru_id = _fru_id })
+
+#define EDAC_ECS_ATTR_RW(_name, _fru_id)       \
+	((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_RW(_name), \
+				     .fru_id = _fru_id })
+
+static int ecs_create_desc(struct device *ecs_dev,
+			   const struct attribute_group **attr_groups, u16 num_media_frus)
+{
+	struct edac_ecs_context *ecs_ctx;
+	u32 fru;
+
+	ecs_ctx = devm_kzalloc(ecs_dev, sizeof(*ecs_ctx), GFP_KERNEL);
+	if (!ecs_ctx)
+		return -ENOMEM;
+
+	ecs_ctx->num_media_frus = num_media_frus;
+	ecs_ctx->fru_ctxs = devm_kcalloc(ecs_dev, num_media_frus,
+					 sizeof(*ecs_ctx->fru_ctxs),
+					 GFP_KERNEL);
+	if (!ecs_ctx->fru_ctxs)
+		return -ENOMEM;
+
+	for (fru = 0; fru < num_media_frus; fru++) {
+		struct edac_ecs_fru_context *fru_ctx = &ecs_ctx->fru_ctxs[fru];
+		struct attribute_group *group = &fru_ctx->group;
+		int i;
+
+		fru_ctx->dev_attr[ECS_LOG_ENTRY_TYPE] =
+					EDAC_ECS_ATTR_RW(log_entry_type, fru);
+		fru_ctx->dev_attr[ECS_MODE] = EDAC_ECS_ATTR_RW(mode, fru);
+		fru_ctx->dev_attr[ECS_RESET] = EDAC_ECS_ATTR_WO(reset, fru);
+		fru_ctx->dev_attr[ECS_THRESHOLD] =
+					EDAC_ECS_ATTR_RW(threshold, fru);
+
+		for (i = 0; i < ECS_MAX_ATTRS; i++)
+			fru_ctx->ecs_attrs[i] = &fru_ctx->dev_attr[i].dev_attr.attr;
+
+		sprintf(fru_ctx->name, "%s%d", EDAC_ECS_FRU_NAME, fru);
+		group->name = fru_ctx->name;
+		group->attrs = fru_ctx->ecs_attrs;
+		group->is_visible  = ecs_attr_visible;
+
+		attr_groups[fru] = group;
+	}
+
+	return 0;
+}
+
+/**
+ * edac_ecs_get_desc - get EDAC ECS descriptors
+ * @ecs_dev: client device, supports ECS feature
+ * @attr_groups: pointer to attribute group container
+ * @num_media_frus: number of media FRUs in the device
+ *
+ * Return:
+ *  * %0	- Success.
+ *  * %-EINVAL	- Invalid parameters passed.
+ *  * %-ENOMEM	- Dynamic memory allocation failed.
+ */
+int edac_ecs_get_desc(struct device *ecs_dev,
+		      const struct attribute_group **attr_groups, u16 num_media_frus)
+{
+	if (!ecs_dev || !attr_groups || !num_media_frus)
+		return -EINVAL;
+
+	return ecs_create_desc(ecs_dev, attr_groups, num_media_frus);
+}
diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
index 60b20eae01e8..1c1142a2e4e4 100644
--- a/drivers/edac/edac_device.c
+++ b/drivers/edac/edac_device.c
@@ -625,6 +625,9 @@ int edac_dev_register(struct device *parent, char *name,
 			attr_gcnt++;
 			scrub_cnt++;
 			break;
+		case RAS_FEAT_ECS:
+			attr_gcnt += ras_features[feat].ecs_info.num_media_frus;
+			break;
 		default:
 			return -EINVAL;
 		}
@@ -669,6 +672,20 @@ int edac_dev_register(struct device *parent, char *name,
 			scrub_cnt++;
 			attr_gcnt++;
 			break;
+		case RAS_FEAT_ECS:
+			if (!ras_features->ecs_ops)
+				goto data_mem_free;
+
+			dev_data = &ctx->ecs;
+			dev_data->ecs_ops = ras_features->ecs_ops;
+			dev_data->private = ras_features->ctx;
+			ret = edac_ecs_get_desc(parent, &ras_attr_groups[attr_gcnt],
+						ras_features->ecs_info.num_media_frus);
+			if (ret)
+				goto data_mem_free;
+
+			attr_gcnt += ras_features->ecs_info.num_media_frus;
+			break;
 		default:
 			ret = -EINVAL;
 			goto data_mem_free;
diff --git a/include/linux/edac.h b/include/linux/edac.h
index ace8b10bb028..979e91426701 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -667,6 +667,7 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
 /* RAS feature type */
 enum edac_dev_feat {
 	RAS_FEAT_SCRUB,
+	RAS_FEAT_ECS,
 	RAS_FEAT_MAX
 };
 
@@ -700,9 +701,40 @@ int edac_scrub_get_desc(struct device *scrub_dev,
 			const struct attribute_group **attr_groups,
 			u8 instance);
 
+/**
+ * struct edac_ecs_ops - ECS device operations (all elements optional)
+ * @get_log_entry_type: read the log entry type value.
+ * @set_log_entry_type: set the log entry type value.
+ * @get_mode: read the mode value.
+ * @set_mode: set the mode value.
+ * @reset: reset the ECS counter.
+ * @get_threshold: read the threshold count per gigabits of memory cells.
+ * @set_threshold: set the threshold count per gigabits of memory cells.
+ */
+struct edac_ecs_ops {
+	int (*get_log_entry_type)(struct device *dev, void *drv_data, int fru_id, u32 *val);
+	int (*set_log_entry_type)(struct device *dev, void *drv_data, int fru_id, u32 val);
+	int (*get_mode)(struct device *dev, void *drv_data, int fru_id, u32 *val);
+	int (*set_mode)(struct device *dev, void *drv_data, int fru_id, u32 val);
+	int (*reset)(struct device *dev, void *drv_data, int fru_id, u32 val);
+	int (*get_threshold)(struct device *dev, void *drv_data, int fru_id, u32 *threshold);
+	int (*set_threshold)(struct device *dev, void *drv_data, int fru_id, u32 threshold);
+};
+
+struct edac_ecs_ex_info {
+	u16 num_media_frus;
+};
+
+int edac_ecs_get_desc(struct device *ecs_dev,
+		      const struct attribute_group **attr_groups,
+		      u16 num_media_frus);
+
 /* EDAC device feature information structure */
 struct edac_dev_data {
-	const struct edac_scrub_ops *scrub_ops;
+	union {
+		const struct edac_scrub_ops *scrub_ops;
+		const struct edac_ecs_ops *ecs_ops;
+	};
 	u8 instance;
 	void *private;
 };
@@ -711,13 +743,18 @@ struct edac_dev_feat_ctx {
 	struct device dev;
 	void *private;
 	struct edac_dev_data *scrub;
+	struct edac_dev_data ecs;
 };
 
 struct edac_dev_feature {
 	enum edac_dev_feat ft_type;
 	u8 instance;
-	const struct edac_scrub_ops *scrub_ops;
+	union {
+		const struct edac_scrub_ops *scrub_ops;
+		const struct edac_ecs_ops *ecs_ops;
+	};
 	void *ctx;
+	struct edac_ecs_ex_info ecs_info;
 };
 
 int edac_dev_register(struct device *parent, char *dev_name,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (2 preceding siblings ...)
  2025-01-06 12:09 ` [PATCH v18 03/19] EDAC: Add ECS " shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-09  9:19   ` Borislav Petkov
                     ` (2 more replies)
  2025-01-06 12:10 ` [PATCH v18 05/19] ACPI:RAS2: Add ACPI RAS2 driver shiju.jose
                   ` (16 subsequent siblings)
  20 siblings, 3 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Add a generic EDAC memory repair control driver to manage memory repairs
in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing
features.

For example, a CXL device with DRAM components that support PPR features
may implement PPR maintenance operations. DRAM components may support two
types of PPR, hard PPR, for a permanent row repair, and soft PPR, for a
temporary row repair. Soft PPR is much faster than hard PPR, but the repair
is lost with a power cycle.
Similarly, a CXL memory device may support soft and hard memory sparing at
cacheline, row, bank and rank granularities. Memory sparing is defined as
a repair function that replaces a portion of memory with a portion of
functional memory at that same granularity.
When a CXL device detects an error in a memory component, it may inform the
host of the need for a repair maintenance operation by using an event record
where the "maintenance needed" flag is set. The event record contains the
device physical address (DPA) and other attributes of the memory to repair
(such as
channel, sub-channel, bank group, bank, rank, row, column etc). The kernel
will report the corresponding CXL general media or DRAM trace event to
userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
operation in response to the device request via the sysfs repair control.

A device with memory repair features registers with the EDAC device driver,
which retrieves the memory repair descriptor from the EDAC memory repair
driver and exposes the sysfs repair control attributes to userspace in
/sys/bus/edac/devices/<dev-name>/mem_repairX/.

The common memory repair control interface abstracts the control of
arbitrary memory repair functionality into a standardized set of functions.
The sysfs memory repair attribute nodes are only available if the client
driver has implemented the corresponding attribute callback function and
provided operations to the EDAC device driver during registration.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 .../ABI/testing/sysfs-edac-memory-repair      | 244 +++++++++
 Documentation/edac/features.rst               |   3 +
 Documentation/edac/index.rst                  |   1 +
 Documentation/edac/memory_repair.rst          | 101 ++++
 drivers/edac/Makefile                         |   2 +-
 drivers/edac/edac_device.c                    |  33 ++
 drivers/edac/mem_repair.c                     | 492 ++++++++++++++++++
 include/linux/edac.h                          | 139 +++++
 8 files changed, 1014 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
 create mode 100644 Documentation/edac/memory_repair.rst
 create mode 100755 drivers/edac/mem_repair.c

diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair
new file mode 100644
index 000000000000..e9268f3780ed
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
@@ -0,0 +1,244 @@
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
+		pertains to the memory media repair features control, such as
+		PPR (Post Package Repair), memory sparing etc, where <dev-name>
+		directory corresponds to a device registered with the EDAC
+		device driver for the memory repair features.
+
+		Post Package Repair is a maintenance operation that requests the
+		memory device to perform a repair operation on its media. It is
+		a memory self-healing feature that fixes a failing memory
+		location by replacing it with a spare row in a DRAM device. For
+		example, a CXL memory device with DRAM components that support
+		PPR features may implement PPR maintenance operations. DRAM
+		components may support two types of PPR functions: hard PPR, for
+		a permanent row repair, and soft PPR, for a temporary row
+		repair. Soft PPR is much faster than hard PPR, but the repair is
+		lost with a power cycle.
+
+		Memory sparing is a repair function that replaces a portion
+		of memory with a portion of functional memory at that same
+		sparing granularity. Memory sparing has cacheline/row/bank/rank
+		sparing granularities. For example, in memory-sparing mode,
+		one memory rank serves as a spare for other ranks on the same
+		channel in case they fail. The spare rank is held in reserve and
+		not used as active memory until a failure is indicated, with
+		reserved capacity subtracted from the total available memory
+		in the system. The DIMM installation order for memory sparing
+		varies based on the number of processors and memory modules
+		installed in the server. After an error threshold is surpassed
+		in a system protected by memory sparing, the content of a failing
+		rank of DIMMs is copied to the spare rank. The failing rank is
+		then taken offline and the spare rank placed online for use as
+		active memory in place of the failed rank.
+
+		The sysfs attributes nodes for a repair feature are only
+		present if the parent driver has implemented the corresponding
+		attr callback function and provided the necessary operations
+		to the EDAC device driver during registration.
+
+		In some states of system configuration (e.g. before address
+		decoders have been configured), memory devices (e.g. CXL)
+		may not have an active mapping in the main host physical
+		address map. As such, the memory to repair must be
+		identified by a device specific physical addressing scheme
+		using a device physical address (DPA). The DPA and other control
+		attributes to use will be presented in related error records.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RO) Memory repair function type. For eg. post package repair,
+		memory sparing etc.
+		EDAC_SOFT_PPR - Soft post package repair
+		EDAC_HARD_PPR - Hard post package repair
+		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
+		EDAC_ROW_MEM_SPARING - Row memory sparing
+		EDAC_BANK_MEM_SPARING - Bank memory sparing
+		EDAC_RANK_MEM_SPARING - Rank memory sparing
+		All other values are reserved.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Read/Write the current persist repair mode set for a
+		repair function. The persist repair modes supported by the
+		device depend on the memory repair function and indicate
+		whether the repair is temporary (lost with a power cycle)
+		or permanent.
+		EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
+		EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
+		All other values are reserved.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa_support
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RO) True if the memory device requires the device physical
+		address (DPA) of the memory to repair.
+		False if the memory device requires the host physical
+		address (HPA) of the memory to repair.
+		In some states of system configuration (e.g. before address
+		decoders have been configured), memory devices (e.g. CXL)
+		may not have an active mapping in the main host physical
+		address map. As such, the memory to repair must be
+		identified by a device specific physical addressing scheme
+		using a DPA. The device physical address (DPA) to use will be
+		presented in related error records.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RO) True if memory media is accessible and data is retained
+		during the memory repair operation.
+		The data may not be retained and memory requests may not be
+		correctly processed during a repair operation. In such cases,
+		the repair operation should not be executed at runtime.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/hpa
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Host Physical Address (HPA) of the memory to repair.
+		See attribute 'dpa_support' for more details.
+		The HPA to use will be provided in related error records.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Device Physical Address (DPA) of the memory to repair.
+		See attribute 'dpa_support' for more details.
+		The specific DPA to use will be provided in related error
+		records.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Read/Write Nibble mask of the memory to repair.
+		Nibble mask identifies one or more nibbles in error on the
+		memory bus that produced the error event. Nibble Mask bit 0
+		shall be set if nibble 0 on the memory bus produced the
+		event, etc. For example, for CXL PPR and sparing, a nibble
+		mask bit set to 1 indicates the request to perform the repair
+		operation in the specific device. All nibble mask bits set
+		to 1 indicates the request to perform the operation in all
+		devices. For CXL memory to repair, the specific value of the
+		nibble mask to use will be provided in related error records.
+		For more details, See nibble mask field in CXL spec ver 3.1,
+		section 8.2.9.7.1.2 Table 8-103 soft PPR and section
+		8.2.9.7.1.3 Table 8-104 hard PPR, section 8.2.9.7.1.4
+		Table 8-105 memory sparing.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/rank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/row
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/column
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/channel
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) The control attributes associated with memory address
+		that is to be repaired. The specific value of attributes to
+		use depends on the portion of memory to repair and may be
+		reported to host in related error records and may be
+		available to userspace in trace events, such as in CXL
+		memory devices.
+
+		channel - The channel of the memory to repair. Channel is
+		defined as an interface that can be independently accessed
+		for a transaction.
+		rank - The rank of the memory to repair. Rank is defined as a
+		set of memory devices on a channel that together execute a
+		transaction.
+		bank_group - The bank group of the memory to repair.
+		bank - The bank number of the memory to repair.
+		row - The row number of the memory to repair.
+		column - The column number of the memory to repair.
+		sub_channel - The sub-channel of the memory to repair.
+
+		The requirement to set these attributes varies based on the
+		repair function. The attributes in sysfs are not present
+		unless required for a repair function.
+		For example, for CXL spec ver 3.1, Section 8.2.9.7.1.2 Table
+		8-103 soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard PPR
+		operations, these attributes are not required to be set.
+		For example, for CXL spec ver 3.1, Section 8.2.9.7.1.4 Table
+		8-105 memory sparing, these attributes must be set based on
+		the memory sparing granularity as follows.
+		Channel: Channel associated with the DPA that is to be spared
+		and applies to all subclasses of sparing (cacheline, bank,
+		row and rank sparing).
+		Rank: Rank associated with the DPA that is to be spared and
+		applies to all subclasses of sparing.
+		Bank & Bank Group: Bank & bank group are associated with
+		the DPA that is to be spared and applies to cacheline sparing,
+		row sparing and bank sparing subclasses.
+		Row: Row associated with the DPA that is to be spared and
+		applies to cacheline sparing and row sparing subclasses.
+		Column: column associated with the DPA that is to be spared
+		and applies to cacheline sparing only.
+		Sub-channel: sub-channel associated with the DPA that is to
+		be spared and applies to cacheline sparing only.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) The supported range of control attributes (optional)
+		associated with a memory address that is to be repaired.
+		The memory device may expose the supported range of each
+		attribute; the ranges depend on the memory device and the
+		portion of memory to repair.
+		The userspace may receive the specific value of attributes
+		to use for a repair operation from the memory device via
+		related error records and trace events, such as in CXL
+		memory devices.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(WO) Issue the memory repair operation for the specified
+		memory repair attributes. The operation may fail if resources
+		are insufficient based on the requirements of the memory
+		device and repair function.
+		EDAC_DO_MEM_REPAIR - issue repair operation.
+		All other values are reserved.
diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
index ba3ab993ee4f..bfd5533b81b7 100644
--- a/Documentation/edac/features.rst
+++ b/Documentation/edac/features.rst
@@ -97,3 +97,6 @@ RAS features
 ------------
 1. Memory Scrub
 Memory scrub features are documented in `Documentation/edac/scrub.rst`.
+
+2. Memory Repair
+Memory repair features are documented in `Documentation/edac/memory_repair.rst`.
diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
index dfb0c9fb9ab1..d6778f4562dd 100644
--- a/Documentation/edac/index.rst
+++ b/Documentation/edac/index.rst
@@ -8,4 +8,5 @@ EDAC Subsystem
    :maxdepth: 1
 
    features
+   memory_repair
    scrub
diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst
new file mode 100644
index 000000000000..2787a8a2d6ba
--- /dev/null
+++ b/Documentation/edac/memory_repair.rst
@@ -0,0 +1,101 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+EDAC Memory Repair Control
+==========================
+
+Copyright (c) 2024 HiSilicon Limited.
+
+:Author:   Shiju Jose <shiju.jose@huawei.com>
+:License:  The GNU Free Documentation License, Version 1.2
+          (dual licensed under the GPL v2)
+:Original Reviewers:
+
+- Written for: 6.14
+
+Introduction
+------------
+Memory devices may support repair operations to address issues in their
+memory media. Post Package Repair (PPR) and memory sparing are examples
+of such features.
+
+Post Package Repair(PPR)
+~~~~~~~~~~~~~~~~~~~~~~~~
+Post Package Repair is a maintenance operation that requests the memory
+device to perform a repair operation on its media. It is a memory
+self-healing feature that fixes a failing memory location by replacing it
+with a spare row in a DRAM device. For example, a CXL memory device with
+DRAM components that support PPR features may implement PPR maintenance
+operations. DRAM components may support two types of PPR functions: hard
+PPR, for a permanent row repair, and soft PPR, for a temporary row repair.
+Soft PPR is much faster than hard PPR, but the repair is lost with a power
+cycle. The data may not be retained and memory requests may not be
+correctly processed during a repair operation. In such cases, the repair
+operation should not be executed at runtime.
+For example, CXL memory devices may support both soft PPR and hard PPR
+repair operations. See CXL spec rev 3.1 sections 8.2.9.7.1.1 PPR
+Maintenance Operations, 8.2.9.7.1.2 sPPR Maintenance Operation and
+8.2.9.7.1.3 hPPR Maintenance Operation for more details.
+
+Memory Sparing
+~~~~~~~~~~~~~~
+Memory sparing is a repair function that replaces a portion of memory with
+a portion of functional memory at that same sparing granularity. Memory
+sparing has cacheline/row/bank/rank sparing granularities. For example, in
+memory-sparing mode, one memory rank serves as a spare for other ranks on
+the same channel in case they fail. The spare rank is held in reserve and
+not used as active memory until a failure is indicated, with reserved
+capacity subtracted from the total available memory in the system. The DIMM
+installation order for memory sparing varies based on the number of processors
+and memory modules installed in the server. After an error threshold is
+surpassed in a system protected by memory sparing, the content of a failing
+rank of DIMMs is copied to the spare rank. The failing rank is then taken
+offline and the spare rank placed online for use as active memory in place
+of the failed rank.
+
+For example, CXL memory devices may support various subclasses of sparing
+operations, which vary in terms of the scope of the sparing being performed.
+Cacheline sparing subclass refers to a sparing action that can replace a
+full cacheline. Row sparing is provided as an alternative to PPR sparing
+functions and its scope is that of a single DDR row. Bank sparing allows
+an entire bank to be replaced. Rank sparing is defined as an operation
+in which an entire DDR rank is replaced. See CXL spec 3.1 section
+8.2.9.7.1.4 Memory Sparing Maintenance Operations for more details.
+
+Use cases of generic memory repair features control
+---------------------------------------------------
+
+1. The soft PPR, hard PPR and memory-sparing features share similar
+control attributes. Therefore, there is a need for a standardized, generic
+sysfs repair control that is exposed to userspace and used by
+administrators, scripts and tools.
+
+2. When a CXL device detects an error in a memory component, it may inform
+the host of the need for a repair maintenance operation by using an event
+record where the "maintenance needed" flag is set. The event record
+specifies the device physical address (DPA) and attributes of the memory that
+requires repair. The kernel reports the corresponding CXL general media or
+DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) initiate
+a repair maintenance operation in response to the device request using the
+sysfs repair control.
+
+3. Userspace tools, such as rasdaemon, may request a PPR/sparing on a memory
+region when an uncorrected memory error or an excess of corrected memory
+errors is reported on that memory.
+
+4. Multiple PPR/sparing instances may be present per memory device.
+
+The File System
+---------------
+
+The control attributes of a registered memory repair instance are
+accessible under
+
+/sys/bus/edac/devices/<dev-name>/mem_repairX/
+
+sysfs
+-----
+
+Sysfs files are documented in
+
+`Documentation/ABI/testing/sysfs-edac-memory-repair`.
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index 3a49304860f0..1de9fe66ac6b 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC)			:= edac_core.o
 
 edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
 edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
-edac_core-y	+= scrub.o ecs.o
+edac_core-y	+= scrub.o ecs.o mem_repair.o
 
 edac_core-$(CONFIG_EDAC_DEBUG)		+= debugfs.o
 
diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
index 1c1142a2e4e4..a401d81dad8a 100644
--- a/drivers/edac/edac_device.c
+++ b/drivers/edac/edac_device.c
@@ -575,6 +575,7 @@ static void edac_dev_release(struct device *dev)
 {
 	struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev);
 
+	kfree(ctx->mem_repair);
 	kfree(ctx->scrub);
 	kfree(ctx->dev.groups);
 	kfree(ctx);
@@ -611,6 +612,7 @@ int edac_dev_register(struct device *parent, char *name,
 	const struct attribute_group **ras_attr_groups;
 	struct edac_dev_data *dev_data;
 	struct edac_dev_feat_ctx *ctx;
+	int mem_repair_cnt = 0;
 	int attr_gcnt = 0;
 	int scrub_cnt = 0;
 	int ret, feat;
@@ -628,6 +630,10 @@ int edac_dev_register(struct device *parent, char *name,
 		case RAS_FEAT_ECS:
 			attr_gcnt += ras_features[feat].ecs_info.num_media_frus;
 			break;
+		case RAS_FEAT_MEM_REPAIR:
+			attr_gcnt++;
+			mem_repair_cnt++;
+			break;
 		default:
 			return -EINVAL;
 		}
@@ -651,8 +657,17 @@ int edac_dev_register(struct device *parent, char *name,
 		}
 	}
 
+	if (mem_repair_cnt) {
+		ctx->mem_repair = kcalloc(mem_repair_cnt, sizeof(*ctx->mem_repair), GFP_KERNEL);
+		if (!ctx->mem_repair) {
+			ret = -ENOMEM;
+			goto data_mem_free;
+		}
+	}
+
 	attr_gcnt = 0;
 	scrub_cnt = 0;
+	mem_repair_cnt = 0;
 	for (feat = 0; feat < num_features; feat++, ras_features++) {
 		switch (ras_features->ft_type) {
 		case RAS_FEAT_SCRUB:
@@ -686,6 +701,23 @@ int edac_dev_register(struct device *parent, char *name,
 
 			attr_gcnt += ras_features->ecs_info.num_media_frus;
 			break;
+		case RAS_FEAT_MEM_REPAIR:
+			if (!ras_features->mem_repair_ops ||
+			    mem_repair_cnt != ras_features->instance)
+				goto data_mem_free;
+
+			dev_data = &ctx->mem_repair[mem_repair_cnt];
+			dev_data->instance = mem_repair_cnt;
+			dev_data->mem_repair_ops = ras_features->mem_repair_ops;
+			dev_data->private = ras_features->ctx;
+			ret = edac_mem_repair_get_desc(parent, &ras_attr_groups[attr_gcnt],
+						       ras_features->instance);
+			if (ret)
+				goto data_mem_free;
+
+			mem_repair_cnt++;
+			attr_gcnt++;
+			break;
 		default:
 			ret = -EINVAL;
 			goto data_mem_free;
@@ -712,6 +744,7 @@ int edac_dev_register(struct device *parent, char *name,
 	return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev);
 
 data_mem_free:
+	kfree(ctx->mem_repair);
 	kfree(ctx->scrub);
 groups_free:
 	kfree(ras_attr_groups);
diff --git a/drivers/edac/mem_repair.c b/drivers/edac/mem_repair.c
new file mode 100755
index 000000000000..e7439fd26c41
--- /dev/null
+++ b/drivers/edac/mem_repair.c
@@ -0,0 +1,492 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * The generic EDAC memory repair driver is designed to control the memory
+ * devices with memory repair features, such as Post Package Repair (PPR),
+ * memory sparing etc. The common sysfs memory repair interface abstracts
+ * the control of various arbitrary memory repair functionalities into a
+ * unified set of functions.
+ *
+ * Copyright (c) 2024 HiSilicon Limited.
+ */
+
+#include <linux/edac.h>
+
+enum edac_mem_repair_attributes {
+	MEM_REPAIR_FUNCTION,
+	MEM_REPAIR_PERSIST_MODE,
+	MEM_REPAIR_DPA_SUPPORT,
+	MEM_REPAIR_SAFE_IN_USE,
+	MEM_REPAIR_HPA,
+	MEM_REPAIR_MIN_HPA,
+	MEM_REPAIR_MAX_HPA,
+	MEM_REPAIR_DPA,
+	MEM_REPAIR_MIN_DPA,
+	MEM_REPAIR_MAX_DPA,
+	MEM_REPAIR_NIBBLE_MASK,
+	MEM_REPAIR_MIN_NIBBLE_MASK,
+	MEM_REPAIR_MAX_NIBBLE_MASK,
+	MEM_REPAIR_BANK_GROUP,
+	MEM_REPAIR_MIN_BANK_GROUP,
+	MEM_REPAIR_MAX_BANK_GROUP,
+	MEM_REPAIR_BANK,
+	MEM_REPAIR_MIN_BANK,
+	MEM_REPAIR_MAX_BANK,
+	MEM_REPAIR_RANK,
+	MEM_REPAIR_MIN_RANK,
+	MEM_REPAIR_MAX_RANK,
+	MEM_REPAIR_ROW,
+	MEM_REPAIR_MIN_ROW,
+	MEM_REPAIR_MAX_ROW,
+	MEM_REPAIR_COLUMN,
+	MEM_REPAIR_MIN_COLUMN,
+	MEM_REPAIR_MAX_COLUMN,
+	MEM_REPAIR_CHANNEL,
+	MEM_REPAIR_MIN_CHANNEL,
+	MEM_REPAIR_MAX_CHANNEL,
+	MEM_REPAIR_SUB_CHANNEL,
+	MEM_REPAIR_MIN_SUB_CHANNEL,
+	MEM_REPAIR_MAX_SUB_CHANNEL,
+	MEM_DO_REPAIR,
+	MEM_REPAIR_MAX_ATTRS
+};
+
+struct edac_mem_repair_dev_attr {
+	struct device_attribute dev_attr;
+	u8 instance;
+};
+
+struct edac_mem_repair_context {
+	char name[EDAC_FEAT_NAME_LEN];
+	struct edac_mem_repair_dev_attr mem_repair_dev_attr[MEM_REPAIR_MAX_ATTRS];
+	struct attribute *mem_repair_attrs[MEM_REPAIR_MAX_ATTRS + 1];
+	struct attribute_group group;
+};
+
+#define TO_MEM_REPAIR_DEV_ATTR(_dev_attr)      \
+		container_of(_dev_attr, struct edac_mem_repair_dev_attr, dev_attr)
+
+#define EDAC_MEM_REPAIR_ATTR_SHOW(attrib, cb, type, format)			\
+static ssize_t attrib##_show(struct device *ras_feat_dev,			\
+			     struct device_attribute *attr, char *buf)		\
+{										\
+	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
+	const struct edac_mem_repair_ops *ops =					\
+				ctx->mem_repair[inst].mem_repair_ops;		\
+	type data;								\
+	int ret;								\
+										\
+	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
+		      &data);							\
+	if (ret)								\
+		return ret;							\
+										\
+	return sysfs_emit(buf, format, data);					\
+}
+
+EDAC_MEM_REPAIR_ATTR_SHOW(repair_function, get_repair_function, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(persist_mode, get_persist_mode, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(dpa_support, get_dpa_support, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(repair_safe_when_in_use, get_repair_safe_when_in_use, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(hpa, get_hpa, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_hpa, get_min_hpa, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_hpa, get_max_hpa, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(nibble_mask, get_nibble_mask, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_nibble_mask, get_min_nibble_mask, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_nibble_mask, get_max_nibble_mask, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(bank_group, get_bank_group, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_bank_group, get_min_bank_group, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_bank_group, get_max_bank_group, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(bank, get_bank, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_bank, get_min_bank, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_bank, get_max_bank, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(rank, get_rank, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_rank, get_min_rank, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_rank, get_max_rank, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(row, get_row, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_row, get_min_row, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_row, get_max_row, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(column, get_column, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_column, get_min_column, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_column, get_max_column, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(channel, get_channel, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_channel, get_min_channel, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_channel, get_max_channel, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(sub_channel, get_sub_channel, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_sub_channel, get_min_sub_channel, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_sub_channel, get_max_sub_channel, u32, "%u\n")
+
+#define EDAC_MEM_REPAIR_ATTR_STORE(attrib, cb, type, conv_func)			\
+static ssize_t attrib##_store(struct device *ras_feat_dev,			\
+			      struct device_attribute *attr,			\
+			      const char *buf, size_t len)			\
+{										\
+	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
+	const struct edac_mem_repair_ops *ops =					\
+				ctx->mem_repair[inst].mem_repair_ops;		\
+	type data;								\
+	int ret;								\
+										\
+	ret = conv_func(buf, 0, &data);						\
+	if (ret < 0)								\
+		return ret;							\
+										\
+	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
+		      data);							\
+	if (ret)								\
+		return ret;							\
+										\
+	return len;								\
+}
+
+EDAC_MEM_REPAIR_ATTR_STORE(persist_mode, set_persist_mode, unsigned long, kstrtoul)
+EDAC_MEM_REPAIR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64)
+EDAC_MEM_REPAIR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64)
+EDAC_MEM_REPAIR_ATTR_STORE(nibble_mask, set_nibble_mask, u64, kstrtou64)
+EDAC_MEM_REPAIR_ATTR_STORE(bank_group, set_bank_group, unsigned long, kstrtoul)
+EDAC_MEM_REPAIR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul)
+EDAC_MEM_REPAIR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul)
+EDAC_MEM_REPAIR_ATTR_STORE(row, set_row, u64, kstrtou64)
+EDAC_MEM_REPAIR_ATTR_STORE(column, set_column, unsigned long, kstrtoul)
+EDAC_MEM_REPAIR_ATTR_STORE(channel, set_channel, unsigned long, kstrtoul)
+EDAC_MEM_REPAIR_ATTR_STORE(sub_channel, set_sub_channel, unsigned long, kstrtoul)
+
+#define EDAC_MEM_REPAIR_DO_OP(attrib, cb)						\
+static ssize_t attrib##_store(struct device *ras_feat_dev,				\
+			      struct device_attribute *attr,				\
+			      const char *buf, size_t len)				\
+{											\
+	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;				\
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);			\
+	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;	\
+	unsigned long data;								\
+	int ret;									\
+											\
+	ret = kstrtoul(buf, 0, &data);							\
+	if (ret < 0)									\
+		return ret;								\
+											\
+	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, data);	\
+	if (ret)									\
+		return ret;								\
+											\
+	return len;									\
+}
+
+EDAC_MEM_REPAIR_DO_OP(repair, do_repair)
+
+static umode_t mem_repair_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
+{
+	struct device *ras_feat_dev = kobj_to_dev(kobj);
+	struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr);
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
+	u8 inst = TO_MEM_REPAIR_DEV_ATTR(dev_attr)->instance;
+	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;
+
+	switch (attr_id) {
+	case MEM_REPAIR_FUNCTION:
+		if (ops->get_repair_function)
+			return a->mode;
+		break;
+	case MEM_REPAIR_PERSIST_MODE:
+		if (ops->get_persist_mode) {
+			if (ops->set_persist_mode)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_DPA_SUPPORT:
+		if (ops->get_dpa_support)
+			return a->mode;
+		break;
+	case MEM_REPAIR_SAFE_IN_USE:
+		if (ops->get_repair_safe_when_in_use)
+			return a->mode;
+		break;
+	case MEM_REPAIR_HPA:
+		if (ops->get_hpa) {
+			if (ops->set_hpa)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_HPA:
+		if (ops->get_min_hpa)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_HPA:
+		if (ops->get_max_hpa)
+			return a->mode;
+		break;
+	case MEM_REPAIR_DPA:
+		if (ops->get_dpa) {
+			if (ops->set_dpa)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_DPA:
+		if (ops->get_min_dpa)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_DPA:
+		if (ops->get_max_dpa)
+			return a->mode;
+		break;
+	case MEM_REPAIR_NIBBLE_MASK:
+		if (ops->get_nibble_mask) {
+			if (ops->set_nibble_mask)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_NIBBLE_MASK:
+		if (ops->get_min_nibble_mask)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_NIBBLE_MASK:
+		if (ops->get_max_nibble_mask)
+			return a->mode;
+		break;
+	case MEM_REPAIR_BANK_GROUP:
+		if (ops->get_bank_group) {
+			if (ops->set_bank_group)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_BANK_GROUP:
+		if (ops->get_min_bank_group)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_BANK_GROUP:
+		if (ops->get_max_bank_group)
+			return a->mode;
+		break;
+	case MEM_REPAIR_BANK:
+		if (ops->get_bank) {
+			if (ops->set_bank)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_BANK:
+		if (ops->get_min_bank)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_BANK:
+		if (ops->get_max_bank)
+			return a->mode;
+		break;
+	case MEM_REPAIR_RANK:
+		if (ops->get_rank) {
+			if (ops->set_rank)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_RANK:
+		if (ops->get_min_rank)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_RANK:
+		if (ops->get_max_rank)
+			return a->mode;
+		break;
+	case MEM_REPAIR_ROW:
+		if (ops->get_row) {
+			if (ops->set_row)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_ROW:
+		if (ops->get_min_row)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_ROW:
+		if (ops->get_max_row)
+			return a->mode;
+		break;
+	case MEM_REPAIR_COLUMN:
+		if (ops->get_column) {
+			if (ops->set_column)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_COLUMN:
+		if (ops->get_min_column)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_COLUMN:
+		if (ops->get_max_column)
+			return a->mode;
+		break;
+	case MEM_REPAIR_CHANNEL:
+		if (ops->get_channel) {
+			if (ops->set_channel)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_CHANNEL:
+		if (ops->get_min_channel)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_CHANNEL:
+		if (ops->get_max_channel)
+			return a->mode;
+		break;
+	case MEM_REPAIR_SUB_CHANNEL:
+		if (ops->get_sub_channel) {
+			if (ops->set_sub_channel)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_SUB_CHANNEL:
+		if (ops->get_min_sub_channel)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_SUB_CHANNEL:
+		if (ops->get_max_sub_channel)
+			return a->mode;
+		break;
+	case MEM_DO_REPAIR:
+		if (ops->do_repair)
+			return a->mode;
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+#define EDAC_MEM_REPAIR_ATTR_RO(_name, _instance)       \
+	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RO(_name), \
+					     .instance = _instance })
+
+#define EDAC_MEM_REPAIR_ATTR_WO(_name, _instance)       \
+	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_WO(_name), \
+					     .instance = _instance })
+
+#define EDAC_MEM_REPAIR_ATTR_RW(_name, _instance)       \
+	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RW(_name), \
+					     .instance = _instance })
+
+static int mem_repair_create_desc(struct device *dev,
+				  const struct attribute_group **attr_groups,
+				  u8 instance)
+{
+	struct edac_mem_repair_context *ctx;
+	struct attribute_group *group;
+	int i;
+	struct edac_mem_repair_dev_attr dev_attr[] = {
+		[MEM_REPAIR_FUNCTION] = EDAC_MEM_REPAIR_ATTR_RO(repair_function,
+							    instance),
+		[MEM_REPAIR_PERSIST_MODE] =
+				EDAC_MEM_REPAIR_ATTR_RW(persist_mode, instance),
+		[MEM_REPAIR_DPA_SUPPORT] =
+				EDAC_MEM_REPAIR_ATTR_RO(dpa_support, instance),
+		[MEM_REPAIR_SAFE_IN_USE] =
+				EDAC_MEM_REPAIR_ATTR_RO(repair_safe_when_in_use,
+							instance),
+		[MEM_REPAIR_HPA] = EDAC_MEM_REPAIR_ATTR_RW(hpa, instance),
+		[MEM_REPAIR_MIN_HPA] = EDAC_MEM_REPAIR_ATTR_RO(min_hpa, instance),
+		[MEM_REPAIR_MAX_HPA] = EDAC_MEM_REPAIR_ATTR_RO(max_hpa, instance),
+		[MEM_REPAIR_DPA] = EDAC_MEM_REPAIR_ATTR_RW(dpa, instance),
+		[MEM_REPAIR_MIN_DPA] = EDAC_MEM_REPAIR_ATTR_RO(min_dpa, instance),
+		[MEM_REPAIR_MAX_DPA] = EDAC_MEM_REPAIR_ATTR_RO(max_dpa, instance),
+		[MEM_REPAIR_NIBBLE_MASK] =
+				EDAC_MEM_REPAIR_ATTR_RW(nibble_mask, instance),
+		[MEM_REPAIR_MIN_NIBBLE_MASK] =
+				EDAC_MEM_REPAIR_ATTR_RO(min_nibble_mask, instance),
+		[MEM_REPAIR_MAX_NIBBLE_MASK] =
+				EDAC_MEM_REPAIR_ATTR_RO(max_nibble_mask, instance),
+		[MEM_REPAIR_BANK_GROUP] =
+				EDAC_MEM_REPAIR_ATTR_RW(bank_group, instance),
+		[MEM_REPAIR_MIN_BANK_GROUP] =
+				EDAC_MEM_REPAIR_ATTR_RO(min_bank_group, instance),
+		[MEM_REPAIR_MAX_BANK_GROUP] =
+				EDAC_MEM_REPAIR_ATTR_RO(max_bank_group, instance),
+		[MEM_REPAIR_BANK] = EDAC_MEM_REPAIR_ATTR_RW(bank, instance),
+		[MEM_REPAIR_MIN_BANK] = EDAC_MEM_REPAIR_ATTR_RO(min_bank, instance),
+		[MEM_REPAIR_MAX_BANK] = EDAC_MEM_REPAIR_ATTR_RO(max_bank, instance),
+		[MEM_REPAIR_RANK] = EDAC_MEM_REPAIR_ATTR_RW(rank, instance),
+		[MEM_REPAIR_MIN_RANK] = EDAC_MEM_REPAIR_ATTR_RO(min_rank, instance),
+		[MEM_REPAIR_MAX_RANK] = EDAC_MEM_REPAIR_ATTR_RO(max_rank, instance),
+		[MEM_REPAIR_ROW] = EDAC_MEM_REPAIR_ATTR_RW(row, instance),
+		[MEM_REPAIR_MIN_ROW] = EDAC_MEM_REPAIR_ATTR_RO(min_row, instance),
+		[MEM_REPAIR_MAX_ROW] = EDAC_MEM_REPAIR_ATTR_RO(max_row, instance),
+		[MEM_REPAIR_COLUMN] = EDAC_MEM_REPAIR_ATTR_RW(column, instance),
+		[MEM_REPAIR_MIN_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(min_column, instance),
+		[MEM_REPAIR_MAX_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(max_column, instance),
+		[MEM_REPAIR_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RW(channel, instance),
+		[MEM_REPAIR_MIN_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(min_channel, instance),
+		[MEM_REPAIR_MAX_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(max_channel, instance),
+		[MEM_REPAIR_SUB_CHANNEL] =
+				EDAC_MEM_REPAIR_ATTR_RW(sub_channel, instance),
+		[MEM_REPAIR_MIN_SUB_CHANNEL] =
+				EDAC_MEM_REPAIR_ATTR_RO(min_sub_channel, instance),
+		[MEM_REPAIR_MAX_SUB_CHANNEL] =
+				EDAC_MEM_REPAIR_ATTR_RO(max_sub_channel, instance),
+		[MEM_DO_REPAIR] = EDAC_MEM_REPAIR_ATTR_WO(repair, instance)
+	};
+
+	ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	for (i = 0; i < MEM_REPAIR_MAX_ATTRS; i++) {
+		memcpy(&ctx->mem_repair_dev_attr[i].dev_attr,
+		       &dev_attr[i], sizeof(dev_attr[i]));
+		ctx->mem_repair_attrs[i] =
+				&ctx->mem_repair_dev_attr[i].dev_attr.attr;
+	}
+
+	sprintf(ctx->name, "%s%d", "mem_repair", instance);
+	group = &ctx->group;
+	group->name = ctx->name;
+	group->attrs = ctx->mem_repair_attrs;
+	group->is_visible = mem_repair_attr_visible;
+	attr_groups[0] = group;
+
+	return 0;
+}
+
+/**
+ * edac_mem_repair_get_desc - get EDAC memory repair descriptors
+ * @dev: client device with memory repair feature
+ * @attr_groups: pointer to attribute group container
+ * @instance: device's memory repair instance number.
+ *
+ * Return:
+ *  * %0	- Success.
+ *  * %-EINVAL	- Invalid parameters passed.
+ *  * %-ENOMEM	- Dynamic memory allocation failed.
+ */
+int edac_mem_repair_get_desc(struct device *dev,
+			     const struct attribute_group **attr_groups, u8 instance)
+{
+	if (!dev || !attr_groups)
+		return -EINVAL;
+
+	return mem_repair_create_desc(dev, attr_groups, instance);
+}
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 979e91426701..5d07192bf1a7 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -668,6 +668,7 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
 enum edac_dev_feat {
 	RAS_FEAT_SCRUB,
 	RAS_FEAT_ECS,
+	RAS_FEAT_MEM_REPAIR,
 	RAS_FEAT_MAX
 };
 
@@ -729,11 +730,147 @@ int edac_ecs_get_desc(struct device *ecs_dev,
 		      const struct attribute_group **attr_groups,
 		      u16 num_media_frus);
 
+enum edac_mem_repair_function {
+	EDAC_SOFT_PPR,
+	EDAC_HARD_PPR,
+	EDAC_CACHELINE_MEM_SPARING,
+	EDAC_ROW_MEM_SPARING,
+	EDAC_BANK_MEM_SPARING,
+	EDAC_RANK_MEM_SPARING,
+};
+
+enum edac_mem_repair_persist_mode {
+	EDAC_MEM_REPAIR_SOFT, /* soft memory repair */
+	EDAC_MEM_REPAIR_HARD, /* hard memory repair */
+};
+
+enum edac_mem_repair_cmd {
+	EDAC_DO_MEM_REPAIR = 1,
+};
+
+/**
+ * struct edac_mem_repair_ops - memory repair operations
+ * (all elements are optional except @do_repair and at least one of @set_hpa or @set_dpa)
+ * @get_repair_function: get the memory repair function, listed in
+ *			 enum edac_mem_repair_function.
+ * @get_persist_mode: get the current persist mode. The persist repair modes
+ *		      supported by the device depend on the memory repair function,
+ *		      i.e. whether a repair is temporary (lost with a power cycle)
+ *		      or permanent.
+ *		      EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
+ *		      EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
+ *		      All other values are reserved.
+ * @set_persist_mode: set the persist mode of the memory repair instance.
+ * @get_dpa_support: get DPA support flag. In some states of system configuration
+ *		     (e.g. before address decoders have been configured), memory devices
+ *		     (e.g. CXL) may not have an active mapping in the host physical
+ *		     address map. As such, the memory to repair must be identified by a
+ *		     device specific physical addressing scheme using a device physical
+ *		     address (DPA). The DPA and other control attributes to use for the
+ *		     repair operation will be presented in related error records.
+ * @get_repair_safe_when_in_use: get whether memory media is accessible and
+ *				 data is retained during repair operation.
+ * @get_hpa: get current host physical address (HPA).
+ * @set_hpa: set host physical address (HPA) of memory to repair.
+ * @get_min_hpa: get the minimum supported host physical address (HPA).
+ * @get_max_hpa: get the maximum supported host physical address (HPA).
+ * @get_dpa: get current device physical address (DPA).
+ * @set_dpa: set device physical address (DPA) of memory to repair.
+ * @get_min_dpa: get the minimum supported device physical address (DPA).
+ * @get_max_dpa: get the maximum supported device physical address (DPA).
+ * @get_nibble_mask: get current nibble mask.
+ * @set_nibble_mask: set nibble mask of memory to repair.
+ * @get_min_nibble_mask: get the minimum supported nibble mask.
+ * @get_max_nibble_mask: get the maximum supported nibble mask.
+ * @get_bank_group: get current bank group.
+ * @set_bank_group: set bank group of memory to repair.
+ * @get_min_bank_group: get the minimum supported bank group.
+ * @get_max_bank_group: get the maximum supported bank group.
+ * @get_bank: get current bank.
+ * @set_bank: set bank of memory to repair.
+ * @get_min_bank: get the minimum supported bank.
+ * @get_max_bank: get the maximum supported bank.
+ * @get_rank: get current rank.
+ * @set_rank: set rank of memory to repair.
+ * @get_min_rank: get the minimum supported rank.
+ * @get_max_rank: get the maximum supported rank.
+ * @get_row: get current row.
+ * @set_row: set row of memory to repair.
+ * @get_min_row: get the minimum supported row.
+ * @get_max_row: get the maximum supported row.
+ * @get_column: get current column.
+ * @set_column: set column of memory to repair.
+ * @get_min_column: get the minimum supported column.
+ * @get_max_column: get the maximum supported column.
+ * @get_channel: get current channel.
+ * @set_channel: set channel of memory to repair.
+ * @get_min_channel: get the minimum supported channel.
+ * @get_max_channel: get the maximum supported channel.
+ * @get_sub_channel: get current sub channel.
+ * @set_sub_channel: set sub channel of memory to repair.
+ * @get_min_sub_channel: get the minimum supported sub channel.
+ * @get_max_sub_channel: get the maximum supported sub channel.
+ * @do_repair: Issue memory repair operation for the HPA/DPA and
+ *	       other control attributes set for the memory to repair.
+ */
+struct edac_mem_repair_ops {
+	int (*get_repair_function)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_persist_mode)(struct device *dev, void *drv_data, u32 *mode);
+	int (*set_persist_mode)(struct device *dev, void *drv_data, u32 mode);
+	int (*get_dpa_support)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_repair_safe_when_in_use)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_hpa)(struct device *dev, void *drv_data, u64 *hpa);
+	int (*set_hpa)(struct device *dev, void *drv_data, u64 hpa);
+	int (*get_min_hpa)(struct device *dev, void *drv_data, u64 *hpa);
+	int (*get_max_hpa)(struct device *dev, void *drv_data, u64 *hpa);
+	int (*get_dpa)(struct device *dev, void *drv_data, u64 *dpa);
+	int (*set_dpa)(struct device *dev, void *drv_data, u64 dpa);
+	int (*get_min_dpa)(struct device *dev, void *drv_data, u64 *dpa);
+	int (*get_max_dpa)(struct device *dev, void *drv_data, u64 *dpa);
+	int (*get_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
+	int (*set_nibble_mask)(struct device *dev, void *drv_data, u64 val);
+	int (*get_min_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
+	int (*get_max_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
+	int (*get_bank_group)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_bank_group)(struct device *dev, void *drv_data, u32 val);
+	int (*get_min_bank_group)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_max_bank_group)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_bank)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_bank)(struct device *dev, void *drv_data, u32 val);
+	int (*get_min_bank)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_max_bank)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_rank)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_rank)(struct device *dev, void *drv_data, u32 val);
+	int (*get_min_rank)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_max_rank)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_row)(struct device *dev, void *drv_data, u64 *val);
+	int (*set_row)(struct device *dev, void *drv_data, u64 val);
+	int (*get_min_row)(struct device *dev, void *drv_data, u64 *val);
+	int (*get_max_row)(struct device *dev, void *drv_data, u64 *val);
+	int (*get_column)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_column)(struct device *dev, void *drv_data, u32 val);
+	int (*get_min_column)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_max_column)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_channel)(struct device *dev, void *drv_data, u32 val);
+	int (*get_min_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_max_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_sub_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_sub_channel)(struct device *dev, void *drv_data, u32 val);
+	int (*get_min_sub_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_max_sub_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*do_repair)(struct device *dev, void *drv_data, u32 val);
+};
+
+int edac_mem_repair_get_desc(struct device *dev,
+			     const struct attribute_group **attr_groups,
+			     u8 instance);
+
 /* EDAC device feature information structure */
 struct edac_dev_data {
 	union {
 		const struct edac_scrub_ops *scrub_ops;
 		const struct edac_ecs_ops *ecs_ops;
+		const struct edac_mem_repair_ops *mem_repair_ops;
 	};
 	u8 instance;
 	void *private;
@@ -744,6 +881,7 @@ struct edac_dev_feat_ctx {
 	void *private;
 	struct edac_dev_data *scrub;
 	struct edac_dev_data ecs;
+	struct edac_dev_data *mem_repair;
 };
 
 struct edac_dev_feature {
@@ -752,6 +890,7 @@ struct edac_dev_feature {
 	union {
 		const struct edac_scrub_ops *scrub_ops;
 		const struct edac_ecs_ops *ecs_ops;
+		const struct edac_mem_repair_ops *mem_repair_ops;
 	};
 	void *ctx;
 	struct edac_ecs_ex_info ecs_info;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v18 05/19] ACPI:RAS2: Add ACPI RAS2 driver
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (3 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 04/19] EDAC: Add memory repair " shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-21 23:01   ` Daniel Ferguson
  2025-01-30 19:19   ` Daniel Ferguson
  2025-01-06 12:10 ` [PATCH v18 06/19] ras: mem: Add memory " shiju.jose
                   ` (15 subsequent siblings)
  20 siblings, 2 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Add support for ACPI RAS2 feature table (RAS2) defined in the
ACPI 6.5 Specification, section 5.2.21.
The driver extracts the RAS2 table at init and adds an auxiliary
device for each memory feature, which binds to the RAS2 memory
driver.

The driver uses a PCC mailbox to communicate with the ACPI HW and
adds OSPM interfaces to send RAS2 commands.

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Co-developed-by: A Somasundaram <somasundaram.a@hpe.com>
Signed-off-by: A Somasundaram <somasundaram.a@hpe.com>
Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 drivers/acpi/Kconfig     |  11 ++
 drivers/acpi/Makefile    |   1 +
 drivers/acpi/ras2.c      | 407 +++++++++++++++++++++++++++++++++++++++
 include/acpi/ras2_acpi.h |  45 +++++
 4 files changed, 464 insertions(+)
 create mode 100644 drivers/acpi/ras2.c
 create mode 100644 include/acpi/ras2_acpi.h

diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index d81b55f5068c..bae9a47c829d 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -293,6 +293,17 @@ config ACPI_CPPC_LIB
 	  If your platform does not support CPPC in firmware,
 	  leave this option disabled.
 
+config ACPI_RAS2
+	bool "ACPI RAS2 driver"
+	select AUXILIARY_BUS
+	select MAILBOX
+	select PCC
+	help
+	  The driver adds support for the ACPI RAS2 feature table (the RAS2
+	  table is extracted from the ACPI tables) and OSPM interfaces to
+	  send RAS2 commands via a PCC mailbox subspace. The driver adds an
+	  auxiliary device for each RAS2 memory feature, which binds to the
+	  RAS2 memory driver.
+
 config ACPI_PROCESSOR
 	tristate "Processor"
 	depends on X86 || ARM64 || LOONGARCH || RISCV
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index 40208a0f5dfb..797b38cdc2f3 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_ACPI_EC_DEBUGFS)	+= ec_sys.o
 obj-$(CONFIG_ACPI_BGRT)		+= bgrt.o
 obj-$(CONFIG_ACPI_CPPC_LIB)	+= cppc_acpi.o
 obj-$(CONFIG_ACPI_SPCR_TABLE)	+= spcr.o
+obj-$(CONFIG_ACPI_RAS2)		+= ras2.o
 obj-$(CONFIG_ACPI_DEBUGGER_USER) += acpi_dbg.o
 obj-$(CONFIG_ACPI_PPTT) 	+= pptt.o
 obj-$(CONFIG_ACPI_PFRUT)	+= pfr_update.o pfr_telemetry.o
diff --git a/drivers/acpi/ras2.c b/drivers/acpi/ras2.c
new file mode 100644
index 000000000000..50f7f6684393
--- /dev/null
+++ b/drivers/acpi/ras2.c
@@ -0,0 +1,407 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Implementation of ACPI RAS2 driver.
+ *
+ * Copyright (c) 2024 HiSilicon Limited.
+ *
+ * Support for RAS2 - ACPI 6.5 Specification, section 5.2.21
+ *
+ * The driver contains ACPI RAS2 init, which extracts the ACPI RAS2 table
+ * and gets the PCC channel subspace for communicating with the ACPI
+ * compliant HW platform that supports ACPI RAS2. The driver adds an
+ * auxiliary device for each RAS2 memory feature, which binds to the RAS2
+ * memory driver.
+ */
+
+#define pr_fmt(fmt)    "ACPI RAS2: " fmt
+
+#include <linux/delay.h>
+#include <linux/export.h>
+#include <linux/ktime.h>
+#include <linux/platform_device.h>
+#include <acpi/pcc.h>
+#include <acpi/ras2_acpi.h>
+
+/* Data structure for PCC communication */
+struct ras2_pcc_subspace {
+	int pcc_subspace_id;
+	struct mbox_client mbox_client;
+	struct pcc_mbox_chan *pcc_chan;
+	struct acpi_ras2_shared_memory __iomem *pcc_comm_addr;
+	bool pcc_channel_acquired;
+	ktime_t deadline;
+	unsigned int pcc_mpar;
+	unsigned int pcc_mrtt;
+	struct list_head elem;
+	u16 ref_count;
+};
+
+/*
+ * Arbitrary retry count for PCC commands, since the remote
+ * processor could be much slower to reply.
+ */
+#define RAS2_NUM_RETRIES 600
+
+#define RAS2_FEATURE_TYPE_MEMORY        0x00
+
+/* global variables for the RAS2 PCC subspaces */
+static DEFINE_MUTEX(ras2_pcc_subspace_lock);
+static LIST_HEAD(ras2_pcc_subspaces);
+
+static int ras2_report_cap_error(u32 cap_status)
+{
+	switch (cap_status) {
+	case ACPI_RAS2_NOT_VALID:
+	case ACPI_RAS2_NOT_SUPPORTED:
+		return -EPERM;
+	case ACPI_RAS2_BUSY:
+		return -EBUSY;
+	case ACPI_RAS2_FAILED:
+	case ACPI_RAS2_ABORTED:
+	case ACPI_RAS2_INVALID_DATA:
+		return -EINVAL;
+	default: /* 0 or other, Success */
+		return 0;
+	}
+}
+
+static int ras2_check_pcc_chan(struct ras2_pcc_subspace *pcc_subspace)
+{
+	struct acpi_ras2_shared_memory __iomem *generic_comm_base = pcc_subspace->pcc_comm_addr;
+	ktime_t next_deadline = ktime_add(ktime_get(), pcc_subspace->deadline);
+	u32 cap_status;
+	u16 status;
+	int ret;
+
+	while (!ktime_after(ktime_get(), next_deadline)) {
+		/*
+		 * As per ACPI spec, the PCC space will be initialized by
+		 * platform and should have set the command completion bit when
+		 * PCC can be used by OSPM
+		 */
+		status = readw_relaxed(&generic_comm_base->status);
+		if (status & RAS2_PCC_CMD_ERROR) {
+			cap_status = readw_relaxed(&generic_comm_base->set_capabilities_status);
+			ret = ras2_report_cap_error(cap_status);
+
+			status &= ~RAS2_PCC_CMD_ERROR;
+			writew_relaxed(status, &generic_comm_base->status);
+			return ret;
+		}
+		if (status & RAS2_PCC_CMD_COMPLETE)
+			return 0;
+		/*
+		 * Reducing the bus traffic in case this loop takes longer than
+		 * a few retries.
+		 */
+		msleep(10);
+	}
+
+	return -EIO;
+}
+
+/**
+ * ras2_send_pcc_cmd() - Send RAS2 command via PCC channel
+ * @ras2_ctx:	pointer to the RAS2 context structure
+ * @cmd:	command to send
+ *
+ * Returns: 0 on success, an error otherwise
+ */
+int ras2_send_pcc_cmd(struct ras2_mem_ctx *ras2_ctx, u16 cmd)
+{
+	struct ras2_pcc_subspace *pcc_subspace = ras2_ctx->pcc_subspace;
+	struct acpi_ras2_shared_memory __iomem *generic_comm_base = pcc_subspace->pcc_comm_addr;
+	static ktime_t last_cmd_cmpl_time, last_mpar_reset;
+	struct mbox_chan *pcc_channel;
+	unsigned int time_delta;
+	static int mpar_count;
+	int ret;
+
+	guard(mutex)(&ras2_pcc_subspace_lock);
+	ret = ras2_check_pcc_chan(pcc_subspace);
+	if (ret < 0)
+		return ret;
+	pcc_channel = pcc_subspace->pcc_chan->mchan;
+
+	/*
+	 * Handle the Minimum Request Turnaround Time(MRTT)
+	 * "The minimum amount of time that OSPM must wait after the completion
+	 * of a command before issuing the next command, in microseconds"
+	 */
+	if (pcc_subspace->pcc_mrtt) {
+		time_delta = ktime_us_delta(ktime_get(), last_cmd_cmpl_time);
+		if (pcc_subspace->pcc_mrtt > time_delta)
+			udelay(pcc_subspace->pcc_mrtt - time_delta);
+	}
+
+	/*
+	 * Handle the non-zero Maximum Periodic Access Rate(MPAR)
+	 * "The maximum number of periodic requests that the subspace channel can
+	 * support, reported in commands per minute. 0 indicates no limitation."
+	 *
+	 * This parameter should be ideally zero or large enough so that it can
+	 * handle maximum number of requests that all the cores in the system can
+	 * collectively generate. If it is not, we will follow the spec and just
+	 * not send the request to the platform after hitting the MPAR limit in
+	 * any 60s window
+	 */
+	if (pcc_subspace->pcc_mpar) {
+		if (mpar_count == 0) {
+			time_delta = ktime_ms_delta(ktime_get(), last_mpar_reset);
+			if (time_delta < 60 * MSEC_PER_SEC) {
+				dev_dbg(ras2_ctx->dev,
+					"PCC cmd not sent due to MPAR limit\n");
+				return -EIO;
+			}
+			last_mpar_reset = ktime_get();
+			mpar_count = pcc_subspace->pcc_mpar;
+		}
+		mpar_count--;
+	}
+
+	/* Write to the shared comm region. */
+	writew_relaxed(cmd, &generic_comm_base->command);
+
+	/* Clear the CMD COMPLETE bit; the platform sets it on completion */
+	writew_relaxed(0, &generic_comm_base->status);
+
+	/* Ring doorbell */
+	ret = mbox_send_message(pcc_channel, &cmd);
+	if (ret < 0) {
+		dev_err(ras2_ctx->dev,
+			"Err sending PCC mbox message. cmd:%d, ret:%d\n",
+			cmd, ret);
+		return ret;
+	}
+
+	/*
+	 * If Minimum Request Turnaround Time is non-zero, we need
+	 * to record the completion time of both READ and WRITE
+	 * command for proper handling of MRTT, so we need to check
+	 * for pcc_mrtt in addition to CMD_READ
+	 */
+	if (cmd == RAS2_PCC_CMD_EXEC || pcc_subspace->pcc_mrtt) {
+		ret = ras2_check_pcc_chan(pcc_subspace);
+		if (pcc_subspace->pcc_mrtt)
+			last_cmd_cmpl_time = ktime_get();
+	}
+
+	if (pcc_channel->mbox->txdone_irq)
+		mbox_chan_txdone(pcc_channel, ret);
+	else
+		mbox_client_txdone(pcc_channel, ret);
+
+	return ret >= 0 ? 0 : ret;
+}
+EXPORT_SYMBOL_GPL(ras2_send_pcc_cmd);
+
+static int ras2_register_pcc_channel(struct ras2_mem_ctx *ras2_ctx, int pcc_subspace_id)
+{
+	struct ras2_pcc_subspace *pcc_subspace;
+	struct pcc_mbox_chan *pcc_chan;
+	struct mbox_client *mbox_cl;
+
+	if (pcc_subspace_id < 0)
+		return -EINVAL;
+
+	mutex_lock(&ras2_pcc_subspace_lock);
+	list_for_each_entry(pcc_subspace, &ras2_pcc_subspaces, elem) {
+		if (pcc_subspace->pcc_subspace_id == pcc_subspace_id) {
+			ras2_ctx->pcc_subspace = pcc_subspace;
+			pcc_subspace->ref_count++;
+			mutex_unlock(&ras2_pcc_subspace_lock);
+			return 0;
+		}
+	}
+	mutex_unlock(&ras2_pcc_subspace_lock);
+
+	pcc_subspace = kzalloc(sizeof(*pcc_subspace), GFP_KERNEL);
+	if (!pcc_subspace)
+		return -ENOMEM;
+	mbox_cl = &pcc_subspace->mbox_client;
+	mbox_cl->knows_txdone = true;
+
+	pcc_chan = pcc_mbox_request_channel(mbox_cl, pcc_subspace_id);
+	if (IS_ERR(pcc_chan)) {
+		kfree(pcc_subspace);
+		return PTR_ERR(pcc_chan);
+	}
+	*pcc_subspace = (struct ras2_pcc_subspace) {
+		.pcc_subspace_id = pcc_subspace_id,
+		.pcc_chan = pcc_chan,
+		.pcc_comm_addr = acpi_os_ioremap(pcc_chan->shmem_base_addr,
+						 pcc_chan->shmem_size),
+		.deadline = ns_to_ktime(RAS2_NUM_RETRIES *
+					pcc_chan->latency *
+					NSEC_PER_USEC),
+		.pcc_mrtt = pcc_chan->min_turnaround_time,
+		.pcc_mpar = pcc_chan->max_access_rate,
+		.mbox_client = {
+			.knows_txdone = true,
+		},
+		.pcc_channel_acquired = true,
+	};
+	mutex_lock(&ras2_pcc_subspace_lock);
+	list_add(&pcc_subspace->elem, &ras2_pcc_subspaces);
+	pcc_subspace->ref_count++;
+	mutex_unlock(&ras2_pcc_subspace_lock);
+	ras2_ctx->pcc_subspace = pcc_subspace;
+	ras2_ctx->pcc_comm_addr = pcc_subspace->pcc_comm_addr;
+	ras2_ctx->dev = pcc_chan->mchan->mbox->dev;
+
+	return 0;
+}
+
+static DEFINE_IDA(ras2_ida);
+static void ras2_remove_pcc(struct ras2_mem_ctx *ras2_ctx)
+{
+	struct ras2_pcc_subspace *pcc_subspace = ras2_ctx->pcc_subspace;
+
+	guard(mutex)(&ras2_pcc_subspace_lock);
+	if (pcc_subspace->ref_count > 0)
+		pcc_subspace->ref_count--;
+	if (!pcc_subspace->ref_count) {
+		list_del(&pcc_subspace->elem);
+		pcc_mbox_free_channel(pcc_subspace->pcc_chan);
+		kfree(pcc_subspace);
+	}
+}
+
+static void ras2_release(struct device *device)
+{
+	struct auxiliary_device *auxdev = container_of(device, struct auxiliary_device, dev);
+	struct ras2_mem_ctx *ras2_ctx = container_of(auxdev, struct ras2_mem_ctx, adev);
+
+	ida_free(&ras2_ida, auxdev->id);
+	ras2_remove_pcc(ras2_ctx);
+	kfree(ras2_ctx);
+}
+
+static struct ras2_mem_ctx *ras2_add_aux_device(char *name, int channel)
+{
+	struct ras2_mem_ctx *ras2_ctx;
+	int id, ret;
+
+	ras2_ctx = kzalloc(sizeof(*ras2_ctx), GFP_KERNEL);
+	if (!ras2_ctx)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&ras2_ctx->lock);
+
+	ret = ras2_register_pcc_channel(ras2_ctx, channel);
+	if (ret < 0) {
+		pr_debug("failed to register pcc channel ret=%d\n", ret);
+		goto ctx_free;
+	}
+
+	id = ida_alloc(&ras2_ida, GFP_KERNEL);
+	if (id < 0) {
+		ret = id;
+		goto pcc_free;
+	}
+	ras2_ctx->id = id;
+	ras2_ctx->adev.id = id;
+	ras2_ctx->adev.name = RAS2_MEM_DEV_ID_NAME;
+	ras2_ctx->adev.dev.release = ras2_release;
+	ras2_ctx->adev.dev.parent = ras2_ctx->dev;
+
+	ret = auxiliary_device_init(&ras2_ctx->adev);
+	if (ret)
+		goto ida_free;
+
+	ret = auxiliary_device_add(&ras2_ctx->adev);
+	if (ret) {
+		auxiliary_device_uninit(&ras2_ctx->adev);
+		return ERR_PTR(ret);
+	}
+
+	return ras2_ctx;
+
+ida_free:
+	ida_free(&ras2_ida, id);
+pcc_free:
+	ras2_remove_pcc(ras2_ctx);
+ctx_free:
+	kfree(ras2_ctx);
+	return ERR_PTR(ret);
+}
+
+static int __init ras2_acpi_init(void)
+{
+	struct acpi_table_header *pAcpiTable = NULL;
+	struct acpi_ras2_pcc_desc *pcc_desc_list;
+	struct acpi_table_ras2 *pRas2Table;
+	struct ras2_mem_ctx *ras2_ctx;
+	int pcc_subspace_id;
+	acpi_size ras2_size;
+	acpi_status status;
+	u8 count = 0, i;
+	int ret = 0;
+
+	status = acpi_get_table("RAS2", 0, &pAcpiTable);
+	if (ACPI_FAILURE(status) || !pAcpiTable) {
+		pr_err("ACPI RAS2 driver failed to initialize, get table failed\n");
+		return -EINVAL;
+	}
+
+	ras2_size = pAcpiTable->length;
+	if (ras2_size < sizeof(struct acpi_table_ras2)) {
+		pr_err("ACPI RAS2 table present but broken (too short)\n");
+		ret = -EINVAL;
+		goto free_ras2_table;
+	}
+
+	pRas2Table = (struct acpi_table_ras2 *)pAcpiTable;
+	if (pRas2Table->num_pcc_descs <= 0) {
+		pr_err("ACPI RAS2 table does not contain PCC descriptors\n");
+		ret = -EINVAL;
+		goto free_ras2_table;
+	}
+
+	pcc_desc_list = (struct acpi_ras2_pcc_desc *)(pRas2Table + 1);
+	/* Double scan for the case of only one actual controller */
+	pcc_subspace_id = -1;
+	count = 0;
+	for (i = 0; i < pRas2Table->num_pcc_descs; i++, pcc_desc_list++) {
+		if (pcc_desc_list->feature_type != RAS2_FEATURE_TYPE_MEMORY)
+			continue;
+		if (pcc_subspace_id == -1) {
+			pcc_subspace_id = pcc_desc_list->channel_id;
+			count++;
+		}
+		if (pcc_desc_list->channel_id != pcc_subspace_id)
+			count++;
+	}
+	/*
+	 * Workaround for the client platform with multiple scrub devices
+	 * but using a single PCC subspace for communication.
+	 */
+	if (count == 1) {
+		/* Add auxiliary device and bind ACPI RAS2 memory driver */
+		ras2_ctx = ras2_add_aux_device(RAS2_MEM_DEV_ID_NAME, pcc_subspace_id);
+		if (IS_ERR(ras2_ctx)) {
+			ret = PTR_ERR(ras2_ctx);
+			goto free_ras2_table;
+		}
+		acpi_put_table(pAcpiTable);
+		return 0;
+	}
+
+	pcc_desc_list = (struct acpi_ras2_pcc_desc *)(pRas2Table + 1);
+	count = 0;
+	for (i = 0; i < pRas2Table->num_pcc_descs; i++, pcc_desc_list++) {
+		if (pcc_desc_list->feature_type != RAS2_FEATURE_TYPE_MEMORY)
+			continue;
+		pcc_subspace_id = pcc_desc_list->channel_id;
+		/* Add auxiliary device and bind ACPI RAS2 memory driver */
+		ras2_ctx = ras2_add_aux_device(RAS2_MEM_DEV_ID_NAME, pcc_subspace_id);
+		if (IS_ERR(ras2_ctx)) {
+			ret = PTR_ERR(ras2_ctx);
+			goto free_ras2_table;
+		}
+	}
+
+free_ras2_table:
+	acpi_put_table(pAcpiTable);
+	return ret;
+}
+late_initcall(ras2_acpi_init);
diff --git a/include/acpi/ras2_acpi.h b/include/acpi/ras2_acpi.h
new file mode 100644
index 000000000000..7b32407ae2af
--- /dev/null
+++ b/include/acpi/ras2_acpi.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * RAS2 ACPI driver header file
+ *
+ * (C) Copyright 2014, 2015 Hewlett-Packard Enterprises
+ *
+ * Copyright (c) 2024 HiSilicon Limited
+ */
+
+#ifndef _RAS2_ACPI_H
+#define _RAS2_ACPI_H
+
+#include <linux/acpi.h>
+#include <linux/auxiliary_bus.h>
+#include <linux/mailbox_client.h>
+#include <linux/mutex.h>
+#include <linux/types.h>
+
+#define RAS2_PCC_CMD_COMPLETE	BIT(0)
+#define RAS2_PCC_CMD_ERROR	BIT(2)
+
+/* RAS2 specific PCC commands */
+#define RAS2_PCC_CMD_EXEC 0x01
+
+#define RAS2_AUX_DEV_NAME "ras2"
+#define RAS2_MEM_DEV_ID_NAME "acpi_ras2_mem"
+
+/* Data structure RAS2 table */
+struct ras2_mem_ctx {
+	struct auxiliary_device adev;
+	/* Lock to provide mutually exclusive access to PCC channel */
+	struct mutex lock;
+	int id;
+	u8 instance;
+	bool bg;
+	u64 base, size;
+	u8 scrub_cycle_hrs, min_scrub_cycle, max_scrub_cycle;
+	struct device *dev;
+	struct device *scrub_dev;
+	void *pcc_subspace;
+	struct acpi_ras2_shared_memory __iomem *pcc_comm_addr;
+};
+
+int ras2_send_pcc_cmd(struct ras2_mem_ctx *ras2_ctx, u16 cmd);
+#endif /* _RAS2_ACPI_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v18 06/19] ras: mem: Add memory ACPI RAS2 driver
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (4 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 05/19] ACPI:RAS2: Add ACPI RAS2 driver shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-21 23:01   ` Daniel Ferguson
  2025-01-30 19:19   ` Daniel Ferguson
  2025-01-06 12:10 ` [PATCH v18 07/19] cxl: Refactor user ioctl command path from mds to mailbox shiju.jose
                   ` (14 subsequent siblings)
  20 siblings, 2 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

The memory ACPI RAS2 auxiliary driver binds to the auxiliary device
added by the ACPI RAS2 table parser.

The driver uses a PCC subspace for communicating with the ACPI-compliant
platform.

A device with the ACPI RAS2 scrub feature registers with the EDAC device
driver, which retrieves the scrub descriptor from the EDAC scrub module
and exposes the scrub control attributes for the RAS2 scrub instance to
userspace in /sys/bus/edac/devices/acpi_ras_mem0/scrubX/.

Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 Documentation/edac/scrub.rst |  81 ++++++++
 drivers/ras/Kconfig          |  10 +
 drivers/ras/Makefile         |   1 +
 drivers/ras/acpi_ras2.c      | 385 +++++++++++++++++++++++++++++++++++
 4 files changed, 477 insertions(+)
 create mode 100644 drivers/ras/acpi_ras2.c

diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
index 5640f9aeee38..f86645c7f0af 100644
--- a/Documentation/edac/scrub.rst
+++ b/Documentation/edac/scrub.rst
@@ -244,3 +244,84 @@ Sysfs files are documented in
 `Documentation/ABI/testing/sysfs-edac-scrub`.
 
 `Documentation/ABI/testing/sysfs-edac-ecs`.
+
+Examples
+--------
+
+The usage takes the form shown in these examples:
+
+1. ACPI RAS2
+
+1.1 On demand scrubbing for a specific memory region.
+
+root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/min_cycle_duration
+
+3600
+
+root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/max_cycle_duration
+
+86400
+
+root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_duration
+
+36000
+
+# Read back 'addr': non-zero - demand scrub is in progress, zero - scrub is finished.
+
+root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/addr
+
+0
+
+root@localhost:~# echo 54000 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_duration
+
+root@localhost:~# echo 0x150000 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/size
+
+# Writing 'addr' starts demand scrubbing; make sure the other attributes are set prior to that.
+
+root@localhost:~# echo 0x120000 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/addr
+
+root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_duration
+
+54000
+
+# Read back 'addr': non-zero - demand scrub is in progress, zero - scrub is finished.
+
+root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/addr
+
+0x120000
+
+root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/addr
+
+0
+
+1.2 Background scrubbing the entire memory
+
+root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/min_cycle_duration
+
+3600
+
+root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/max_cycle_duration
+
+86400
+
+root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_duration
+
+36000
+
+root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/enable_background
+
+0
+
+root@localhost:~# echo 10800 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_duration
+
+root@localhost:~# echo 1 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/enable_background
+
+root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/enable_background
+
+1
+
+root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_duration
+
+10800
+
+root@localhost:~# echo 0 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/enable_background
diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
index fc4f4bb94a4c..b77790bdc73a 100644
--- a/drivers/ras/Kconfig
+++ b/drivers/ras/Kconfig
@@ -46,4 +46,14 @@ config RAS_FMPM
 	  Memory will be retired during boot time and run time depending on
 	  platform-specific policies.
 
+config MEM_ACPI_RAS2
+	tristate "Memory ACPI RAS2 driver"
+	depends on ACPI_RAS2
+	depends on EDAC
+	help
+	  The driver binds to the auxiliary device added by the ACPI RAS2
+	  table parser. It uses a PCC channel subspace for communicating
+	  with the ACPI-compliant platform to provide control of memory
+	  scrub parameters to the user via the EDAC scrub interface.
+
 endif
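
With the Kconfig entry above, the driver can be built as a module once its
dependencies are satisfied. A possible configuration fragment (assumption:
the target platform provides an ACPI RAS2 table):

```shell
# Example .config fragment for the ACPI RAS2 memory scrub driver
CONFIG_ACPI_RAS2=y
CONFIG_EDAC=y
CONFIG_MEM_ACPI_RAS2=m
```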
diff --git a/drivers/ras/Makefile b/drivers/ras/Makefile
index 11f95d59d397..a0e6e903d6b0 100644
--- a/drivers/ras/Makefile
+++ b/drivers/ras/Makefile
@@ -2,6 +2,7 @@
 obj-$(CONFIG_RAS)	+= ras.o
 obj-$(CONFIG_DEBUG_FS)	+= debugfs.o
 obj-$(CONFIG_RAS_CEC)	+= cec.o
+obj-$(CONFIG_MEM_ACPI_RAS2)	+= acpi_ras2.o
 
 obj-$(CONFIG_RAS_FMPM)	+= amd/fmpm.o
 obj-y			+= amd/atl/
diff --git a/drivers/ras/acpi_ras2.c b/drivers/ras/acpi_ras2.c
new file mode 100644
index 000000000000..6c5772d60f22
--- /dev/null
+++ b/drivers/ras/acpi_ras2.c
@@ -0,0 +1,385 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * ACPI RAS2 memory driver
+ *
+ * Copyright (c) 2024 HiSilicon Limited.
+ *
+ */
+
+#define pr_fmt(fmt)	"MEMORY ACPI RAS2: " fmt
+
+#include <linux/bitfield.h>
+#include <linux/edac.h>
+#include <linux/platform_device.h>
+#include <acpi/ras2_acpi.h>
+
+#define RAS2_DEV_NUM_RAS_FEATURES	1
+
+#define RAS2_SUPPORT_HW_PARTOL_SCRUB	BIT(0)
+#define RAS2_TYPE_PATROL_SCRUB	0x0000
+
+#define RAS2_GET_PATROL_PARAMETERS	0x01
+#define RAS2_START_PATROL_SCRUBBER	0x02
+#define RAS2_STOP_PATROL_SCRUBBER	0x03
+
+#define RAS2_PATROL_SCRUB_SCHRS_IN_MASK	GENMASK(15, 8)
+#define RAS2_PATROL_SCRUB_EN_BACKGROUND	BIT(0)
+#define RAS2_PATROL_SCRUB_SCHRS_OUT_MASK	GENMASK(7, 0)
+#define RAS2_PATROL_SCRUB_MIN_SCHRS_OUT_MASK	GENMASK(15, 8)
+#define RAS2_PATROL_SCRUB_MAX_SCHRS_OUT_MASK	GENMASK(23, 16)
+#define RAS2_PATROL_SCRUB_FLAG_SCRUBBER_RUNNING	BIT(0)
+
+#define RAS2_SCRUB_NAME_LEN      128
+#define RAS2_HOUR_IN_SECS    3600
+
+struct acpi_ras2_ps_shared_mem {
+	struct acpi_ras2_shared_memory common;
+	struct acpi_ras2_patrol_scrub_parameter params;
+};
+
+static int ras2_is_patrol_scrub_support(struct ras2_mem_ctx *ras2_ctx)
+{
+	struct acpi_ras2_shared_memory __iomem *common = (void *)
+						ras2_ctx->pcc_comm_addr;
+
+	guard(mutex)(&ras2_ctx->lock);
+	common->set_capabilities[0] = 0;
+
+	return common->features[0] & RAS2_SUPPORT_HW_PARTOL_SCRUB;
+}
+
+static int ras2_update_patrol_scrub_params_cache(struct ras2_mem_ctx *ras2_ctx)
+{
+	struct acpi_ras2_ps_shared_mem __iomem *ps_sm = (void *)
+						ras2_ctx->pcc_comm_addr;
+	int ret;
+
+	ps_sm->common.set_capabilities[0] = RAS2_SUPPORT_HW_PARTOL_SCRUB;
+	ps_sm->params.patrol_scrub_command = RAS2_GET_PATROL_PARAMETERS;
+
+	ret = ras2_send_pcc_cmd(ras2_ctx, RAS2_PCC_CMD_EXEC);
+	if (ret) {
+		dev_err(ras2_ctx->dev, "failed to read parameters\n");
+		return ret;
+	}
+
+	ras2_ctx->min_scrub_cycle = FIELD_GET(RAS2_PATROL_SCRUB_MIN_SCHRS_OUT_MASK,
+					      ps_sm->params.scrub_params_out);
+	ras2_ctx->max_scrub_cycle = FIELD_GET(RAS2_PATROL_SCRUB_MAX_SCHRS_OUT_MASK,
+					      ps_sm->params.scrub_params_out);
+	if (!ras2_ctx->bg) {
+		ras2_ctx->base = ps_sm->params.actual_address_range[0];
+		ras2_ctx->size = ps_sm->params.actual_address_range[1];
+	}
+	ras2_ctx->scrub_cycle_hrs = FIELD_GET(RAS2_PATROL_SCRUB_SCHRS_OUT_MASK,
+					      ps_sm->params.scrub_params_out);
+
+	return 0;
+}
+
+/* Context - lock must be held */
+static int ras2_get_patrol_scrub_running(struct ras2_mem_ctx *ras2_ctx,
+					 bool *running)
+{
+	struct acpi_ras2_ps_shared_mem __iomem *ps_sm = (void *)
+						ras2_ctx->pcc_comm_addr;
+	int ret;
+
+	ps_sm->common.set_capabilities[0] = RAS2_SUPPORT_HW_PARTOL_SCRUB;
+	ps_sm->params.patrol_scrub_command = RAS2_GET_PATROL_PARAMETERS;
+
+	ret = ras2_send_pcc_cmd(ras2_ctx, RAS2_PCC_CMD_EXEC);
+	if (ret) {
+		dev_err(ras2_ctx->dev, "failed to read parameters\n");
+		return ret;
+	}
+
+	*running = ps_sm->params.flags & RAS2_PATROL_SCRUB_FLAG_SCRUBBER_RUNNING;
+
+	return 0;
+}
+
+static int ras2_hw_scrub_read_min_scrub_cycle(struct device *dev, void *drv_data,
+					      u32 *min)
+{
+	struct ras2_mem_ctx *ras2_ctx = drv_data;
+
+	*min = ras2_ctx->min_scrub_cycle * RAS2_HOUR_IN_SECS;
+
+	return 0;
+}
+
+static int ras2_hw_scrub_read_max_scrub_cycle(struct device *dev, void *drv_data,
+					      u32 *max)
+{
+	struct ras2_mem_ctx *ras2_ctx = drv_data;
+
+	*max = ras2_ctx->max_scrub_cycle * RAS2_HOUR_IN_SECS;
+
+	return 0;
+}
+
+static int ras2_hw_scrub_cycle_read(struct device *dev, void *drv_data,
+				    u32 *scrub_cycle_secs)
+{
+	struct ras2_mem_ctx *ras2_ctx = drv_data;
+
+	*scrub_cycle_secs = ras2_ctx->scrub_cycle_hrs * RAS2_HOUR_IN_SECS;
+
+	return 0;
+}
+
+static int ras2_hw_scrub_cycle_write(struct device *dev, void *drv_data,
+				     u32 scrub_cycle_secs)
+{
+	u8 scrub_cycle_hrs = scrub_cycle_secs / RAS2_HOUR_IN_SECS;
+	struct ras2_mem_ctx *ras2_ctx = drv_data;
+	bool running;
+	int ret;
+
+	guard(mutex)(&ras2_ctx->lock);
+	ret = ras2_get_patrol_scrub_running(ras2_ctx, &running);
+	if (ret)
+		return ret;
+
+	if (running)
+		return -EBUSY;
+
+	if (scrub_cycle_hrs < ras2_ctx->min_scrub_cycle ||
+	    scrub_cycle_hrs > ras2_ctx->max_scrub_cycle)
+		return -EINVAL;
+
+	ras2_ctx->scrub_cycle_hrs = scrub_cycle_hrs;
+
+	return 0;
+}
+
+static int ras2_hw_scrub_read_addr(struct device *dev, void *drv_data, u64 *base)
+{
+	struct ras2_mem_ctx *ras2_ctx = drv_data;
+	int ret;
+
+	/*
+	 * When BG scrubbing is enabled, the actual address range is not valid.
+	 * Return -EBUSY until a method to retrieve the actual full PA range is found.
+	 */
+	if (ras2_ctx->bg)
+		return -EBUSY;
+
+	/*
+	 * When demand scrubbing is finished, the firmware must reset the actual
+	 * address range to 0; otherwise userspace assumes demand scrubbing
+	 * is still in progress.
+	 */
+	ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
+	if (ret)
+		return ret;
+	*base = ras2_ctx->base;
+
+	return 0;
+}
+
+static int ras2_hw_scrub_read_size(struct device *dev, void *drv_data, u64 *size)
+{
+	struct ras2_mem_ctx *ras2_ctx = drv_data;
+	int ret;
+
+	if (ras2_ctx->bg)
+		return -EBUSY;
+
+	ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
+	if (ret)
+		return ret;
+	*size = ras2_ctx->size;
+
+	return 0;
+}
+
+static int ras2_hw_scrub_write_addr(struct device *dev, void *drv_data, u64 base)
+{
+	struct ras2_mem_ctx *ras2_ctx = drv_data;
+	struct acpi_ras2_ps_shared_mem __iomem *ps_sm = (void *)
+						ras2_ctx->pcc_comm_addr;
+	bool running;
+	int ret;
+
+	guard(mutex)(&ras2_ctx->lock);
+	ps_sm->common.set_capabilities[0] = RAS2_SUPPORT_HW_PARTOL_SCRUB;
+	if (ras2_ctx->bg)
+		return -EBUSY;
+
+	if (!base || !ras2_ctx->size) {
+		dev_warn(ras2_ctx->dev,
+			 "%s: Invalid address range, base=0x%llx size=0x%llx\n",
+			 __func__, base, ras2_ctx->size);
+		return -ERANGE;
+	}
+
+	ret = ras2_get_patrol_scrub_running(ras2_ctx, &running);
+	if (ret)
+		return ret;
+
+	if (running)
+		return -EBUSY;
+
+	ps_sm->params.scrub_params_in &= ~RAS2_PATROL_SCRUB_SCHRS_IN_MASK;
+	ps_sm->params.scrub_params_in |= FIELD_PREP(RAS2_PATROL_SCRUB_SCHRS_IN_MASK,
+						    ras2_ctx->scrub_cycle_hrs);
+	ps_sm->params.requested_address_range[0] = base;
+	ps_sm->params.requested_address_range[1] = ras2_ctx->size;
+	ps_sm->params.scrub_params_in &= ~RAS2_PATROL_SCRUB_EN_BACKGROUND;
+	ps_sm->params.patrol_scrub_command = RAS2_START_PATROL_SCRUBBER;
+
+	ret = ras2_send_pcc_cmd(ras2_ctx, RAS2_PCC_CMD_EXEC);
+	if (ret) {
+		dev_err(ras2_ctx->dev, "Failed to start demand scrubbing\n");
+		return ret;
+	}
+
+	return ras2_update_patrol_scrub_params_cache(ras2_ctx);
+}
+
+static int ras2_hw_scrub_write_size(struct device *dev, void *drv_data, u64 size)
+{
+	struct ras2_mem_ctx *ras2_ctx = drv_data;
+	bool running;
+	int ret;
+
+	guard(mutex)(&ras2_ctx->lock);
+	ret = ras2_get_patrol_scrub_running(ras2_ctx, &running);
+	if (ret)
+		return ret;
+
+	if (running)
+		return -EBUSY;
+
+	if (!size) {
+		dev_warn(dev, "%s: Invalid address range size=0x%llx\n",
+			 __func__, size);
+		return -EINVAL;
+	}
+
+	ras2_ctx->size = size;
+
+	return 0;
+}
+
+static int ras2_hw_scrub_set_enabled_bg(struct device *dev, void *drv_data, bool enable)
+{
+	struct ras2_mem_ctx *ras2_ctx = drv_data;
+	struct acpi_ras2_ps_shared_mem __iomem *ps_sm = (void *)
+						ras2_ctx->pcc_comm_addr;
+	bool running;
+	int ret;
+
+	guard(mutex)(&ras2_ctx->lock);
+	ps_sm->common.set_capabilities[0] = RAS2_SUPPORT_HW_PARTOL_SCRUB;
+	ret = ras2_get_patrol_scrub_running(ras2_ctx, &running);
+	if (ret)
+		return ret;
+	if (enable) {
+		if (ras2_ctx->bg || running)
+			return -EBUSY;
+		ps_sm->params.requested_address_range[0] = 0;
+		ps_sm->params.requested_address_range[1] = 0;
+		ps_sm->params.scrub_params_in &= ~RAS2_PATROL_SCRUB_SCHRS_IN_MASK;
+		ps_sm->params.scrub_params_in |= FIELD_PREP(RAS2_PATROL_SCRUB_SCHRS_IN_MASK,
+							    ras2_ctx->scrub_cycle_hrs);
+		ps_sm->params.patrol_scrub_command = RAS2_START_PATROL_SCRUBBER;
+	} else {
+		if (!ras2_ctx->bg)
+			return -EPERM;
+		ps_sm->params.patrol_scrub_command = RAS2_STOP_PATROL_SCRUBBER;
+	}
+	ps_sm->params.scrub_params_in &= ~RAS2_PATROL_SCRUB_EN_BACKGROUND;
+	ps_sm->params.scrub_params_in |= FIELD_PREP(RAS2_PATROL_SCRUB_EN_BACKGROUND,
+						    enable);
+	ret = ras2_send_pcc_cmd(ras2_ctx, RAS2_PCC_CMD_EXEC);
+	if (ret) {
+		dev_err(ras2_ctx->dev, "Failed to %s background scrubbing\n",
+			enable ? "enable" : "disable");
+		return ret;
+	}
+	if (enable) {
+		ras2_ctx->bg = true;
+		/* Update the cache to account for rounding of supplied parameters and similar */
+		ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
+	} else {
+		ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
+		ras2_ctx->bg = false;
+	}
+
+	return ret;
+}
+
+static int ras2_hw_scrub_get_enabled_bg(struct device *dev, void *drv_data, bool *enabled)
+{
+	struct ras2_mem_ctx *ras2_ctx = drv_data;
+
+	*enabled = ras2_ctx->bg;
+
+	return 0;
+}
+
+static const struct edac_scrub_ops ras2_scrub_ops = {
+	.read_addr = ras2_hw_scrub_read_addr,
+	.read_size = ras2_hw_scrub_read_size,
+	.write_addr = ras2_hw_scrub_write_addr,
+	.write_size = ras2_hw_scrub_write_size,
+	.get_enabled_bg = ras2_hw_scrub_get_enabled_bg,
+	.set_enabled_bg = ras2_hw_scrub_set_enabled_bg,
+	.get_min_cycle = ras2_hw_scrub_read_min_scrub_cycle,
+	.get_max_cycle = ras2_hw_scrub_read_max_scrub_cycle,
+	.get_cycle_duration = ras2_hw_scrub_cycle_read,
+	.set_cycle_duration = ras2_hw_scrub_cycle_write,
+};
+
+static int ras2_probe(struct auxiliary_device *auxdev,
+		      const struct auxiliary_device_id *id)
+{
+	struct ras2_mem_ctx *ras2_ctx = container_of(auxdev, struct ras2_mem_ctx, adev);
+	struct edac_dev_feature ras_features[RAS2_DEV_NUM_RAS_FEATURES];
+	char scrub_name[RAS2_SCRUB_NAME_LEN];
+	int num_ras_features = 0;
+	int ret;
+
+	if (!ras2_is_patrol_scrub_support(ras2_ctx))
+		return -EOPNOTSUPP;
+
+	ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
+	if (ret)
+		return ret;
+
+	snprintf(scrub_name, sizeof(scrub_name), "acpi_ras_mem%d",
+		 ras2_ctx->id);
+
+	ras_features[num_ras_features].ft_type = RAS_FEAT_SCRUB;
+	ras_features[num_ras_features].instance = ras2_ctx->instance;
+	ras_features[num_ras_features].scrub_ops = &ras2_scrub_ops;
+	ras_features[num_ras_features].ctx = ras2_ctx;
+	num_ras_features++;
+
+	return edac_dev_register(&auxdev->dev, scrub_name, NULL,
+				 num_ras_features, ras_features);
+}
+
+static const struct auxiliary_device_id ras2_mem_dev_id_table[] = {
+	{ .name = RAS2_AUX_DEV_NAME "." RAS2_MEM_DEV_ID_NAME, },
+	{ },
+};
+
+MODULE_DEVICE_TABLE(auxiliary, ras2_mem_dev_id_table);
+
+static struct auxiliary_driver ras2_mem_driver = {
+	.name = RAS2_MEM_DEV_ID_NAME,
+	.probe = ras2_probe,
+	.id_table = ras2_mem_dev_id_table,
+};
+module_auxiliary_driver(ras2_mem_driver);
+
+MODULE_IMPORT_NS("ACPI_RAS2");
+MODULE_DESCRIPTION("ACPI RAS2 memory driver");
+MODULE_LICENSE("GPL");
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v18 07/19] cxl: Refactor user ioctl command path from mds to mailbox
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (5 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 06/19] ras: mem: Add memory " shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-06 12:10 ` [PATCH v18 08/19] cxl: Add skeletal features driver shiju.jose
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Dave Jiang <dave.jiang@intel.com>

With 'struct cxl_mailbox' context introduced, the helper functions
cxl_query_cmd() and cxl_send_cmd() can take a cxl_mailbox directly
rather than a cxl_memdev parameter. Refactor to use cxl_mailbox
directly.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/core.h   |  6 ++-
 drivers/cxl/core/mbox.c   | 91 +++++++++++++++++++--------------------
 drivers/cxl/core/memdev.c | 22 +++++++---
 drivers/cxl/cxlmem.h      | 40 -----------------
 include/cxl/mailbox.h     | 41 +++++++++++++++++-
 5 files changed, 103 insertions(+), 97 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 800466f96a68..23761340e65c 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -4,6 +4,8 @@
 #ifndef __CXL_CORE_H__
 #define __CXL_CORE_H__
 
+#include <cxl/mailbox.h>
+
 extern const struct device_type cxl_nvdimm_bridge_type;
 extern const struct device_type cxl_nvdimm_type;
 extern const struct device_type cxl_pmu_type;
@@ -65,9 +67,9 @@ static inline void cxl_region_exit(void)
 
 struct cxl_send_command;
 struct cxl_mem_query_commands;
-int cxl_query_cmd(struct cxl_memdev *cxlmd,
+int cxl_query_cmd(struct cxl_mailbox *cxl_mbox,
 		  struct cxl_mem_query_commands __user *q);
-int cxl_send_cmd(struct cxl_memdev *cxlmd, struct cxl_send_command __user *s);
+int cxl_send_cmd(struct cxl_mailbox *cxl_mbox, struct cxl_send_command __user *s);
 void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
 				   resource_size_t length);
 
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 548564c770c0..bdb8f060f2c1 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -349,40 +349,40 @@ static bool cxl_payload_from_user_allowed(u16 opcode, void *payload_in)
 	return true;
 }
 
-static int cxl_mbox_cmd_ctor(struct cxl_mbox_cmd *mbox,
-			     struct cxl_memdev_state *mds, u16 opcode,
+static int cxl_mbox_cmd_ctor(struct cxl_mbox_cmd *mbox_cmd,
+			     struct cxl_mailbox *cxl_mbox, u16 opcode,
 			     size_t in_size, size_t out_size, u64 in_payload)
 {
-	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
-	*mbox = (struct cxl_mbox_cmd) {
+	*mbox_cmd = (struct cxl_mbox_cmd) {
 		.opcode = opcode,
 		.size_in = in_size,
 	};
 
 	if (in_size) {
-		mbox->payload_in = vmemdup_user(u64_to_user_ptr(in_payload),
-						in_size);
-		if (IS_ERR(mbox->payload_in))
-			return PTR_ERR(mbox->payload_in);
-
-		if (!cxl_payload_from_user_allowed(opcode, mbox->payload_in)) {
-			dev_dbg(mds->cxlds.dev, "%s: input payload not allowed\n",
+		mbox_cmd->payload_in = vmemdup_user(u64_to_user_ptr(in_payload),
+						    in_size);
+		if (IS_ERR(mbox_cmd->payload_in))
+			return PTR_ERR(mbox_cmd->payload_in);
+
+		if (!cxl_payload_from_user_allowed(opcode,
+						   mbox_cmd->payload_in)) {
+			dev_dbg(cxl_mbox->host, "%s: input payload not allowed\n",
 				cxl_mem_opcode_to_name(opcode));
-			kvfree(mbox->payload_in);
+			kvfree(mbox_cmd->payload_in);
 			return -EBUSY;
 		}
 	}
 
 	/* Prepare to handle a full payload for variable sized output */
 	if (out_size == CXL_VARIABLE_PAYLOAD)
-		mbox->size_out = cxl_mbox->payload_size;
+		mbox_cmd->size_out = cxl_mbox->payload_size;
 	else
-		mbox->size_out = out_size;
+		mbox_cmd->size_out = out_size;
 
-	if (mbox->size_out) {
-		mbox->payload_out = kvzalloc(mbox->size_out, GFP_KERNEL);
-		if (!mbox->payload_out) {
-			kvfree(mbox->payload_in);
+	if (mbox_cmd->size_out) {
+		mbox_cmd->payload_out = kvzalloc(mbox_cmd->size_out, GFP_KERNEL);
+		if (!mbox_cmd->payload_out) {
+			kvfree(mbox_cmd->payload_in);
 			return -ENOMEM;
 		}
 	}
@@ -397,10 +397,8 @@ static void cxl_mbox_cmd_dtor(struct cxl_mbox_cmd *mbox)
 
 static int cxl_to_mem_cmd_raw(struct cxl_mem_command *mem_cmd,
 			      const struct cxl_send_command *send_cmd,
-			      struct cxl_memdev_state *mds)
+			      struct cxl_mailbox *cxl_mbox)
 {
-	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
-
 	if (send_cmd->raw.rsvd)
 		return -EINVAL;
 
@@ -415,7 +413,7 @@ static int cxl_to_mem_cmd_raw(struct cxl_mem_command *mem_cmd,
 	if (!cxl_mem_raw_command_allowed(send_cmd->raw.opcode))
 		return -EPERM;
 
-	dev_WARN_ONCE(mds->cxlds.dev, true, "raw command path used\n");
+	dev_WARN_ONCE(cxl_mbox->host, true, "raw command path used\n");
 
 	*mem_cmd = (struct cxl_mem_command) {
 		.info = {
@@ -431,7 +429,7 @@ static int cxl_to_mem_cmd_raw(struct cxl_mem_command *mem_cmd,
 
 static int cxl_to_mem_cmd(struct cxl_mem_command *mem_cmd,
 			  const struct cxl_send_command *send_cmd,
-			  struct cxl_memdev_state *mds)
+			  struct cxl_mailbox *cxl_mbox)
 {
 	struct cxl_mem_command *c = &cxl_mem_commands[send_cmd->id];
 	const struct cxl_command_info *info = &c->info;
@@ -446,11 +444,11 @@ static int cxl_to_mem_cmd(struct cxl_mem_command *mem_cmd,
 		return -EINVAL;
 
 	/* Check that the command is enabled for hardware */
-	if (!test_bit(info->id, mds->enabled_cmds))
+	if (!test_bit(info->id, cxl_mbox->enabled_cmds))
 		return -ENOTTY;
 
 	/* Check that the command is not claimed for exclusive kernel use */
-	if (test_bit(info->id, mds->exclusive_cmds))
+	if (test_bit(info->id, cxl_mbox->exclusive_cmds))
 		return -EBUSY;
 
 	/* Check the input buffer is the expected size */
@@ -479,7 +477,7 @@ static int cxl_to_mem_cmd(struct cxl_mem_command *mem_cmd,
 /**
  * cxl_validate_cmd_from_user() - Check fields for CXL_MEM_SEND_COMMAND.
  * @mbox_cmd: Sanitized and populated &struct cxl_mbox_cmd.
- * @mds: The driver data for the operation
+ * @cxl_mbox: CXL mailbox context
  * @send_cmd: &struct cxl_send_command copied in from userspace.
  *
  * Return:
@@ -494,10 +492,9 @@ static int cxl_to_mem_cmd(struct cxl_mem_command *mem_cmd,
  * safe to send to the hardware.
  */
 static int cxl_validate_cmd_from_user(struct cxl_mbox_cmd *mbox_cmd,
-				      struct cxl_memdev_state *mds,
+				      struct cxl_mailbox *cxl_mbox,
 				      const struct cxl_send_command *send_cmd)
 {
-	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
 	struct cxl_mem_command mem_cmd;
 	int rc;
 
@@ -514,24 +511,23 @@ static int cxl_validate_cmd_from_user(struct cxl_mbox_cmd *mbox_cmd,
 
 	/* Sanitize and construct a cxl_mem_command */
 	if (send_cmd->id == CXL_MEM_COMMAND_ID_RAW)
-		rc = cxl_to_mem_cmd_raw(&mem_cmd, send_cmd, mds);
+		rc = cxl_to_mem_cmd_raw(&mem_cmd, send_cmd, cxl_mbox);
 	else
-		rc = cxl_to_mem_cmd(&mem_cmd, send_cmd, mds);
+		rc = cxl_to_mem_cmd(&mem_cmd, send_cmd, cxl_mbox);
 
 	if (rc)
 		return rc;
 
 	/* Sanitize and construct a cxl_mbox_cmd */
-	return cxl_mbox_cmd_ctor(mbox_cmd, mds, mem_cmd.opcode,
+	return cxl_mbox_cmd_ctor(mbox_cmd, cxl_mbox, mem_cmd.opcode,
 				 mem_cmd.info.size_in, mem_cmd.info.size_out,
 				 send_cmd->in.payload);
 }
 
-int cxl_query_cmd(struct cxl_memdev *cxlmd,
+int cxl_query_cmd(struct cxl_mailbox *cxl_mbox,
 		  struct cxl_mem_query_commands __user *q)
 {
-	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
-	struct device *dev = &cxlmd->dev;
+	struct device *dev = cxl_mbox->host;
 	struct cxl_mem_command *cmd;
 	u32 n_commands;
 	int j = 0;
@@ -552,9 +548,9 @@ int cxl_query_cmd(struct cxl_memdev *cxlmd,
 	cxl_for_each_cmd(cmd) {
 		struct cxl_command_info info = cmd->info;
 
-		if (test_bit(info.id, mds->enabled_cmds))
+		if (test_bit(info.id, cxl_mbox->enabled_cmds))
 			info.flags |= CXL_MEM_COMMAND_FLAG_ENABLED;
-		if (test_bit(info.id, mds->exclusive_cmds))
+		if (test_bit(info.id, cxl_mbox->exclusive_cmds))
 			info.flags |= CXL_MEM_COMMAND_FLAG_EXCLUSIVE;
 
 		if (copy_to_user(&q->commands[j++], &info, sizeof(info)))
@@ -569,7 +565,7 @@ int cxl_query_cmd(struct cxl_memdev *cxlmd,
 
 /**
  * handle_mailbox_cmd_from_user() - Dispatch a mailbox command for userspace.
- * @mds: The driver data for the operation
+ * @cxl_mbox: The mailbox context for the operation.
  * @mbox_cmd: The validated mailbox command.
  * @out_payload: Pointer to userspace's output payload.
  * @size_out: (Input) Max payload size to copy out.
@@ -590,13 +586,12 @@ int cxl_query_cmd(struct cxl_memdev *cxlmd,
  *
  * See cxl_send_cmd().
  */
-static int handle_mailbox_cmd_from_user(struct cxl_memdev_state *mds,
+static int handle_mailbox_cmd_from_user(struct cxl_mailbox *cxl_mbox,
 					struct cxl_mbox_cmd *mbox_cmd,
 					u64 out_payload, s32 *size_out,
 					u32 *retval)
 {
-	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
-	struct device *dev = mds->cxlds.dev;
+	struct device *dev = cxl_mbox->host;
 	int rc;
 
 	dev_dbg(dev,
@@ -633,10 +628,9 @@ static int handle_mailbox_cmd_from_user(struct cxl_memdev_state *mds,
 	return rc;
 }
 
-int cxl_send_cmd(struct cxl_memdev *cxlmd, struct cxl_send_command __user *s)
+int cxl_send_cmd(struct cxl_mailbox *cxl_mbox, struct cxl_send_command __user *s)
 {
-	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
-	struct device *dev = &cxlmd->dev;
+	struct device *dev = cxl_mbox->host;
 	struct cxl_send_command send;
 	struct cxl_mbox_cmd mbox_cmd;
 	int rc;
@@ -646,11 +640,11 @@ int cxl_send_cmd(struct cxl_memdev *cxlmd, struct cxl_send_command __user *s)
 	if (copy_from_user(&send, s, sizeof(send)))
 		return -EFAULT;
 
-	rc = cxl_validate_cmd_from_user(&mbox_cmd, mds, &send);
+	rc = cxl_validate_cmd_from_user(&mbox_cmd, cxl_mbox, &send);
 	if (rc)
 		return rc;
 
-	rc = handle_mailbox_cmd_from_user(mds, &mbox_cmd, send.out.payload,
+	rc = handle_mailbox_cmd_from_user(cxl_mbox, &mbox_cmd, send.out.payload,
 					  &send.out.size, &send.retval);
 	if (rc)
 		return rc;
@@ -724,6 +718,7 @@ static int cxl_xfer_log(struct cxl_memdev_state *mds, uuid_t *uuid,
  */
 static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
 {
+	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
 	struct cxl_cel_entry *cel_entry;
 	const int cel_entries = size / sizeof(*cel_entry);
 	struct device *dev = mds->cxlds.dev;
@@ -737,7 +732,7 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
 		int enabled = 0;
 
 		if (cmd) {
-			set_bit(cmd->info.id, mds->enabled_cmds);
+			set_bit(cmd->info.id, cxl_mbox->enabled_cmds);
 			enabled++;
 		}
 
@@ -807,6 +802,7 @@ static const uuid_t log_uuid[] = {
  */
 int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
 {
+	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
 	struct cxl_mbox_get_supported_logs *gsl;
 	struct device *dev = mds->cxlds.dev;
 	struct cxl_mem_command *cmd;
@@ -845,7 +841,7 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
 		/* In case CEL was bogus, enable some default commands. */
 		cxl_for_each_cmd(cmd)
 			if (cmd->flags & CXL_CMD_FLAG_FORCE_ENABLE)
-				set_bit(cmd->info.id, mds->enabled_cmds);
+				set_bit(cmd->info.id, cxl_mbox->enabled_cmds);
 
 		/* Found the required CEL */
 		rc = 0;
@@ -1448,6 +1444,7 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
 	mutex_init(&mds->event.log_lock);
 	mds->cxlds.dev = dev;
 	mds->cxlds.reg_map.host = dev;
+	mds->cxlds.cxl_mbox.host = dev;
 	mds->cxlds.reg_map.resource = CXL_RESOURCE_NONE;
 	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
 	mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID;
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index ae3dfcbe8938..2e2e035abdaa 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -564,9 +564,11 @@ EXPORT_SYMBOL_NS_GPL(is_cxl_memdev, "CXL");
 void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
 				unsigned long *cmds)
 {
+	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+
 	down_write(&cxl_memdev_rwsem);
-	bitmap_or(mds->exclusive_cmds, mds->exclusive_cmds, cmds,
-		  CXL_MEM_COMMAND_ID_MAX);
+	bitmap_or(cxl_mbox->exclusive_cmds, cxl_mbox->exclusive_cmds,
+		  cmds, CXL_MEM_COMMAND_ID_MAX);
 	up_write(&cxl_memdev_rwsem);
 }
 EXPORT_SYMBOL_NS_GPL(set_exclusive_cxl_commands, "CXL");
@@ -579,9 +581,11 @@ EXPORT_SYMBOL_NS_GPL(set_exclusive_cxl_commands, "CXL");
 void clear_exclusive_cxl_commands(struct cxl_memdev_state *mds,
 				  unsigned long *cmds)
 {
+	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+
 	down_write(&cxl_memdev_rwsem);
-	bitmap_andnot(mds->exclusive_cmds, mds->exclusive_cmds, cmds,
-		      CXL_MEM_COMMAND_ID_MAX);
+	bitmap_andnot(cxl_mbox->exclusive_cmds, cxl_mbox->exclusive_cmds,
+		      cmds, CXL_MEM_COMMAND_ID_MAX);
 	up_write(&cxl_memdev_rwsem);
 }
 EXPORT_SYMBOL_NS_GPL(clear_exclusive_cxl_commands, "CXL");
@@ -656,11 +660,14 @@ static struct cxl_memdev *cxl_memdev_alloc(struct cxl_dev_state *cxlds,
 static long __cxl_memdev_ioctl(struct cxl_memdev *cxlmd, unsigned int cmd,
 			       unsigned long arg)
 {
+	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+
 	switch (cmd) {
 	case CXL_MEM_QUERY_COMMANDS:
-		return cxl_query_cmd(cxlmd, (void __user *)arg);
+		return cxl_query_cmd(cxl_mbox, (void __user *)arg);
 	case CXL_MEM_SEND_COMMAND:
-		return cxl_send_cmd(cxlmd, (void __user *)arg);
+		return cxl_send_cmd(cxl_mbox, (void __user *)arg);
 	default:
 		return -ENOTTY;
 	}
@@ -994,10 +1001,11 @@ static void cxl_remove_fw_upload(void *fwl)
 int devm_cxl_setup_fw_upload(struct device *host, struct cxl_memdev_state *mds)
 {
 	struct cxl_dev_state *cxlds = &mds->cxlds;
+	struct cxl_mailbox *cxl_mbox = &cxlds->cxl_mbox;
 	struct device *dev = &cxlds->cxlmd->dev;
 	struct fw_upload *fwl;
 
-	if (!test_bit(CXL_MEM_COMMAND_ID_GET_FW_INFO, mds->enabled_cmds))
+	if (!test_bit(CXL_MEM_COMMAND_ID_GET_FW_INFO, cxl_mbox->enabled_cmds))
 		return 0;
 
 	fwl = firmware_upload_register(THIS_MODULE, dev, dev_name(dev),
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 2a25d1957ddb..a0a49809cd76 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -106,42 +106,6 @@ static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
 	return xa_load(&port->endpoints, (unsigned long)&cxlmd->dev);
 }
 
-/**
- * struct cxl_mbox_cmd - A command to be submitted to hardware.
- * @opcode: (input) The command set and command submitted to hardware.
- * @payload_in: (input) Pointer to the input payload.
- * @payload_out: (output) Pointer to the output payload. Must be allocated by
- *		 the caller.
- * @size_in: (input) Number of bytes to load from @payload_in.
- * @size_out: (input) Max number of bytes loaded into @payload_out.
- *            (output) Number of bytes generated by the device. For fixed size
- *            outputs commands this is always expected to be deterministic. For
- *            variable sized output commands, it tells the exact number of bytes
- *            written.
- * @min_out: (input) internal command output payload size validation
- * @poll_count: (input) Number of timeouts to attempt.
- * @poll_interval_ms: (input) Time between mailbox background command polling
- *                    interval timeouts.
- * @return_code: (output) Error code returned from hardware.
- *
- * This is the primary mechanism used to send commands to the hardware.
- * All the fields except @payload_* correspond exactly to the fields described in
- * Command Register section of the CXL 2.0 8.2.8.4.5. @payload_in and
- * @payload_out are written to, and read from the Command Payload Registers
- * defined in CXL 2.0 8.2.8.4.8.
- */
-struct cxl_mbox_cmd {
-	u16 opcode;
-	void *payload_in;
-	void *payload_out;
-	size_t size_in;
-	size_t size_out;
-	size_t min_out;
-	int poll_count;
-	int poll_interval_ms;
-	u16 return_code;
-};
-
 /*
  * Per CXL 3.0 Section 8.2.8.4.5.1
  */
@@ -461,8 +425,6 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
  * @lsa_size: Size of Label Storage Area
  *                (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
  * @firmware_version: Firmware version for the memory device.
- * @enabled_cmds: Hardware commands found enabled in CEL.
- * @exclusive_cmds: Commands that are kernel-internal only
  * @total_bytes: sum of all possible capacities
  * @volatile_only_bytes: hard volatile capacity
  * @persistent_only_bytes: hard persistent capacity
@@ -485,8 +447,6 @@ struct cxl_memdev_state {
 	struct cxl_dev_state cxlds;
 	size_t lsa_size;
 	char firmware_version[0x10];
-	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
-	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
 	u64 total_bytes;
 	u64 volatile_only_bytes;
 	u64 persistent_only_bytes;
diff --git a/include/cxl/mailbox.h b/include/cxl/mailbox.h
index bacd111e75f1..cc894f07a435 100644
--- a/include/cxl/mailbox.h
+++ b/include/cxl/mailbox.h
@@ -3,12 +3,49 @@
 #ifndef __CXL_MBOX_H__
 #define __CXL_MBOX_H__
 #include <linux/rcuwait.h>
+#include <uapi/linux/cxl_mem.h>
 
-struct cxl_mbox_cmd;
+/**
+ * struct cxl_mbox_cmd - A command to be submitted to hardware.
+ * @opcode: (input) The command set and command submitted to hardware.
+ * @payload_in: (input) Pointer to the input payload.
+ * @payload_out: (output) Pointer to the output payload. Must be allocated by
+ *		 the caller.
+ * @size_in: (input) Number of bytes to load from @payload_in.
+ * @size_out: (input) Max number of bytes loaded into @payload_out.
+ *            (output) Number of bytes generated by the device. For fixed size
+ *            outputs commands this is always expected to be deterministic. For
+ *            variable sized output commands, it tells the exact number of bytes
+ *            written.
+ * @min_out: (input) internal command output payload size validation
+ * @poll_count: (input) Number of timeouts to attempt.
+ * @poll_interval_ms: (input) Time between mailbox background command polling
+ *                    interval timeouts.
+ * @return_code: (output) Error code returned from hardware.
+ *
+ * This is the primary mechanism used to send commands to the hardware.
+ * All the fields except @payload_* correspond exactly to the fields described in
+ * Command Register section of the CXL 2.0 8.2.8.4.5. @payload_in and
+ * @payload_out are written to, and read from the Command Payload Registers
+ * defined in CXL 2.0 8.2.8.4.8.
+ */
+struct cxl_mbox_cmd {
+	u16 opcode;
+	void *payload_in;
+	void *payload_out;
+	size_t size_in;
+	size_t size_out;
+	size_t min_out;
+	int poll_count;
+	int poll_interval_ms;
+	u16 return_code;
+};
 
 /**
  * struct cxl_mailbox - context for CXL mailbox operations
  * @host: device that hosts the mailbox
+ * @enabled_cmds: mailbox commands that are enabled by the driver
+ * @exclusive_cmds: mailbox commands that are exclusive to the kernel
  * @payload_size: Size of space for payload
  *                (CXL 3.1 8.2.8.4.3 Mailbox Capabilities Register)
  * @mbox_mutex: mutex protects device mailbox and firmware
@@ -17,6 +54,8 @@ struct cxl_mbox_cmd;
  */
 struct cxl_mailbox {
 	struct device *host;
+	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
+	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
 	size_t payload_size;
 	struct mutex mbox_mutex; /* lock to protect mailbox context */
 	struct rcuwait mbox_wait;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v18 08/19] cxl: Add skeletal features driver
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (6 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 07/19] cxl: Refactor user ioctl command path from mds to mailbox shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-06 12:10 ` [PATCH v18 09/19] cxl: Enumerate feature commands shiju.jose
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Dave Jiang <dave.jiang@intel.com>

Add the basic bits of a features driver to handle all CXL Feature-related
services.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/Kconfig         |  8 +++++
 drivers/cxl/Makefile        |  3 ++
 drivers/cxl/core/Makefile   |  1 +
 drivers/cxl/core/core.h     |  1 +
 drivers/cxl/core/features.c | 71 +++++++++++++++++++++++++++++++++++++
 drivers/cxl/core/port.c     |  3 ++
 drivers/cxl/cxl.h           |  1 +
 drivers/cxl/features.c      | 44 +++++++++++++++++++++++
 drivers/cxl/pci.c           | 19 ++++++++++
 include/cxl/features.h      | 23 ++++++++++++
 include/cxl/mailbox.h       |  3 ++
 tools/testing/cxl/Kbuild    |  1 +
 12 files changed, 178 insertions(+)
 create mode 100644 drivers/cxl/core/features.c
 create mode 100644 drivers/cxl/features.c
 create mode 100644 include/cxl/features.h

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 876469e23f7a..0bc6a2cb8474 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -146,4 +146,12 @@ config CXL_REGION_INVALIDATION_TEST
 	  If unsure, or if this kernel is meant for production environments,
 	  say N.
 
+config CXL_FEATURES
+	tristate "CXL: Features support"
+	default CXL_BUS
+	help
+	  Enable support for CXL Features that are tied to a CXL mailbox.
+
+	  If unsure, say 'y'.
+
 endif
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
index 2caa90fa4bf2..4696fc218df4 100644
--- a/drivers/cxl/Makefile
+++ b/drivers/cxl/Makefile
@@ -7,15 +7,18 @@
 # - 'mem' and 'pmem' before endpoint drivers so that memdevs are
 #   immediately enabled
 # - 'pci' last, also mirrors the hardware enumeration hierarchy
+# - 'features' comes after pci device is enumerated
 obj-y += core/
 obj-$(CONFIG_CXL_PORT) += cxl_port.o
 obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
 obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
 obj-$(CONFIG_CXL_MEM) += cxl_mem.o
 obj-$(CONFIG_CXL_PCI) += cxl_pci.o
+obj-$(CONFIG_CXL_FEATURES) += cxl_features.o
 
 cxl_port-y := port.o
 cxl_acpi-y := acpi.o
 cxl_pmem-y := pmem.o security.o
 cxl_mem-y := mem.o
 cxl_pci-y := pci.o
+cxl_features-y := features.o
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 9259bcc6773c..73b6348afd67 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -14,5 +14,6 @@ cxl_core-y += pci.o
 cxl_core-y += hdm.o
 cxl_core-y += pmu.o
 cxl_core-y += cdat.o
+cxl_core-y += features.o
 cxl_core-$(CONFIG_TRACING) += trace.o
 cxl_core-$(CONFIG_CXL_REGION) += region.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 23761340e65c..e8a3df226643 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -9,6 +9,7 @@
 extern const struct device_type cxl_nvdimm_bridge_type;
 extern const struct device_type cxl_nvdimm_type;
 extern const struct device_type cxl_pmu_type;
+extern const struct device_type cxl_features_type;
 
 extern struct attribute_group cxl_base_attribute_group;
 
diff --git a/drivers/cxl/core/features.c b/drivers/cxl/core/features.c
new file mode 100644
index 000000000000..eb6eb191a32e
--- /dev/null
+++ b/drivers/cxl/core/features.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2024-2025 Intel Corporation. All rights reserved. */
+#include <linux/device.h>
+#include "cxl.h"
+#include "core.h"
+
+#define CXL_FEATURE_MAX_DEVS 65536
+static DEFINE_IDA(cxl_features_ida);
+
+static void cxl_features_release(struct device *dev)
+{
+	struct cxl_features *features = to_cxl_features(dev);
+
+	ida_free(&cxl_features_ida, features->id);
+	kfree(features);
+}
+
+static void remove_features_dev(void *dev)
+{
+	device_unregister(dev);
+}
+
+const struct device_type cxl_features_type = {
+	.name = "features",
+	.release = cxl_features_release,
+};
+EXPORT_SYMBOL_NS_GPL(cxl_features_type, "CXL");
+
+struct cxl_features *cxl_features_alloc(struct cxl_mailbox *cxl_mbox,
+					struct device *parent)
+{
+	struct device *dev;
+	int rc;
+
+	struct cxl_features *features __free(kfree) =
+		kzalloc(sizeof(*features), GFP_KERNEL);
+	if (!features)
+		return ERR_PTR(-ENOMEM);
+
+	rc = ida_alloc_max(&cxl_features_ida, CXL_FEATURE_MAX_DEVS - 1,
+			   GFP_KERNEL);
+	if (rc < 0)
+		return ERR_PTR(rc);
+
+	features->id = rc;
+	features->cxl_mbox = cxl_mbox;
+	dev = &features->dev;
+	device_initialize(dev);
+	device_set_pm_not_required(dev);
+	dev->parent = parent;
+	dev->bus = &cxl_bus_type;
+	dev->type = &cxl_features_type;
+	rc = dev_set_name(dev, "features%d", features->id);
+	if (rc)
+		goto err;
+
+	rc = device_add(dev);
+	if (rc)
+		goto err;
+
+	rc = devm_add_action_or_reset(parent, remove_features_dev, dev);
+	if (rc)
+		goto err;
+
+	return no_free_ptr(features);
+
+err:
+	put_device(dev);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_features_alloc, "CXL");
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 78a5c2c25982..cc53a597cae6 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -74,6 +74,9 @@ static int cxl_device_id(const struct device *dev)
 		return CXL_DEVICE_REGION;
 	if (dev->type == &cxl_pmu_type)
 		return CXL_DEVICE_PMU;
+	if (dev->type == &cxl_features_type)
+		return CXL_DEVICE_FEATURES;
+
 	return 0;
 }
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f6015f24ad38..ee29d1a1c8df 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -855,6 +855,7 @@ void cxl_driver_unregister(struct cxl_driver *cxl_drv);
 #define CXL_DEVICE_PMEM_REGION		7
 #define CXL_DEVICE_DAX_REGION		8
 #define CXL_DEVICE_PMU			9
+#define CXL_DEVICE_FEATURES		10
 
 #define MODULE_ALIAS_CXL(type) MODULE_ALIAS("cxl:t" __stringify(type) "*")
 #define CXL_MODALIAS_FMT "cxl:t%d"
diff --git a/drivers/cxl/features.c b/drivers/cxl/features.c
new file mode 100644
index 000000000000..93b16b5e2b68
--- /dev/null
+++ b/drivers/cxl/features.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2024,2025 Intel Corporation. All rights reserved. */
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <cxl/features.h>
+
+#include "cxl.h"
+
+static int cxl_features_probe(struct device *dev)
+{
+	struct cxl_features *features = to_cxl_features(dev);
+	struct cxl_features_state *cfs __free(kfree) =
+		kzalloc(sizeof(*cfs), GFP_KERNEL);
+
+	if (!cfs)
+		return -ENOMEM;
+
+	cfs->features = features;
+	dev_set_drvdata(dev, no_free_ptr(cfs));
+
+	return 0;
+}
+
+static void cxl_features_remove(struct device *dev)
+{
+	struct cxl_features_state *cfs = dev_get_drvdata(dev);
+
+	kfree(cfs);
+}
+
+static struct cxl_driver cxl_features_driver = {
+	.name = "cxl_features",
+	.probe = cxl_features_probe,
+	.remove = cxl_features_remove,
+	.id = CXL_DEVICE_FEATURES,
+};
+
+module_cxl_driver(cxl_features_driver);
+
+MODULE_DESCRIPTION("CXL: Features");
+MODULE_LICENSE("GPL v2");
+MODULE_IMPORT_NS("CXL");
+MODULE_ALIAS_CXL(CXL_DEVICE_FEATURES);
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 6d94ff4a4f1a..eb68dd3f8b21 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -386,6 +386,21 @@ static int cxl_pci_mbox_send(struct cxl_mailbox *cxl_mbox,
 	return rc;
 }
 
+static int cxl_pci_setup_features(struct cxl_memdev_state *mds)
+{
+	struct cxl_dev_state *cxlds = &mds->cxlds;
+	struct cxl_mailbox *cxl_mbox = &cxlds->cxl_mbox;
+	struct cxl_features *features;
+
+	features = cxl_features_alloc(cxl_mbox, cxlds->dev);
+	if (IS_ERR(features))
+		return PTR_ERR(features);
+
+	cxl_mbox->features = features;
+
+	return 0;
+}
+
 static int cxl_pci_setup_mailbox(struct cxl_memdev_state *mds, bool irq_avail)
 {
 	struct cxl_dev_state *cxlds = &mds->cxlds;
@@ -980,6 +995,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
+	rc = cxl_pci_setup_features(mds);
+	if (rc)
+		return rc;
+
 	rc = cxl_set_timestamp(mds);
 	if (rc)
 		return rc;
diff --git a/include/cxl/features.h b/include/cxl/features.h
new file mode 100644
index 000000000000..b92da1e92780
--- /dev/null
+++ b/include/cxl/features.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2024-2025 Intel Corporation. */
+#ifndef __CXL_FEATURES_H__
+#define __CXL_FEATURES_H__
+
+struct cxl_mailbox;
+
+struct cxl_features {
+	int id;
+	struct device dev;
+	struct cxl_mailbox *cxl_mbox;
+};
+#define to_cxl_features(dev) container_of(dev, struct cxl_features, dev)
+
+struct cxl_features_state {
+	struct cxl_features *features;
+	int num_features;
+};
+
+struct cxl_features *cxl_features_alloc(struct cxl_mailbox *cxl_mbox,
+					struct device *parent);
+
+#endif
diff --git a/include/cxl/mailbox.h b/include/cxl/mailbox.h
index cc894f07a435..6caab0d406ba 100644
--- a/include/cxl/mailbox.h
+++ b/include/cxl/mailbox.h
@@ -3,6 +3,7 @@
 #ifndef __CXL_MBOX_H__
 #define __CXL_MBOX_H__
 #include <linux/rcuwait.h>
+#include <cxl/features.h>
 #include <uapi/linux/cxl_mem.h>
 
 /**
@@ -50,6 +51,7 @@ struct cxl_mbox_cmd {
  *                (CXL 3.1 8.2.8.4.3 Mailbox Capabilities Register)
  * @mbox_mutex: mutex protects device mailbox and firmware
  * @mbox_wait: rcuwait for mailbox
+ * @features: pointer to cxl_features device
  * @mbox_send: @dev specific transport for transmitting mailbox commands
  */
 struct cxl_mailbox {
@@ -59,6 +61,7 @@ struct cxl_mailbox {
 	size_t payload_size;
 	struct mutex mbox_mutex; /* lock to protect mailbox context */
 	struct rcuwait mbox_wait;
+	struct cxl_features *features;
 	int (*mbox_send)(struct cxl_mailbox *cxl_mbox, struct cxl_mbox_cmd *cmd);
 };
 
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index b1256fee3567..79de943841f4 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -61,6 +61,7 @@ cxl_core-y += $(CXL_CORE_SRC)/pci.o
 cxl_core-y += $(CXL_CORE_SRC)/hdm.o
 cxl_core-y += $(CXL_CORE_SRC)/pmu.o
 cxl_core-y += $(CXL_CORE_SRC)/cdat.o
+cxl_core-y += $(CXL_CORE_SRC)/features.o
 cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
 cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o
 cxl_core-y += config_check.o
-- 
2.43.0



* [PATCH v18 09/19] cxl: Enumerate feature commands
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (7 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 08/19] cxl: Add skeletal features driver shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-06 12:10 ` [PATCH v18 10/19] cxl: Add Get Supported Features command for kernel usage shiju.jose
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Dave Jiang <dave.jiang@intel.com>

Add feature command enumeration code to detect and enumerate the three
feature-related commands: "Get Supported Features", "Get Feature", and
"Set Feature". The enumeration determines whether the driver can issue
any of the three commands to the device.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/mbox.c | 41 +++++++++++++++++++++++++++++++++++++++++
 drivers/cxl/cxlmem.h    |  3 +++
 include/cxl/features.h  |  7 +++++++
 include/cxl/mailbox.h   |  1 +
 4 files changed, 52 insertions(+)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index bdb8f060f2c1..5e21ff99d70f 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -38,6 +38,21 @@ static bool cxl_raw_allow_all;
 	.flags = _flags,                                                       \
 	}
 
+#define cxl_for_each_feature_cmd(cmd)                                          \
+	for ((cmd) = &cxl_feature_commands[0];                                 \
+	     ((cmd) - cxl_feature_commands) < ARRAY_SIZE(cxl_feature_commands); (cmd)++)
+
+#define CXL_FEATURE_CMD(_id, sin, sout, _flags)                                \
+	[CXL_FEATURE_ID_##_id] = {                                             \
+	.info =	{                                                              \
+			.id = CXL_FEATURE_ID_##_id,                            \
+			.size_in = sin,                                        \
+			.size_out = sout,                                      \
+		},                                                             \
+	.opcode = CXL_MBOX_OP_##_id,                                           \
+	.flags = _flags,                                                       \
+	}
+
 #define CXL_VARIABLE_PAYLOAD	~0U
 /*
  * This table defines the supported mailbox commands for the driver. This table
@@ -69,6 +84,13 @@ static struct cxl_mem_command cxl_mem_commands[CXL_MEM_COMMAND_ID_MAX] = {
 	CXL_CMD(GET_TIMESTAMP, 0, 0x8, 0),
 };
 
+#define CXL_FEATURE_COMMAND_ID_MAX 3
+static struct cxl_mem_command cxl_feature_commands[CXL_FEATURE_COMMAND_ID_MAX] = {
+	CXL_FEATURE_CMD(GET_SUPPORTED_FEATURES, 0x8, CXL_VARIABLE_PAYLOAD, 0),
+	CXL_FEATURE_CMD(GET_FEATURE, 0xf, CXL_VARIABLE_PAYLOAD, 0),
+	CXL_FEATURE_CMD(SET_FEATURE, CXL_VARIABLE_PAYLOAD, 0, 0),
+};
+
 /*
  * Commands that RAW doesn't permit. The rationale for each:
  *
@@ -212,6 +234,17 @@ static struct cxl_mem_command *cxl_mem_find_command(u16 opcode)
 	return NULL;
 }
 
+static struct cxl_mem_command *cxl_find_feature_command(u16 opcode)
+{
+	struct cxl_mem_command *c;
+
+	cxl_for_each_feature_cmd(c)
+		if (c->opcode == opcode)
+			return c;
+
+	return NULL;
+}
+
 static const char *cxl_mem_opcode_to_name(u16 opcode)
 {
 	struct cxl_mem_command *c;
@@ -734,6 +767,14 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
 		if (cmd) {
 			set_bit(cmd->info.id, cxl_mbox->enabled_cmds);
 			enabled++;
+		} else {
+			struct cxl_mem_command *fcmd =
+				cxl_find_feature_command(opcode);
+
+			if (fcmd) {
+				set_bit(fcmd->info.id, cxl_mbox->feature_cmds);
+				enabled++;
+			}
 		}
 
 		if (cxl_is_poison_command(opcode)) {
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index a0a49809cd76..55c55685cb39 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -490,6 +490,9 @@ enum cxl_opcode {
 	CXL_MBOX_OP_GET_LOG_CAPS	= 0x0402,
 	CXL_MBOX_OP_CLEAR_LOG           = 0x0403,
 	CXL_MBOX_OP_GET_SUP_LOG_SUBLIST = 0x0405,
+	CXL_MBOX_OP_GET_SUPPORTED_FEATURES	= 0x0500,
+	CXL_MBOX_OP_GET_FEATURE		= 0x0501,
+	CXL_MBOX_OP_SET_FEATURE		= 0x0502,
 	CXL_MBOX_OP_IDENTIFY		= 0x4000,
 	CXL_MBOX_OP_GET_PARTITION_INFO	= 0x4100,
 	CXL_MBOX_OP_SET_PARTITION_INFO	= 0x4101,
diff --git a/include/cxl/features.h b/include/cxl/features.h
index b92da1e92780..7a8be3c621a1 100644
--- a/include/cxl/features.h
+++ b/include/cxl/features.h
@@ -5,6 +5,13 @@
 
 struct cxl_mailbox;
 
+enum feature_cmds {
+	CXL_FEATURE_ID_GET_SUPPORTED_FEATURES = 0,
+	CXL_FEATURE_ID_GET_FEATURE,
+	CXL_FEATURE_ID_SET_FEATURE,
+	CXL_FEATURE_ID_MAX,
+};
+
 struct cxl_features {
 	int id;
 	struct device dev;
diff --git a/include/cxl/mailbox.h b/include/cxl/mailbox.h
index 6caab0d406ba..263fc346aeb1 100644
--- a/include/cxl/mailbox.h
+++ b/include/cxl/mailbox.h
@@ -58,6 +58,7 @@ struct cxl_mailbox {
 	struct device *host;
 	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
 	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
+	DECLARE_BITMAP(feature_cmds, CXL_FEATURE_ID_MAX);
 	size_t payload_size;
 	struct mutex mbox_mutex; /* lock to protect mailbox context */
 	struct rcuwait mbox_wait;
-- 
2.43.0



* [PATCH v18 10/19] cxl: Add Get Supported Features command for kernel usage
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (8 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 09/19] cxl: Enumerate feature commands shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-06 12:10 ` [PATCH v18 11/19] cxl: Add features driver attribute to emit number of features supported shiju.jose
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Dave Jiang <dave.jiang@intel.com>

CXL spec r3.1 8.2.9.6.1 Get Supported Features (Opcode 0500h)
The command retrieves the list of supported device-specific features
(identified by UUID) and general information about each Feature.

The driver retrieves the feature entries in order to perform checks and
to provide information for the Get Feature and Set Feature commands. One
of the main pieces of information retrieved is the set of effects a Set
Feature command would have for a particular feature.

Co-developed-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/features.c |  28 +++++++
 drivers/cxl/core/mbox.c     |   3 +-
 drivers/cxl/cxl.h           |   2 +
 drivers/cxl/features.c      | 146 +++++++++++++++++++++++++++++++++++-
 include/cxl/features.h      |  32 ++++++++
 5 files changed, 209 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/features.c b/drivers/cxl/core/features.c
index eb6eb191a32e..66a4b82910e6 100644
--- a/drivers/cxl/core/features.c
+++ b/drivers/cxl/core/features.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /* Copyright(c) 2024-2025 Intel Corporation. All rights reserved. */
 #include <linux/device.h>
+#include <cxl/mailbox.h>
 #include "cxl.h"
 #include "core.h"
 
@@ -69,3 +70,30 @@ struct cxl_features *cxl_features_alloc(struct cxl_mailbox *cxl_mbox,
 	return ERR_PTR(rc);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_features_alloc, "CXL");
+
+struct cxl_feat_entry *
+cxl_get_supported_feature_entry(struct cxl_features *features,
+				const uuid_t *feat_uuid)
+{
+	struct cxl_feat_entry *feat_entry;
+	struct cxl_features_state *cfs;
+	int count;
+
+	cfs = dev_get_drvdata(&features->dev);
+	if (!cfs)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	if (!cfs->num_features)
+		return ERR_PTR(-ENOENT);
+
+	/* Check CXL dev supports the feature */
+	feat_entry = cfs->entries;
+	for (count = 0; count < cfs->num_features;
+	     count++, feat_entry++) {
+		if (uuid_equal(&feat_entry->uuid, feat_uuid))
+			return feat_entry;
+	}
+
+	return ERR_PTR(-ENOENT);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_get_supported_feature_entry, "CXL");
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 5e21ff99d70f..0b4946205910 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -234,7 +234,7 @@ static struct cxl_mem_command *cxl_mem_find_command(u16 opcode)
 	return NULL;
 }
 
-static struct cxl_mem_command *cxl_find_feature_command(u16 opcode)
+struct cxl_mem_command *cxl_find_feature_command(u16 opcode)
 {
 	struct cxl_mem_command *c;
 
@@ -244,6 +244,7 @@ static struct cxl_mem_command *cxl_find_feature_command(u16 opcode)
 
 	return NULL;
 }
+EXPORT_SYMBOL_NS_GPL(cxl_find_feature_command, "CXL");
 
 static const char *cxl_mem_opcode_to_name(u16 opcode)
 {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index ee29d1a1c8df..1284614d71d0 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -912,6 +912,8 @@ void cxl_coordinates_combine(struct access_coordinate *out,
 
 bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);
 
+struct cxl_mem_command *cxl_find_feature_command(u16 opcode);
+
 /*
  * Unit test builds overrides this to __weak, find the 'strong' version
  * of these symbols in tools/testing/cxl/.
diff --git a/drivers/cxl/features.c b/drivers/cxl/features.c
index 93b16b5e2b68..2cdf5ed0a771 100644
--- a/drivers/cxl/features.c
+++ b/drivers/cxl/features.c
@@ -3,20 +3,164 @@
 #include <linux/device.h>
 #include <linux/module.h>
 #include <linux/pci.h>
+#include <cxl/mailbox.h>
 #include <cxl/features.h>
 
 #include "cxl.h"
+#include "cxlmem.h"
+
+static void cxl_free_feature_entries(void *entries)
+{
+	kvfree(entries);
+}
+
+static int cxl_get_supported_features_count(struct cxl_mailbox *cxl_mbox)
+{
+	struct cxl_mbox_get_sup_feats_out mbox_out;
+	struct cxl_mbox_get_sup_feats_in mbox_in;
+	struct cxl_mbox_cmd mbox_cmd;
+	int rc;
+
+	memset(&mbox_in, 0, sizeof(mbox_in));
+	mbox_in.count = cpu_to_le32(sizeof(mbox_out));
+	memset(&mbox_out, 0, sizeof(mbox_out));
+	mbox_cmd = (struct cxl_mbox_cmd) {
+		.opcode = CXL_MBOX_OP_GET_SUPPORTED_FEATURES,
+		.size_in = sizeof(mbox_in),
+		.payload_in = &mbox_in,
+		.size_out = sizeof(mbox_out),
+		.payload_out = &mbox_out,
+		.min_out = sizeof(mbox_out),
+	};
+	rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+	if (rc < 0)
+		return rc;
+
+	return le16_to_cpu(mbox_out.supported_feats);
+}
+
+static int cxl_get_supported_features(struct cxl_features_state *cfs)
+{
+	int remain_feats, max_size, max_feats, start, rc, hdr_size;
+	struct cxl_mailbox *cxl_mbox = cfs->features->cxl_mbox;
+	int feat_size = sizeof(struct cxl_feat_entry);
+	struct cxl_mbox_get_sup_feats_in mbox_in;
+	struct cxl_feat_entry *entry;
+	struct cxl_mbox_cmd mbox_cmd;
+	struct cxl_mem_command *cmd;
+	int count;
+
+	/* Get supported features is optional, need to check */
+	cmd = cxl_find_feature_command(CXL_MBOX_OP_GET_SUPPORTED_FEATURES);
+	if (!cmd)
+		return -EOPNOTSUPP;
+	if (!test_bit(cmd->info.id, cxl_mbox->feature_cmds))
+		return -EOPNOTSUPP;
+
+	count = cxl_get_supported_features_count(cxl_mbox);
+	if (count == 0)
+		return 0;
+	if (count < 0)
+		return -ENXIO;
+
+	struct cxl_feat_entry *entries __free(kvfree) =
+		kvmalloc(count * sizeof(*entries), GFP_KERNEL);
+	if (!entries)
+		return -ENOMEM;
+
+	struct cxl_mbox_get_sup_feats_out *mbox_out __free(kvfree) =
+		kvmalloc(cxl_mbox->payload_size, GFP_KERNEL);
+	if (!mbox_out)
+		return -ENOMEM;
+
+	hdr_size = sizeof(*mbox_out);
+	max_size = cxl_mbox->payload_size - hdr_size;
+	/* max feat entries that can fit in mailbox max payload size */
+	max_feats = max_size / feat_size;
+	entry = entries;
+
+	start = 0;
+	remain_feats = count;
+	do {
+		int retrieved, alloc_size, copy_feats;
+		int num_entries;
+
+		if (remain_feats > max_feats) {
+			alloc_size = sizeof(*mbox_out) + max_feats * feat_size;
+			remain_feats = remain_feats - max_feats;
+			copy_feats = max_feats;
+		} else {
+			alloc_size = sizeof(*mbox_out) + remain_feats * feat_size;
+			copy_feats = remain_feats;
+			remain_feats = 0;
+		}
+
+		memset(&mbox_in, 0, sizeof(mbox_in));
+		mbox_in.count = cpu_to_le32(alloc_size);
+		mbox_in.start_idx = cpu_to_le16(start);
+		memset(mbox_out, 0, alloc_size);
+		mbox_cmd = (struct cxl_mbox_cmd) {
+			.opcode = CXL_MBOX_OP_GET_SUPPORTED_FEATURES,
+			.size_in = sizeof(mbox_in),
+			.payload_in = &mbox_in,
+			.size_out = alloc_size,
+			.payload_out = mbox_out,
+			.min_out = hdr_size,
+		};
+		rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+		if (rc < 0)
+			return rc;
+
+		if (mbox_cmd.size_out <= hdr_size)
+			return -ENXIO;
+
+		/*
+		 * Make sure retrieved out buffer is multiple of feature
+		 * entries.
+		 */
+		retrieved = mbox_cmd.size_out - hdr_size;
+		if (retrieved % feat_size)
+			return -ENXIO;
+
+		num_entries = le16_to_cpu(mbox_out->num_entries);
+		/*
+		 * If the reported output entries * defined entry size !=
+		 * retrieved output bytes, then the output package is incorrect.
+		 */
+		if (num_entries * feat_size != retrieved)
+			return -ENXIO;
+
+		memcpy(entry, mbox_out->ents, retrieved);
+		entry += num_entries;
+		/*
+		 * If the number of output entries is less than expected, add the
+		 * remaining entries to the next batch.
+		 */
+		remain_feats += copy_feats - num_entries;
+		start += num_entries;
+	} while (remain_feats);
+
+	cfs->num_features = count;
+	cfs->entries = no_free_ptr(entries);
+	return devm_add_action_or_reset(&cfs->features->dev,
+					cxl_free_feature_entries, cfs->entries);
+}
 
 static int cxl_features_probe(struct device *dev)
 {
 	struct cxl_features *features = to_cxl_features(dev);
+	int rc;
+
 	struct cxl_features_state *cfs __free(kfree) =
 		kzalloc(sizeof(*cfs), GFP_KERNEL);
-
 	if (!cfs)
 		return -ENOMEM;
 
 	cfs->features = features;
+	rc = cxl_get_supported_features(cfs);
+	if (rc)
+		return rc;
+
 	dev_set_drvdata(dev, no_free_ptr(cfs));
 
 	return 0;
diff --git a/include/cxl/features.h b/include/cxl/features.h
index 7a8be3c621a1..429b9782667c 100644
--- a/include/cxl/features.h
+++ b/include/cxl/features.h
@@ -3,6 +3,8 @@
 #ifndef __CXL_FEATURES_H__
 #define __CXL_FEATURES_H__
 
+#include <linux/uuid.h>
+
 struct cxl_mailbox;
 
 enum feature_cmds {
@@ -19,12 +21,42 @@ struct cxl_features {
 };
 #define to_cxl_features(dev) container_of(dev, struct cxl_features, dev)
 
+/* Get Supported Features (0x0500) CXL r3.1 8.2.9.6.1 */
+struct cxl_mbox_get_sup_feats_in {
+	__le32 count;
+	__le16 start_idx;
+	u8 reserved[2];
+} __packed;
+
+struct cxl_feat_entry {
+	uuid_t uuid;
+	__le16 id;
+	__le16 get_feat_size;
+	__le16 set_feat_size;
+	__le32 flags;
+	u8 get_feat_ver;
+	u8 set_feat_ver;
+	__le16 effects;
+	u8 reserved[18];
+} __packed;
+
+struct cxl_mbox_get_sup_feats_out {
+	__le16 num_entries;
+	__le16 supported_feats;
+	u8 reserved[4];
+	struct cxl_feat_entry ents[] __counted_by_le(num_entries);
+} __packed;
+
 struct cxl_features_state {
 	struct cxl_features *features;
 	int num_features;
+	struct cxl_feat_entry *entries;
 };
 
 struct cxl_features *cxl_features_alloc(struct cxl_mailbox *cxl_mbox,
 					struct device *parent);
+struct cxl_feat_entry *
+cxl_get_supported_feature_entry(struct cxl_features *features,
+				const uuid_t *feat_uuid);
 
 #endif
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v18 11/19] cxl: Add features driver attribute to emit number of features supported
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (9 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 10/19] cxl: Add Get Supported Features command for kernel usage shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-06 12:10 ` [PATCH v18 12/19] cxl/mbox: Add GET_FEATURE mailbox command shiju.jose
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Dave Jiang <dave.jiang@intel.com>

Add a sysfs attribute that emits the number of features supported by the
driver/device. This is useful for userspace to determine how many features
to query for.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/features.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/drivers/cxl/features.c b/drivers/cxl/features.c
index 2cdf5ed0a771..2f0fb072921e 100644
--- a/drivers/cxl/features.c
+++ b/drivers/cxl/features.c
@@ -173,11 +173,38 @@ static void cxl_features_remove(struct device *dev)
 	kfree(cfs);
 }
 
+static ssize_t features_show(struct device *dev, struct device_attribute *attr,
+			     char *buf)
+{
+	struct cxl_features_state *cfs = dev_get_drvdata(dev);
+
+	if (!cfs)
+		return -ENOENT;
+
+	return sysfs_emit(buf, "%d\n", cfs->num_features);
+}
+
+static DEVICE_ATTR_RO(features);
+
+static struct attribute *cxl_features_attrs[] = {
+	&dev_attr_features.attr,
+	NULL
+};
+
+static struct attribute_group cxl_features_group = {
+	.attrs = cxl_features_attrs,
+};
+
+__ATTRIBUTE_GROUPS(cxl_features);
+
 static struct cxl_driver cxl_features_driver = {
 	.name = "cxl_features",
 	.probe = cxl_features_probe,
 	.remove = cxl_features_remove,
 	.id = CXL_DEVICE_FEATURES,
+	.drv = {
+		.dev_groups = cxl_features_groups,
+	},
 };
 
 module_cxl_driver(cxl_features_driver);
-- 
2.43.0



* [PATCH v18 12/19] cxl/mbox: Add GET_FEATURE mailbox command
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (10 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 11/19] cxl: Add features driver attribute to emit number of features supported shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-06 12:10 ` [PATCH v18 13/19] cxl/mbox: Add SET_FEATURE " shiju.jose
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Add support for the GET_FEATURE mailbox command.

CXL spec 3.1 section 8.2.9.6 describes optional device-specific features.
The settings of a feature can be retrieved using the Get Feature command,
which is described in CXL spec 3.1 section 8.2.9.6.2.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/features.c | 74 +++++++++++++++++++++++++++++++++++++
 drivers/cxl/features.c      |  6 +--
 include/cxl/features.h      | 27 ++++++++++++++
 3 files changed, 102 insertions(+), 5 deletions(-)

diff --git a/drivers/cxl/core/features.c b/drivers/cxl/core/features.c
index 66a4b82910e6..ab9386b53a95 100644
--- a/drivers/cxl/core/features.c
+++ b/drivers/cxl/core/features.c
@@ -4,6 +4,7 @@
 #include <cxl/mailbox.h>
 #include "cxl.h"
 #include "core.h"
+#include "cxlmem.h"
 
 #define CXL_FEATURE_MAX_DEVS 65536
 static DEFINE_IDA(cxl_features_ida);
@@ -97,3 +98,76 @@ cxl_get_supported_feature_entry(struct cxl_features *features,
 	return ERR_PTR(-ENOENT);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_get_supported_feature_entry, "CXL");
+
+bool cxl_feature_enabled(struct cxl_features_state *cfs, u16 opcode)
+{
+	struct cxl_mailbox *cxl_mbox = cfs->features->cxl_mbox;
+	struct cxl_mem_command *cmd;
+
+	cmd = cxl_find_feature_command(opcode);
+	if (!cmd)
+		return false;
+
+	return test_bit(cmd->info.id, cxl_mbox->feature_cmds);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_feature_enabled, "CXL");
+
+size_t cxl_get_feature(struct cxl_features *features, const uuid_t feat_uuid,
+		       enum cxl_get_feat_selection selection,
+		       void *feat_out, size_t feat_out_size, u16 offset,
+		       u16 *return_code)
+{
+	size_t data_to_rd_size, size_out;
+	struct cxl_features_state *cfs;
+	struct cxl_mbox_get_feat_in pi;
+	struct cxl_mailbox *cxl_mbox;
+	struct cxl_mbox_cmd mbox_cmd;
+	size_t data_rcvd_size = 0;
+	int rc;
+
+	if (return_code)
+		*return_code = CXL_MBOX_CMD_RC_INPUT;
+
+	cfs = dev_get_drvdata(&features->dev);
+	if (!cfs)
+		return 0;
+
+	if (!cxl_feature_enabled(cfs, CXL_MBOX_OP_GET_FEATURE))
+		return 0;
+
+	if (!feat_out || !feat_out_size)
+		return 0;
+
+	cxl_mbox = features->cxl_mbox;
+	size_out = min(feat_out_size, cxl_mbox->payload_size);
+	pi.uuid = feat_uuid;
+	pi.selection = selection;
+	do {
+		data_to_rd_size = min(feat_out_size - data_rcvd_size,
+				      cxl_mbox->payload_size);
+		pi.offset = cpu_to_le16(offset + data_rcvd_size);
+		pi.count = cpu_to_le16(data_to_rd_size);
+
+		mbox_cmd = (struct cxl_mbox_cmd) {
+			.opcode = CXL_MBOX_OP_GET_FEATURE,
+			.size_in = sizeof(pi),
+			.payload_in = &pi,
+			.size_out = size_out,
+			.payload_out = feat_out + data_rcvd_size,
+			.min_out = data_to_rd_size,
+		};
+		rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+		if (rc < 0 || !mbox_cmd.size_out) {
+			if (return_code)
+				*return_code = mbox_cmd.return_code;
+			return 0;
+		}
+		data_rcvd_size += mbox_cmd.size_out;
+	} while (data_rcvd_size < feat_out_size);
+
+	if (return_code)
+		*return_code = CXL_MBOX_CMD_RC_SUCCESS;
+
+	return data_rcvd_size;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_get_feature, "CXL");
diff --git a/drivers/cxl/features.c b/drivers/cxl/features.c
index 2f0fb072921e..2faa9e03a840 100644
--- a/drivers/cxl/features.c
+++ b/drivers/cxl/features.c
@@ -47,14 +47,10 @@ static int cxl_get_supported_features(struct cxl_features_state *cfs)
 	struct cxl_mbox_get_sup_feats_in mbox_in;
 	struct cxl_feat_entry *entry;
 	struct cxl_mbox_cmd mbox_cmd;
-	struct cxl_mem_command *cmd;
 	int count;
 
 	/* Get supported features is optional, need to check */
-	cmd = cxl_find_feature_command(CXL_MBOX_OP_GET_SUPPORTED_FEATURES);
-	if (!cmd)
-		return -EOPNOTSUPP;
-	if (!test_bit(cmd->info.id, cxl_mbox->feature_cmds))
+	if (!cxl_feature_enabled(cfs, CXL_MBOX_OP_GET_SUPPORTED_FEATURES))
 		return -EOPNOTSUPP;
 
 	count = cxl_get_supported_features_count(cxl_mbox);
diff --git a/include/cxl/features.h b/include/cxl/features.h
index 429b9782667c..582fb85da5b9 100644
--- a/include/cxl/features.h
+++ b/include/cxl/features.h
@@ -53,10 +53,37 @@ struct cxl_features_state {
 	struct cxl_feat_entry *entries;
 };
 
+/*
+ * Get Feature CXL 3.1 Spec 8.2.9.6.2
+ */
+
+/*
+ * Get Feature input payload
+ * CXL rev 3.1 section 8.2.9.6.2 Table 8-99
+ */
+enum cxl_get_feat_selection {
+	CXL_GET_FEAT_SEL_CURRENT_VALUE,
+	CXL_GET_FEAT_SEL_DEFAULT_VALUE,
+	CXL_GET_FEAT_SEL_SAVED_VALUE,
+	CXL_GET_FEAT_SEL_MAX
+};
+
+struct cxl_mbox_get_feat_in {
+	uuid_t uuid;
+	__le16 offset;
+	__le16 count;
+	u8 selection;
+}  __packed;
+
+bool cxl_feature_enabled(struct cxl_features_state *cfs, u16 opcode);
 struct cxl_features *cxl_features_alloc(struct cxl_mailbox *cxl_mbox,
 					struct device *parent);
 struct cxl_feat_entry *
 cxl_get_supported_feature_entry(struct cxl_features *features,
 				const uuid_t *feat_uuid);
+size_t cxl_get_feature(struct cxl_features *features, const uuid_t feat_uuid,
+		       enum cxl_get_feat_selection selection,
+		       void *feat_out, size_t feat_out_size, u16 offset,
+		       u16 *return_code);
 
 #endif
-- 
2.43.0



* [PATCH v18 13/19] cxl/mbox: Add SET_FEATURE mailbox command
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (11 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 12/19] cxl/mbox: Add GET_FEATURE mailbox command shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-06 12:10 ` [PATCH v18 14/19] cxl: Setup exclusive CXL features that are reserved for the kernel shiju.jose
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Add support for the SET_FEATURE mailbox command.

CXL spec 3.1 section 8.2.9.6 describes optional device-specific features.
CXL devices support features with changeable attributes; the settings of
a feature can optionally be modified using the Set Feature command, which
is described in CXL spec 3.1 section 8.2.9.6.3.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/features.c | 92 +++++++++++++++++++++++++++++++++++++
 include/cxl/features.h      | 32 +++++++++++++
 2 files changed, 124 insertions(+)

diff --git a/drivers/cxl/core/features.c b/drivers/cxl/core/features.c
index ab9386b53a95..932e82b52f90 100644
--- a/drivers/cxl/core/features.c
+++ b/drivers/cxl/core/features.c
@@ -171,3 +171,95 @@ size_t cxl_get_feature(struct cxl_features *features, const uuid_t feat_uuid,
 	return data_rcvd_size;
 }
 EXPORT_SYMBOL_NS_GPL(cxl_get_feature, "CXL");
+
+/*
+ * FEAT_DATA_MIN_PAYLOAD_SIZE - minimum number of extra bytes that must be
+ * available in the mailbox for storing the actual feature data so that
+ * the feature data transfer works as expected.
+ */
+#define FEAT_DATA_MIN_PAYLOAD_SIZE 10
+int cxl_set_feature(struct cxl_features *features,
+		    const uuid_t feat_uuid, u8 feat_version,
+		    void *feat_data, size_t feat_data_size,
+		    u32 feat_flag, u16 offset, u16 *return_code)
+{
+	struct cxl_memdev_set_feat_pi {
+		struct cxl_mbox_set_feat_hdr hdr;
+		u8 feat_data[];
+	}  __packed;
+	size_t data_in_size, data_sent_size = 0;
+	struct cxl_features_state *cfs;
+	struct cxl_mbox_cmd mbox_cmd;
+	struct cxl_mailbox *cxl_mbox;
+	size_t hdr_size;
+	int rc = 0;
+
+	if (return_code)
+		*return_code = CXL_MBOX_CMD_RC_INPUT;
+
+	cfs = dev_get_drvdata(&features->dev);
+	if (!cfs)
+		return -EOPNOTSUPP;
+
+	if (!cxl_feature_enabled(cfs, CXL_MBOX_OP_SET_FEATURE))
+		return -EOPNOTSUPP;
+
+	cxl_mbox = features->cxl_mbox;
+
+	struct cxl_memdev_set_feat_pi *pi __free(kfree) =
+					kmalloc(cxl_mbox->payload_size, GFP_KERNEL);
+	pi->hdr.uuid = feat_uuid;
+	pi->hdr.version = feat_version;
+	feat_flag &= ~CXL_SET_FEAT_FLAG_DATA_TRANSFER_MASK;
+	feat_flag |= CXL_SET_FEAT_FLAG_DATA_SAVED_ACROSS_RESET;
+	hdr_size = sizeof(pi->hdr);
+	/*
+	 * Check minimum mbox payload size is available for
+	 * the feature data transfer.
+	 */
+	if (hdr_size + FEAT_DATA_MIN_PAYLOAD_SIZE > cxl_mbox->payload_size)
+		return -ENOMEM;
+
+	if ((hdr_size + feat_data_size) <= cxl_mbox->payload_size) {
+		pi->hdr.flags = cpu_to_le32(feat_flag |
+				       CXL_SET_FEAT_FLAG_FULL_DATA_TRANSFER);
+		data_in_size = feat_data_size;
+	} else {
+		pi->hdr.flags = cpu_to_le32(feat_flag |
+				       CXL_SET_FEAT_FLAG_INITIATE_DATA_TRANSFER);
+		data_in_size = cxl_mbox->payload_size - hdr_size;
+	}
+
+	do {
+		pi->hdr.offset = cpu_to_le16(offset + data_sent_size);
+		memcpy(pi->feat_data, feat_data + data_sent_size, data_in_size);
+		mbox_cmd = (struct cxl_mbox_cmd) {
+			.opcode = CXL_MBOX_OP_SET_FEATURE,
+			.size_in = hdr_size + data_in_size,
+			.payload_in = pi,
+		};
+		rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+		if (rc < 0) {
+			if (return_code)
+				*return_code = mbox_cmd.return_code;
+			return rc;
+		}
+
+		data_sent_size += data_in_size;
+		if (data_sent_size >= feat_data_size) {
+			if (return_code)
+				*return_code = CXL_MBOX_CMD_RC_SUCCESS;
+			return 0;
+		}
+
+		if ((feat_data_size - data_sent_size) <= (cxl_mbox->payload_size - hdr_size)) {
+			data_in_size = feat_data_size - data_sent_size;
+			pi->hdr.flags = cpu_to_le32(feat_flag |
+					       CXL_SET_FEAT_FLAG_FINISH_DATA_TRANSFER);
+		} else {
+			pi->hdr.flags = cpu_to_le32(feat_flag |
+					       CXL_SET_FEAT_FLAG_CONTINUE_DATA_TRANSFER);
+		}
+	} while (true);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_set_feature, "CXL");
diff --git a/include/cxl/features.h b/include/cxl/features.h
index 582fb85da5b9..41d3602b186c 100644
--- a/include/cxl/features.h
+++ b/include/cxl/features.h
@@ -75,6 +75,35 @@ struct cxl_mbox_get_feat_in {
 	u8 selection;
 }  __packed;
 
+/*
+ * Set Feature CXL 3.1 Spec 8.2.9.6.3
+ */
+
+/*
+ * Set Feature input payload
+ * CXL rev 3.1 section 8.2.9.6.3 Table 8-101
+ */
+/* Set Feature : Payload in flags */
+#define CXL_SET_FEAT_FLAG_DATA_TRANSFER_MASK	GENMASK(2, 0)
+enum cxl_set_feat_flag_data_transfer {
+	CXL_SET_FEAT_FLAG_FULL_DATA_TRANSFER,
+	CXL_SET_FEAT_FLAG_INITIATE_DATA_TRANSFER,
+	CXL_SET_FEAT_FLAG_CONTINUE_DATA_TRANSFER,
+	CXL_SET_FEAT_FLAG_FINISH_DATA_TRANSFER,
+	CXL_SET_FEAT_FLAG_ABORT_DATA_TRANSFER,
+	CXL_SET_FEAT_FLAG_DATA_TRANSFER_MAX
+};
+
+#define CXL_SET_FEAT_FLAG_DATA_SAVED_ACROSS_RESET	BIT(3)
+
+struct cxl_mbox_set_feat_hdr {
+	uuid_t uuid;
+	__le32 flags;
+	__le16 offset;
+	u8 version;
+	u8 rsvd[9];
+}  __packed;
+
 bool cxl_feature_enabled(struct cxl_features_state *cfs, u16 opcode);
 struct cxl_features *cxl_features_alloc(struct cxl_mailbox *cxl_mbox,
 					struct device *parent);
@@ -85,5 +114,8 @@ size_t cxl_get_feature(struct cxl_features *features, const uuid_t feat_uuid,
 		       enum cxl_get_feat_selection selection,
 		       void *feat_out, size_t feat_out_size, u16 offset,
 		       u16 *return_code);
+int cxl_set_feature(struct cxl_features *features, const uuid_t feat_uuid,
+		    u8 feat_version, void *feat_data, size_t feat_data_size,
+		    u32 feat_flag, u16 offset, u16 *return_code);
 
 #endif
-- 
2.43.0



* [PATCH v18 14/19] cxl: Setup exclusive CXL features that are reserved for the kernel
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (12 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 13/19] cxl/mbox: Add SET_FEATURE " shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-06 12:10 ` [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol scrub control feature shiju.jose
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Dave Jiang <dave.jiang@intel.com>

Certain features will be used exclusively by kernel components such as
the in-kernel RAS driver. Set up an exclusion list so those features can
later be filtered out before being exposed to user space.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/features.c | 22 ++++++++++++++++++++++
 drivers/cxl/features.c      |  6 +++++-
 include/cxl/features.h      | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/features.c b/drivers/cxl/core/features.c
index 932e82b52f90..c836b3a64561 100644
--- a/drivers/cxl/core/features.c
+++ b/drivers/cxl/core/features.c
@@ -6,6 +6,17 @@
 #include "core.h"
 #include "cxlmem.h"
 
+static const uuid_t cxl_exclusive_feats[] = {
+	CXL_FEAT_PATROL_SCRUB_UUID,
+	CXL_FEAT_ECS_UUID,
+	CXL_FEAT_SPPR_UUID,
+	CXL_FEAT_HPPR_UUID,
+	CXL_FEAT_CACHELINE_SPARING_UUID,
+	CXL_FEAT_ROW_SPARING_UUID,
+	CXL_FEAT_BANK_SPARING_UUID,
+	CXL_FEAT_RANK_SPARING_UUID,
+};
+
 #define CXL_FEATURE_MAX_DEVS 65536
 static DEFINE_IDA(cxl_features_ida);
 
@@ -263,3 +274,14 @@ int cxl_set_feature(struct cxl_features *features,
 	} while (true);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_set_feature, "CXL");
+
+bool is_cxl_feature_exclusive(struct cxl_feat_entry *entry)
+{
+	for (int i = 0; i < ARRAY_SIZE(cxl_exclusive_feats); i++) {
+		if (uuid_equal(&entry->uuid, &cxl_exclusive_feats[i]))
+			return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_NS_GPL(is_cxl_feature_exclusive, "CXL");
diff --git a/drivers/cxl/features.c b/drivers/cxl/features.c
index 2faa9e03a840..3388bce5943a 100644
--- a/drivers/cxl/features.c
+++ b/drivers/cxl/features.c
@@ -47,6 +47,7 @@ static int cxl_get_supported_features(struct cxl_features_state *cfs)
 	struct cxl_mbox_get_sup_feats_in mbox_in;
 	struct cxl_feat_entry *entry;
 	struct cxl_mbox_cmd mbox_cmd;
+	int user_feats = 0;
 	int count;
 
 	/* Get supported features is optional, need to check */
@@ -127,6 +128,8 @@ static int cxl_get_supported_features(struct cxl_features_state *cfs)
 			return -ENXIO;
 
 		memcpy(entry, mbox_out->ents, retrieved);
+		if (!is_cxl_feature_exclusive(entry))
+			user_feats++;
 		entry += num_entries;
 		/*
 		 * If the number of output entries is less than expected, add the
@@ -138,6 +141,7 @@ static int cxl_get_supported_features(struct cxl_features_state *cfs)
 
 	cfs->num_features = count;
 	cfs->entries = no_free_ptr(entries);
+	cfs->num_user_features = user_feats;
 	return devm_add_action_or_reset(&cfs->features->dev,
 					cxl_free_feature_entries, cfs->entries);
 }
@@ -177,7 +181,7 @@ static ssize_t features_show(struct device *dev, struct device_attribute *attr,
 	if (!cfs)
 		return -ENOENT;
 
-	return sysfs_emit(buf, "%d\n", cfs->num_features);
+	return sysfs_emit(buf, "%d\n", cfs->num_user_features);
 }
 
 static DEVICE_ATTR_RO(features);
diff --git a/include/cxl/features.h b/include/cxl/features.h
index 41d3602b186c..adff3496b4be 100644
--- a/include/cxl/features.h
+++ b/include/cxl/features.h
@@ -5,6 +5,38 @@
 
 #include <linux/uuid.h>
 
+#define CXL_FEAT_PATROL_SCRUB_UUID						\
+	UUID_INIT(0x96dad7d6, 0xfde8, 0x482b, 0xa7, 0x33, 0x75, 0x77, 0x4e,	\
+		  0x06, 0xdb, 0x8a)
+
+#define CXL_FEAT_ECS_UUID							\
+	UUID_INIT(0xe5b13f22, 0x2328, 0x4a14, 0xb8, 0xba, 0xb9, 0x69, 0x1e,	\
+		  0x89, 0x33, 0x86)
+
+#define CXL_FEAT_SPPR_UUID							\
+	UUID_INIT(0x892ba475, 0xfad8, 0x474e, 0x9d, 0x3e, 0x69, 0x2c, 0x91,	\
+		  0x75, 0x68, 0xbb)
+
+#define CXL_FEAT_HPPR_UUID							\
+	UUID_INIT(0x80ea4521, 0x786f, 0x4127, 0xaf, 0xb1, 0xec, 0x74, 0x59,	\
+		  0xfb, 0x0e, 0x24)
+
+#define CXL_FEAT_CACHELINE_SPARING_UUID						\
+	UUID_INIT(0x96C33386, 0x91dd, 0x44c7, 0x9e, 0xcb, 0xfd, 0xaf, 0x65,	\
+		  0x03, 0xba, 0xc4)
+
+#define CXL_FEAT_ROW_SPARING_UUID						\
+	UUID_INIT(0x450ebf67, 0xb135, 0x4f97, 0xa4, 0x98, 0xc2, 0xd5, 0x7f,	\
+		  0x27, 0x9b, 0xed)
+
+#define CXL_FEAT_BANK_SPARING_UUID						\
+	UUID_INIT(0x78b79636, 0x90ac, 0x4b64, 0xa4, 0xef, 0xfa, 0xac, 0x5d,	\
+		  0x18, 0xa8, 0x63)
+
+#define CXL_FEAT_RANK_SPARING_UUID						\
+	UUID_INIT(0x34dbaff5, 0x0552, 0x4281, 0x8f, 0x76, 0xda, 0x0b, 0x5e,	\
+		  0x7a, 0x76, 0xa7)
+
 struct cxl_mailbox;
 
 enum feature_cmds {
@@ -50,6 +82,7 @@ struct cxl_mbox_get_sup_feats_out {
 struct cxl_features_state {
 	struct cxl_features *features;
 	int num_features;
+	int num_user_features;
 	struct cxl_feat_entry *entries;
 };
 
@@ -117,5 +150,6 @@ size_t cxl_get_feature(struct cxl_features *features, const uuid_t feat_uuid,
 int cxl_set_feature(struct cxl_features *features, const uuid_t feat_uuid,
 		    u8 feat_version, void *feat_data, size_t feat_data_size,
 		    u32 feat_flag, u16 offset, u16 *return_code);
+bool is_cxl_feature_exclusive(struct cxl_feat_entry *entry);
 
 #endif
-- 
2.43.0



* [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol scrub control feature
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (13 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 14/19] cxl: Setup exclusive CXL features that are reserved for the kernel shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-24 20:38   ` Dan Williams
  2025-01-06 12:10 ` [PATCH v18 16/19] cxl/memfeature: Add CXL memory device ECS " shiju.jose
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

CXL spec 3.1 section 8.2.9.9.11.1 describes the device patrol scrub control
feature. The device patrol scrub proactively locates and makes corrections
to errors on a regular cycle.

Allow specifying the number of hours within which the patrol scrub must be
completed, subject to minimum and maximum limits reported by the device.
Also allow disabling scrubbing, trading off error rates against
performance.

Add support for patrol scrub control on CXL memory devices.
Register with the EDAC device driver, which retrieves the scrub attribute
descriptors from EDAC scrub and exposes the sysfs scrub control attributes
to userspace. For example, scrub control for the CXL memory device
"cxl_mem0" is exposed in /sys/bus/edac/devices/cxl_mem0/scrubX/.

Additionally, add support for region-based CXL memory patrol scrub control.
CXL memory regions may be interleaved across one or more CXL memory
devices. For example, region-based scrub control for "cxl_region1" is
exposed in /sys/bus/edac/devices/cxl_region1/scrubX/.

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 Documentation/edac/scrub.rst  |  66 ++++++
 drivers/cxl/Kconfig           |  17 ++
 drivers/cxl/core/Makefile     |   1 +
 drivers/cxl/core/memfeature.c | 392 ++++++++++++++++++++++++++++++++++
 drivers/cxl/core/region.c     |   6 +
 drivers/cxl/cxlmem.h          |   7 +
 drivers/cxl/mem.c             |   5 +
 include/cxl/features.h        |  16 ++
 8 files changed, 510 insertions(+)
 create mode 100644 drivers/cxl/core/memfeature.c

diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
index f86645c7f0af..80e986c57885 100644
--- a/Documentation/edac/scrub.rst
+++ b/Documentation/edac/scrub.rst
@@ -325,3 +325,69 @@ root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_d
 10800
 
 root@localhost:~# echo 0 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/enable_background
+
+2. CXL memory device patrol scrubber
+
+2.1 Device based scrubbing
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/min_cycle_duration
+
+3600
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/max_cycle_duration
+
+918000
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
+
+43200
+
+root@localhost:~# echo 54000 > /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
+
+54000
+
+root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
+
+1
+
+root@localhost:~# echo 0 > /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
+
+0
+
+2.2. Region based scrubbing
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/min_cycle_duration
+
+3600
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/max_cycle_duration
+
+918000
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
+
+43200
+
+root@localhost:~# echo 54000 > /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
+
+54000
+
+root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
+
+1
+
+root@localhost:~# echo 0 > /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
+
+0
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 0bc6a2cb8474..6078f02e883b 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -154,4 +154,21 @@ config CXL_FEATURES
 
 	  If unsure say 'y'.
 
+config CXL_RAS_FEATURES
+	tristate "CXL: Memory RAS features"
+	depends on CXL_PCI
+	depends on CXL_MEM
+	depends on EDAC
+	help
+	  The CXL memory RAS feature control is optional and allows the host
+	  to control the RAS feature configurations of CXL Type 3 devices.
+
+	  It registers with the EDAC device subsystem to expose the control
+	  attributes of the CXL memory device's RAS features to userspace,
+	  and provides interface functions for configuring those features.
+
+	  Say 'y/m/n' to enable/disable control of the CXL.mem device's RAS
+	  features. See section 8.2.9.9.11 of the CXL 3.1 specification for
+	  detailed information on CXL memory device features.
+
 endif
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 73b6348afd67..54baca513ecb 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -17,3 +17,4 @@ cxl_core-y += cdat.o
 cxl_core-y += features.o
 cxl_core-$(CONFIG_TRACING) += trace.o
 cxl_core-$(CONFIG_CXL_REGION) += region.o
+cxl_core-$(CONFIG_CXL_RAS_FEATURES) += memfeature.o
diff --git a/drivers/cxl/core/memfeature.c b/drivers/cxl/core/memfeature.c
new file mode 100644
index 000000000000..77d1bf6ce45f
--- /dev/null
+++ b/drivers/cxl/core/memfeature.c
@@ -0,0 +1,392 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * CXL memory RAS feature driver.
+ *
+ * Copyright (c) 2024 HiSilicon Limited.
+ *
+ *  - Provides functions to configure the RAS features of
+ *    CXL memory devices.
+ *  - Registers with the EDAC device subsystem to expose the
+ *    sysfs attributes for configuring the CXL memory RAS
+ *    features to userspace.
+ */
+
+#include <linux/cleanup.h>
+#include <linux/edac.h>
+#include <linux/limits.h>
+#include <cxl/features.h>
+#include <cxl.h>
+#include <cxlmem.h>
+
+#define CXL_DEV_NUM_RAS_FEATURES	1
+#define CXL_DEV_HOUR_IN_SECS	3600
+
+#define CXL_DEV_NAME_LEN	128
+
+/* CXL memory patrol scrub control functions */
+struct cxl_patrol_scrub_context {
+	u8 instance;
+	u16 get_feat_size;
+	u16 set_feat_size;
+	u8 get_version;
+	u8 set_version;
+	u16 effects;
+	struct cxl_memdev *cxlmd;
+	struct cxl_region *cxlr;
+};
+
+/**
+ * struct cxl_memdev_ps_params - CXL memory patrol scrub parameter data structure.
+ * @enable:     [IN & OUT] enable(1)/disable(0) patrol scrub.
+ * @scrub_cycle_changeable: [OUT] whether the patrol scrub cycle is changeable.
+ * @scrub_cycle_hrs:    [IN] Requested patrol scrub cycle in hours.
+ *                      [OUT] Current patrol scrub cycle in hours.
+ * @min_scrub_cycle_hrs: [OUT] minimum supported patrol scrub cycle in hours.
+ */
+struct cxl_memdev_ps_params {
+	bool enable;
+	bool scrub_cycle_changeable;
+	u8 scrub_cycle_hrs;
+	u8 min_scrub_cycle_hrs;
+};
+
+enum cxl_scrub_param {
+	CXL_PS_PARAM_ENABLE,
+	CXL_PS_PARAM_SCRUB_CYCLE,
+};
+
+#define CXL_MEMDEV_PS_SCRUB_CYCLE_CHANGE_CAP_MASK	BIT(0)
+#define	CXL_MEMDEV_PS_SCRUB_CYCLE_REALTIME_REPORT_CAP_MASK	BIT(1)
+#define	CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK	GENMASK(7, 0)
+#define	CXL_MEMDEV_PS_MIN_SCRUB_CYCLE_MASK	GENMASK(15, 8)
+#define	CXL_MEMDEV_PS_FLAG_ENABLED_MASK	BIT(0)
+
+struct cxl_memdev_ps_rd_attrs {
+	u8 scrub_cycle_cap;
+	__le16 scrub_cycle_hrs;
+	u8 scrub_flags;
+}  __packed;
+
+struct cxl_memdev_ps_wr_attrs {
+	u8 scrub_cycle_hrs;
+	u8 scrub_flags;
+}  __packed;
+
+static int cxl_mem_ps_get_attrs(struct cxl_mailbox *cxl_mbox,
+				struct cxl_memdev_ps_params *params)
+{
+	size_t rd_data_size = sizeof(struct cxl_memdev_ps_rd_attrs);
+	u16 scrub_cycle_hrs;
+	size_t data_size;
+	u16 return_code;
+	struct cxl_memdev_ps_rd_attrs *rd_attrs __free(kfree) =
+						kmalloc(rd_data_size, GFP_KERNEL);
+	if (!rd_attrs)
+		return -ENOMEM;
+
+	data_size = cxl_get_feature(cxl_mbox->features, CXL_FEAT_PATROL_SCRUB_UUID,
+				    CXL_GET_FEAT_SEL_CURRENT_VALUE,
+				    rd_attrs, rd_data_size, 0, &return_code);
+	if (!data_size || return_code != CXL_MBOX_CMD_RC_SUCCESS)
+		return -EIO;
+
+	params->scrub_cycle_changeable = FIELD_GET(CXL_MEMDEV_PS_SCRUB_CYCLE_CHANGE_CAP_MASK,
+						   rd_attrs->scrub_cycle_cap);
+	params->enable = FIELD_GET(CXL_MEMDEV_PS_FLAG_ENABLED_MASK,
+				   rd_attrs->scrub_flags);
+	scrub_cycle_hrs = le16_to_cpu(rd_attrs->scrub_cycle_hrs);
+	params->scrub_cycle_hrs = FIELD_GET(CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK,
+					    scrub_cycle_hrs);
+	params->min_scrub_cycle_hrs = FIELD_GET(CXL_MEMDEV_PS_MIN_SCRUB_CYCLE_MASK,
+						scrub_cycle_hrs);
+
+	return 0;
+}
+
+static int cxl_ps_get_attrs(struct cxl_patrol_scrub_context *cxl_ps_ctx,
+			    struct cxl_memdev_ps_params *params)
+{
+	struct cxl_memdev *cxlmd;
+	u16 min_scrub_cycle = 0;
+	int i, ret;
+
+	if (cxl_ps_ctx->cxlr) {
+		struct cxl_region *cxlr = cxl_ps_ctx->cxlr;
+		struct cxl_region_params *p = &cxlr->params;
+
+		for (i = p->interleave_ways - 1; i >= 0; i--) {
+			struct cxl_endpoint_decoder *cxled = p->targets[i];
+
+			cxlmd = cxled_to_memdev(cxled);
+			ret = cxl_mem_ps_get_attrs(&cxlmd->cxlds->cxl_mbox, params);
+			if (ret)
+				return ret;
+
+			if (params->min_scrub_cycle_hrs > min_scrub_cycle)
+				min_scrub_cycle = params->min_scrub_cycle_hrs;
+		}
+		params->min_scrub_cycle_hrs = min_scrub_cycle;
+		return 0;
+	}
+	cxlmd = cxl_ps_ctx->cxlmd;
+
+	return cxl_mem_ps_get_attrs(&cxlmd->cxlds->cxl_mbox, params);
+}
+
+static int cxl_mem_ps_set_attrs(struct device *dev,
+				struct cxl_patrol_scrub_context *cxl_ps_ctx,
+				struct cxl_mailbox *cxl_mbox,
+				struct cxl_memdev_ps_params *params,
+				enum cxl_scrub_param param_type)
+{
+	struct cxl_memdev_ps_wr_attrs wr_attrs;
+	struct cxl_memdev_ps_params rd_params;
+	u16 return_code;
+	int ret;
+
+	ret = cxl_mem_ps_get_attrs(cxl_mbox, &rd_params);
+	if (ret) {
+		dev_err(dev, "Get CXL memdev patrol scrub params failed, ret=%d\n",
+			ret);
+		return ret;
+	}
+
+	switch (param_type) {
+	case CXL_PS_PARAM_ENABLE:
+		wr_attrs.scrub_flags = FIELD_PREP(CXL_MEMDEV_PS_FLAG_ENABLED_MASK,
+						  params->enable);
+		wr_attrs.scrub_cycle_hrs = FIELD_PREP(CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK,
+						      rd_params.scrub_cycle_hrs);
+		break;
+	case CXL_PS_PARAM_SCRUB_CYCLE:
+		if (params->scrub_cycle_hrs < rd_params.min_scrub_cycle_hrs) {
+			dev_err(dev, "Invalid CXL patrol scrub cycle (%d) to set\n",
+				params->scrub_cycle_hrs);
+			dev_err(dev, "Minimum supported CXL patrol scrub cycle is %d hours\n",
+				rd_params.min_scrub_cycle_hrs);
+			return -EINVAL;
+		}
+		wr_attrs.scrub_cycle_hrs = FIELD_PREP(CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK,
+						      params->scrub_cycle_hrs);
+		wr_attrs.scrub_flags = FIELD_PREP(CXL_MEMDEV_PS_FLAG_ENABLED_MASK,
+						  rd_params.enable);
+		break;
+	}
+
+	ret = cxl_set_feature(cxl_mbox->features, CXL_FEAT_PATROL_SCRUB_UUID,
+			      cxl_ps_ctx->set_version,
+			      &wr_attrs, sizeof(wr_attrs),
+			      CXL_SET_FEAT_FLAG_DATA_SAVED_ACROSS_RESET,
+			      0, &return_code);
+	if (ret || return_code != CXL_MBOX_CMD_RC_SUCCESS) {
+		dev_err(dev, "CXL patrol scrub set feature failed ret=%d return_code=%u\n",
+			ret, return_code);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int cxl_ps_set_attrs(struct device *dev,
+			    struct cxl_patrol_scrub_context *cxl_ps_ctx,
+			    struct cxl_memdev_ps_params *params,
+			    enum cxl_scrub_param param_type)
+{
+	struct cxl_memdev *cxlmd;
+	int ret, i;
+
+	if (cxl_ps_ctx->cxlr) {
+		struct cxl_region *cxlr = cxl_ps_ctx->cxlr;
+		struct cxl_region_params *p = &cxlr->params;
+
+		for (i = p->interleave_ways - 1; i >= 0; i--) {
+			struct cxl_endpoint_decoder *cxled = p->targets[i];
+
+			cxlmd = cxled_to_memdev(cxled);
+			ret = cxl_mem_ps_set_attrs(dev, cxl_ps_ctx, &cxlmd->cxlds->cxl_mbox,
+						   params, param_type);
+			if (ret)
+				return ret;
+		}
+		return 0;
+	}
+	cxlmd = cxl_ps_ctx->cxlmd;
+
+	return cxl_mem_ps_set_attrs(dev, cxl_ps_ctx, &cxlmd->cxlds->cxl_mbox,
+				    params, param_type);
+}
+
+static int cxl_patrol_scrub_get_enabled_bg(struct device *dev, void *drv_data, bool *enabled)
+{
+	struct cxl_patrol_scrub_context *ctx = drv_data;
+	struct cxl_memdev_ps_params params;
+	int ret;
+
+	ret = cxl_ps_get_attrs(ctx, &params);
+	if (ret)
+		return ret;
+
+	*enabled = params.enable;
+
+	return 0;
+}
+
+static int cxl_patrol_scrub_set_enabled_bg(struct device *dev, void *drv_data, bool enable)
+{
+	struct cxl_patrol_scrub_context *ctx = drv_data;
+	struct cxl_memdev_ps_params params = {
+		.enable = enable,
+	};
+
+	return cxl_ps_set_attrs(dev, ctx, &params, CXL_PS_PARAM_ENABLE);
+}
+
+static int cxl_patrol_scrub_read_min_scrub_cycle(struct device *dev, void *drv_data,
+						 u32 *min)
+{
+	struct cxl_patrol_scrub_context *ctx = drv_data;
+	struct cxl_memdev_ps_params params;
+	int ret;
+
+	ret = cxl_ps_get_attrs(ctx, &params);
+	if (ret)
+		return ret;
+	*min = params.min_scrub_cycle_hrs * CXL_DEV_HOUR_IN_SECS;
+
+	return 0;
+}
+
+static int cxl_patrol_scrub_read_max_scrub_cycle(struct device *dev, void *drv_data,
+						 u32 *max)
+{
+	*max = U8_MAX * CXL_DEV_HOUR_IN_SECS; /* Max set by register size */
+
+	return 0;
+}
+
+static int cxl_patrol_scrub_read_scrub_cycle(struct device *dev, void *drv_data,
+					     u32 *scrub_cycle_secs)
+{
+	struct cxl_patrol_scrub_context *ctx = drv_data;
+	struct cxl_memdev_ps_params params;
+	int ret;
+
+	ret = cxl_ps_get_attrs(ctx, &params);
+	if (ret)
+		return ret;
+
+	*scrub_cycle_secs = params.scrub_cycle_hrs * CXL_DEV_HOUR_IN_SECS;
+
+	return 0;
+}
+
+static int cxl_patrol_scrub_write_scrub_cycle(struct device *dev, void *drv_data,
+					      u32 scrub_cycle_secs)
+{
+	struct cxl_patrol_scrub_context *ctx = drv_data;
+	struct cxl_memdev_ps_params params = {
+		.scrub_cycle_hrs = scrub_cycle_secs / CXL_DEV_HOUR_IN_SECS,
+	};
+
+	return cxl_ps_set_attrs(dev, ctx, &params, CXL_PS_PARAM_SCRUB_CYCLE);
+}
+
+static const struct edac_scrub_ops cxl_ps_scrub_ops = {
+	.get_enabled_bg = cxl_patrol_scrub_get_enabled_bg,
+	.set_enabled_bg = cxl_patrol_scrub_set_enabled_bg,
+	.get_min_cycle = cxl_patrol_scrub_read_min_scrub_cycle,
+	.get_max_cycle = cxl_patrol_scrub_read_max_scrub_cycle,
+	.get_cycle_duration = cxl_patrol_scrub_read_scrub_cycle,
+	.set_cycle_duration = cxl_patrol_scrub_write_scrub_cycle,
+};
+
+static int cxl_memdev_scrub_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr,
+				 struct edac_dev_feature *ras_feature, u8 scrub_inst)
+{
+	struct cxl_patrol_scrub_context *cxl_ps_ctx;
+	struct cxl_feat_entry *feat_entry;
+	struct cxl_mailbox *cxl_mbox;
+	struct cxl_dev_state *cxlds;
+	int i;
+
+	if (cxlr) {
+		struct cxl_region_params *p = &cxlr->params;
+
+		for (i = p->interleave_ways - 1; i >= 0; i--) {
+			struct cxl_endpoint_decoder *cxled = p->targets[i];
+
+			cxlmd = cxled_to_memdev(cxled);
+			cxlds = cxlmd->cxlds;
+			cxl_mbox = &cxlds->cxl_mbox;
+			feat_entry = cxl_get_supported_feature_entry(cxl_mbox->features,
+								     &CXL_FEAT_PATROL_SCRUB_UUID);
+			if (IS_ERR(feat_entry))
+				return -EOPNOTSUPP;
+
+			if (!(le32_to_cpu(feat_entry->flags) & CXL_FEAT_ENTRY_FLAG_CHANGABLE))
+				return -EOPNOTSUPP;
+		}
+	} else {
+		cxlds = cxlmd->cxlds;
+		cxl_mbox = &cxlds->cxl_mbox;
+		feat_entry = cxl_get_supported_feature_entry(cxl_mbox->features,
+							     &CXL_FEAT_PATROL_SCRUB_UUID);
+		if (IS_ERR(feat_entry))
+			return -EOPNOTSUPP;
+
+		if (!(le32_to_cpu(feat_entry->flags) & CXL_FEAT_ENTRY_FLAG_CHANGABLE))
+			return -EOPNOTSUPP;
+	}
+
+	cxl_ps_ctx = devm_kzalloc(&cxlmd->dev, sizeof(*cxl_ps_ctx), GFP_KERNEL);
+	if (!cxl_ps_ctx)
+		return -ENOMEM;
+
+	*cxl_ps_ctx = (struct cxl_patrol_scrub_context) {
+		.get_feat_size = le16_to_cpu(feat_entry->get_feat_size),
+		.set_feat_size = le16_to_cpu(feat_entry->set_feat_size),
+		.get_version = feat_entry->get_feat_ver,
+		.set_version = feat_entry->set_feat_ver,
+		.effects = le16_to_cpu(feat_entry->effects),
+		.instance = scrub_inst,
+	};
+	if (cxlr)
+		cxl_ps_ctx->cxlr = cxlr;
+	else
+		cxl_ps_ctx->cxlmd = cxlmd;
+
+	ras_feature->ft_type = RAS_FEAT_SCRUB;
+	ras_feature->instance = cxl_ps_ctx->instance;
+	ras_feature->scrub_ops = &cxl_ps_scrub_ops;
+	ras_feature->ctx = cxl_ps_ctx;
+
+	return 0;
+}
+
+int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr)
+{
+	struct edac_dev_feature ras_features[CXL_DEV_NUM_RAS_FEATURES];
+	char cxl_dev_name[CXL_DEV_NAME_LEN];
+	int num_ras_features = 0;
+	u8 scrub_inst = 0;
+	int rc;
+
+	rc = cxl_memdev_scrub_init(cxlmd, cxlr, &ras_features[num_ras_features],
+				   scrub_inst);
+	if (rc < 0)
+		return rc;
+
+	scrub_inst++;
+	num_ras_features++;
+
+	if (cxlr)
+		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
+			 "cxl_region%d", cxlr->id);
+	else
+		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
+			 "%s_%s", "cxl", dev_name(&cxlmd->dev));
+
+	return edac_dev_register(&cxlmd->dev, cxl_dev_name, NULL,
+				 num_ras_features, ras_features);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_mem_ras_features_init, "CXL");
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index b98b1ccffd1c..c2be70cd87f8 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3449,6 +3449,12 @@ static int cxl_region_probe(struct device *dev)
 					p->res->start, p->res->end, cxlr,
 					is_system_ram) > 0)
 			return 0;
+
+		rc = cxl_mem_ras_features_init(NULL, cxlr);
+		if (rc)
+			dev_warn(&cxlr->dev, "CXL RAS features init for region_id=%d failed\n",
+				 cxlr->id);
+
 		return devm_cxl_add_dax_region(cxlr);
 	default:
 		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 55c55685cb39..2b02e47cd7e7 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -800,6 +800,13 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
 int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
 int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
 
+#if IS_ENABLED(CONFIG_CXL_RAS_FEATURES)
+int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr);
+#else
+static inline int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr)
+{ return 0; }
+#endif
+
 #ifdef CONFIG_CXL_SUSPEND
 void cxl_mem_active_inc(void);
 void cxl_mem_active_dec(void);
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 2f03a4d5606e..d236b4b8a93c 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -116,6 +116,10 @@ static int cxl_mem_probe(struct device *dev)
 	if (!cxlds->media_ready)
 		return -EBUSY;
 
+	rc = cxl_mem_ras_features_init(cxlmd, NULL);
+	if (rc)
+		dev_warn(&cxlmd->dev, "CXL RAS features init failed\n");
+
 	/*
 	 * Someone is trying to reattach this device after it lost its port
 	 * connection (an endpoint port previously registered by this memdev was
@@ -259,3 +263,4 @@ MODULE_ALIAS_CXL(CXL_DEVICE_MEMORY_EXPANDER);
  * endpoint registration.
  */
 MODULE_SOFTDEP("pre: cxl_port");
+MODULE_SOFTDEP("pre: cxl_features");
diff --git a/include/cxl/features.h b/include/cxl/features.h
index adff3496b4be..d1d1c5b7efc1 100644
--- a/include/cxl/features.h
+++ b/include/cxl/features.h
@@ -60,6 +60,22 @@ struct cxl_mbox_get_sup_feats_in {
 	u8 reserved[2];
 } __packed;
 
+/* Supported Feature Entry : Payload out attribute flags */
+#define CXL_FEAT_ENTRY_FLAG_CHANGABLE  BIT(0)
+#define CXL_FEAT_ENTRY_FLAG_DEEPEST_RESET_PERSISTENCE_MASK     GENMASK(3, 1)
+#define CXL_FEAT_ENTRY_FLAG_PERSIST_ACROSS_FIRMWARE_UPDATE     BIT(4)
+#define CXL_FEAT_ENTRY_FLAG_SUPPORT_DEFAULT_SELECTION  BIT(5)
+#define CXL_FEAT_ENTRY_FLAG_SUPPORT_SAVED_SELECTION    BIT(6)
+
+enum cxl_feat_attr_value_persistence {
+	CXL_FEAT_ATTR_VALUE_PERSISTENCE_NONE,
+	CXL_FEAT_ATTR_VALUE_PERSISTENCE_CXL_RESET,
+	CXL_FEAT_ATTR_VALUE_PERSISTENCE_HOT_RESET,
+	CXL_FEAT_ATTR_VALUE_PERSISTENCE_WARM_RESET,
+	CXL_FEAT_ATTR_VALUE_PERSISTENCE_COLD_RESET,
+	CXL_FEAT_ATTR_VALUE_PERSISTENCE_MAX
+};
+
 struct cxl_feat_entry {
 	uuid_t uuid;
 	__le16 id;
-- 
2.43.0



* [PATCH v18 16/19] cxl/memfeature: Add CXL memory device ECS control feature
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)

From: Shiju Jose <shiju.jose@huawei.com>

CXL spec 3.1 section 8.2.9.9.11.2 describes the DDR5 ECS (Error Check
Scrub) control feature.
Error Check Scrub (ECS) is a feature defined in the JEDEC DDR5 SDRAM
Specification (JESD79-5). It allows the DRAM to internally read data,
correct single-bit errors, and write the corrected data bits back to the
DRAM array, while providing transparency into error counts.

The ECS control allows the requester to change the log entry type, the ECS
threshold count (provided the request falls within the limits specified in
DDR5 mode registers), switch between codeword mode and row count mode, and
reset the ECS counter.

Register with the EDAC device driver, which retrieves the ECS attribute
descriptors from the EDAC ECS and exposes the ECS control attributes to
userspace via sysfs. For example, the ECS controls for memory media FRU0
of the CXL mem0 device are located at /sys/bus/edac/devices/cxl_mem0/ecs_fru0/

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 drivers/cxl/core/memfeature.c | 372 +++++++++++++++++++++++++++++++++-
 1 file changed, 369 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/core/memfeature.c b/drivers/cxl/core/memfeature.c
index 77d1bf6ce45f..eed9d57aa691 100644
--- a/drivers/cxl/core/memfeature.c
+++ b/drivers/cxl/core/memfeature.c
@@ -18,7 +18,7 @@
 #include <cxl.h>
 #include <cxlmem.h>
 
-#define CXL_DEV_NUM_RAS_FEATURES	1
+#define CXL_DEV_NUM_RAS_FEATURES	2
 #define CXL_DEV_HOUR_IN_SECS	3600
 
 #define CXL_DEV_NAME_LEN	128
@@ -300,6 +300,311 @@ static const struct edac_scrub_ops cxl_ps_scrub_ops = {
 	.set_cycle_duration = cxl_patrol_scrub_write_scrub_cycle,
 };
 
+/* CXL DDR5 ECS control definitions */
+struct cxl_ecs_context {
+	u16 num_media_frus;
+	u16 get_feat_size;
+	u16 set_feat_size;
+	u8 get_version;
+	u8 set_version;
+	u16 effects;
+	struct cxl_memdev *cxlmd;
+};
+
+enum {
+	CXL_ECS_PARAM_LOG_ENTRY_TYPE,
+	CXL_ECS_PARAM_THRESHOLD,
+	CXL_ECS_PARAM_MODE,
+	CXL_ECS_PARAM_RESET_COUNTER,
+};
+
+#define CXL_ECS_LOG_ENTRY_TYPE_MASK	GENMASK(1, 0)
+#define CXL_ECS_REALTIME_REPORT_CAP_MASK	BIT(0)
+#define CXL_ECS_THRESHOLD_COUNT_MASK	GENMASK(2, 0)
+#define CXL_ECS_COUNT_MODE_MASK	BIT(3)
+#define CXL_ECS_RESET_COUNTER_MASK	BIT(4)
+#define CXL_ECS_RESET_COUNTER	1
+
+enum {
+	ECS_THRESHOLD_256 = 256,
+	ECS_THRESHOLD_1024 = 1024,
+	ECS_THRESHOLD_4096 = 4096,
+};
+
+enum {
+	ECS_THRESHOLD_IDX_256 = 3,
+	ECS_THRESHOLD_IDX_1024 = 4,
+	ECS_THRESHOLD_IDX_4096 = 5,
+};
+
+static const u16 ecs_supp_threshold[] = {
+	[ECS_THRESHOLD_IDX_256] = 256,
+	[ECS_THRESHOLD_IDX_1024] = 1024,
+	[ECS_THRESHOLD_IDX_4096] = 4096,
+};
+
+enum {
+	ECS_LOG_ENTRY_TYPE_DRAM = 0x0,
+	ECS_LOG_ENTRY_TYPE_MEM_MEDIA_FRU = 0x1,
+};
+
+enum cxl_ecs_count_mode {
+	ECS_MODE_COUNTS_ROWS = 0,
+	ECS_MODE_COUNTS_CODEWORDS = 1,
+};
+
+/**
+ * struct cxl_ecs_params - CXL memory DDR5 ECS parameter data structure.
+ * @log_entry_type: ECS log entry type, per DRAM or per memory media FRU.
+ * @threshold: ECS threshold count per GB of memory cells.
+ * @count_mode: codeword/row count mode
+ *		0 : ECS counts rows with errors
+ *		1 : ECS counts codewords with errors
+ * @reset_counter: [IN] reset ECS counter to default value.
+ */
+struct cxl_ecs_params {
+	u8 log_entry_type;
+	u16 threshold;
+	enum cxl_ecs_count_mode count_mode;
+	u8 reset_counter;
+};
+
+struct cxl_ecs_fru_rd_attrs {
+	u8 ecs_cap;
+	__le16 ecs_config;
+	u8 ecs_flags;
+}  __packed;
+
+struct cxl_ecs_rd_attrs {
+	u8 ecs_log_cap;
+	struct cxl_ecs_fru_rd_attrs fru_attrs[];
+}  __packed;
+
+struct cxl_ecs_fru_wr_attrs {
+	__le16 ecs_config;
+} __packed;
+
+struct cxl_ecs_wr_attrs {
+	u8 ecs_log_cap;
+	struct cxl_ecs_fru_wr_attrs fru_attrs[];
+}  __packed;
+
+/* CXL DDR5 ECS control functions */
+static int cxl_mem_ecs_get_attrs(struct device *dev,
+				 struct cxl_ecs_context *cxl_ecs_ctx,
+				 int fru_id, struct cxl_ecs_params *params)
+{
+	struct cxl_memdev *cxlmd = cxl_ecs_ctx->cxlmd;
+	struct cxl_mailbox *cxl_mbox = &cxlmd->cxlds->cxl_mbox;
+	struct cxl_ecs_fru_rd_attrs *fru_rd_attrs;
+	size_t rd_data_size;
+	u8 threshold_index;
+	size_t data_size;
+	u16 return_code;
+	u16 ecs_config;
+
+	rd_data_size = cxl_ecs_ctx->get_feat_size;
+
+	struct cxl_ecs_rd_attrs *rd_attrs __free(kfree) =
+					kmalloc(rd_data_size, GFP_KERNEL);
+	if (!rd_attrs)
+		return -ENOMEM;
+
+	params->log_entry_type = 0;
+	params->threshold = 0;
+	params->count_mode = 0;
+	data_size = cxl_get_feature(cxl_mbox->features, CXL_FEAT_ECS_UUID,
+				    CXL_GET_FEAT_SEL_CURRENT_VALUE,
+				    rd_attrs, rd_data_size, 0, &return_code);
+	if (!data_size || return_code != CXL_MBOX_CMD_RC_SUCCESS)
+		return -EIO;
+
+	fru_rd_attrs = rd_attrs->fru_attrs;
+	params->log_entry_type = FIELD_GET(CXL_ECS_LOG_ENTRY_TYPE_MASK,
+					   rd_attrs->ecs_log_cap);
+	ecs_config = le16_to_cpu(fru_rd_attrs[fru_id].ecs_config);
+	threshold_index = FIELD_GET(CXL_ECS_THRESHOLD_COUNT_MASK,
+				    ecs_config);
+	params->threshold = ecs_supp_threshold[threshold_index];
+	params->count_mode = FIELD_GET(CXL_ECS_COUNT_MODE_MASK,
+				       ecs_config);
+	return 0;
+}
+
+static int cxl_mem_ecs_set_attrs(struct device *dev,
+				 struct cxl_ecs_context *cxl_ecs_ctx,
+				 int fru_id, struct cxl_ecs_params *params,
+				 u8 param_type)
+{
+	struct cxl_memdev *cxlmd = cxl_ecs_ctx->cxlmd;
+	struct cxl_mailbox *cxl_mbox = &cxlmd->cxlds->cxl_mbox;
+	struct cxl_ecs_fru_rd_attrs *fru_rd_attrs;
+	struct cxl_ecs_fru_wr_attrs *fru_wr_attrs;
+	size_t rd_data_size, wr_data_size;
+	u16 num_media_frus, count;
+	size_t data_size;
+	u16 return_code;
+	u16 ecs_config;
+	int ret;
+
+	num_media_frus = cxl_ecs_ctx->num_media_frus;
+	rd_data_size = cxl_ecs_ctx->get_feat_size;
+	wr_data_size = cxl_ecs_ctx->set_feat_size;
+	struct cxl_ecs_rd_attrs *rd_attrs __free(kfree) =
+				kmalloc(rd_data_size, GFP_KERNEL);
+	if (!rd_attrs)
+		return -ENOMEM;
+
+	data_size = cxl_get_feature(cxl_mbox->features, CXL_FEAT_ECS_UUID,
+				    CXL_GET_FEAT_SEL_CURRENT_VALUE,
+				    rd_attrs, rd_data_size, 0, &return_code);
+	if (!data_size || return_code != CXL_MBOX_CMD_RC_SUCCESS)
+		return -EIO;
+
+	struct cxl_ecs_wr_attrs *wr_attrs __free(kfree) =
+					kmalloc(wr_data_size, GFP_KERNEL);
+	if (!wr_attrs)
+		return -ENOMEM;
+
+	/*
+	 * Fill writable attributes from the current attributes read
+	 * for all the media FRUs.
+	 */
+	fru_rd_attrs = rd_attrs->fru_attrs;
+	fru_wr_attrs = wr_attrs->fru_attrs;
+	wr_attrs->ecs_log_cap = rd_attrs->ecs_log_cap;
+	for (count = 0; count < num_media_frus; count++)
+		fru_wr_attrs[count].ecs_config = fru_rd_attrs[count].ecs_config;
+
+	/* Fill attribute to be set for the media FRU */
+	ecs_config = le16_to_cpu(fru_rd_attrs[fru_id].ecs_config);
+	switch (param_type) {
+	case CXL_ECS_PARAM_LOG_ENTRY_TYPE:
+		if (params->log_entry_type != ECS_LOG_ENTRY_TYPE_DRAM &&
+		    params->log_entry_type != ECS_LOG_ENTRY_TYPE_MEM_MEDIA_FRU) {
+			dev_err(dev,
+				"Invalid CXL ECS scrub log entry type (%d) to set\n",
+				params->log_entry_type);
+			dev_err(dev,
+				"Log Entry Type 0: per DRAM  1: per Memory Media FRU\n");
+			return -EINVAL;
+		}
+		wr_attrs->ecs_log_cap = FIELD_PREP(CXL_ECS_LOG_ENTRY_TYPE_MASK,
+						   params->log_entry_type);
+		break;
+	case CXL_ECS_PARAM_THRESHOLD:
+		ecs_config &= ~CXL_ECS_THRESHOLD_COUNT_MASK;
+		switch (params->threshold) {
+		case ECS_THRESHOLD_256:
+			ecs_config |= FIELD_PREP(CXL_ECS_THRESHOLD_COUNT_MASK,
+						 ECS_THRESHOLD_IDX_256);
+			break;
+		case ECS_THRESHOLD_1024:
+			ecs_config |= FIELD_PREP(CXL_ECS_THRESHOLD_COUNT_MASK,
+						 ECS_THRESHOLD_IDX_1024);
+			break;
+		case ECS_THRESHOLD_4096:
+			ecs_config |= FIELD_PREP(CXL_ECS_THRESHOLD_COUNT_MASK,
+						 ECS_THRESHOLD_IDX_4096);
+			break;
+		default:
+			dev_err(dev,
+				"Invalid CXL ECS scrub threshold count(%d) to set\n",
+				params->threshold);
+			dev_err(dev,
+				"Supported scrub threshold counts: %u, %u, %u\n",
+				ECS_THRESHOLD_256, ECS_THRESHOLD_1024, ECS_THRESHOLD_4096);
+			return -EINVAL;
+		}
+		break;
+	case CXL_ECS_PARAM_MODE:
+		if (params->count_mode != ECS_MODE_COUNTS_ROWS &&
+		    params->count_mode != ECS_MODE_COUNTS_CODEWORDS) {
+			dev_err(dev,
+				"Invalid CXL ECS scrub mode(%d) to set\n",
+				params->count_mode);
+			dev_err(dev,
+				"Supported ECS modes: 0: ECS counts rows with errors, 1: ECS counts codewords with errors\n");
+			return -EINVAL;
+		}
+		ecs_config &= ~CXL_ECS_COUNT_MODE_MASK;
+		ecs_config |= FIELD_PREP(CXL_ECS_COUNT_MODE_MASK, params->count_mode);
+		break;
+	case CXL_ECS_PARAM_RESET_COUNTER:
+		if (params->reset_counter != CXL_ECS_RESET_COUNTER)
+			return -EINVAL;
+
+		ecs_config &= ~CXL_ECS_RESET_COUNTER_MASK;
+		ecs_config |= FIELD_PREP(CXL_ECS_RESET_COUNTER_MASK, params->reset_counter);
+		break;
+	default:
+		dev_err(dev, "Invalid CXL ECS parameter to set\n");
+		return -EINVAL;
+	}
+	fru_wr_attrs[fru_id].ecs_config = cpu_to_le16(ecs_config);
+
+	ret = cxl_set_feature(cxl_mbox->features, CXL_FEAT_ECS_UUID,
+			      cxl_ecs_ctx->set_version,
+			      wr_attrs, wr_data_size,
+			      CXL_SET_FEAT_FLAG_DATA_SAVED_ACROSS_RESET,
+			      0, &return_code);
+	if (ret || return_code != CXL_MBOX_CMD_RC_SUCCESS) {
+		dev_err(dev, "CXL ECS set feature failed ret=%d return_code=%u\n",
+			ret, return_code);
+		return ret;
+	}
+
+	return 0;
+}
+
+#define CXL_ECS_GET_ATTR(attrib)						\
+static int cxl_ecs_get_##attrib(struct device *dev, void *drv_data,		\
+				int fru_id, u32 *val)				\
+{										\
+	struct cxl_ecs_context *ctx = drv_data;					\
+	struct cxl_ecs_params params;						\
+	int ret;								\
+										\
+	ret = cxl_mem_ecs_get_attrs(dev, ctx, fru_id, &params);			\
+	if (ret)								\
+		return ret;							\
+										\
+	*val = params.attrib;							\
+										\
+	return 0;								\
+}
+
+CXL_ECS_GET_ATTR(log_entry_type)
+CXL_ECS_GET_ATTR(count_mode)
+CXL_ECS_GET_ATTR(threshold)
+
+#define CXL_ECS_SET_ATTR(attrib, param_type)						\
+static int cxl_ecs_set_##attrib(struct device *dev, void *drv_data,			\
+				int fru_id, u32 val)					\
+{											\
+	struct cxl_ecs_context *ctx = drv_data;						\
+	struct cxl_ecs_params params = {						\
+		.attrib = val,								\
+	};										\
+											\
+	return cxl_mem_ecs_set_attrs(dev, ctx, fru_id, &params, (param_type));		\
+}
+CXL_ECS_SET_ATTR(log_entry_type, CXL_ECS_PARAM_LOG_ENTRY_TYPE)
+CXL_ECS_SET_ATTR(count_mode, CXL_ECS_PARAM_MODE)
+CXL_ECS_SET_ATTR(reset_counter, CXL_ECS_PARAM_RESET_COUNTER)
+CXL_ECS_SET_ATTR(threshold, CXL_ECS_PARAM_THRESHOLD)
+
+static const struct edac_ecs_ops cxl_ecs_ops = {
+	.get_log_entry_type = cxl_ecs_get_log_entry_type,
+	.set_log_entry_type = cxl_ecs_set_log_entry_type,
+	.get_mode = cxl_ecs_get_count_mode,
+	.set_mode = cxl_ecs_set_count_mode,
+	.reset = cxl_ecs_set_reset_counter,
+	.get_threshold = cxl_ecs_get_threshold,
+	.set_threshold = cxl_ecs_set_threshold,
+};
+
 static int cxl_memdev_scrub_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr,
 				 struct edac_dev_feature *ras_feature, u8 scrub_inst)
 {
@@ -363,6 +668,52 @@ static int cxl_memdev_scrub_init(struct cxl_memdev *cxlmd, struct cxl_region *cx
 	return 0;
 }
 
+static int cxl_memdev_ecs_init(struct cxl_memdev *cxlmd,
+			       struct edac_dev_feature *ras_feature)
+{
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	struct cxl_mailbox *cxl_mbox = &cxlds->cxl_mbox;
+	struct cxl_ecs_context *cxl_ecs_ctx;
+	struct cxl_feat_entry *feat_entry;
+	int num_media_frus;
+
+	feat_entry = cxl_get_supported_feature_entry(cxl_mbox->features,
+						     &CXL_FEAT_ECS_UUID);
+	if (IS_ERR(feat_entry))
+		return -EOPNOTSUPP;
+
+	if (!(le32_to_cpu(feat_entry->flags) & CXL_FEAT_ENTRY_FLAG_CHANGABLE))
+		return -EOPNOTSUPP;
+
+	num_media_frus = (le16_to_cpu(feat_entry->get_feat_size) -
+				sizeof(struct cxl_ecs_rd_attrs)) /
+				sizeof(struct cxl_ecs_fru_rd_attrs);
+	if (!num_media_frus)
+		return -EOPNOTSUPP;
+
+	cxl_ecs_ctx = devm_kzalloc(&cxlmd->dev, sizeof(*cxl_ecs_ctx),
+				   GFP_KERNEL);
+	if (!cxl_ecs_ctx)
+		return -ENOMEM;
+
+	*cxl_ecs_ctx = (struct cxl_ecs_context) {
+		.get_feat_size = le16_to_cpu(feat_entry->get_feat_size),
+		.set_feat_size = le16_to_cpu(feat_entry->set_feat_size),
+		.get_version = feat_entry->get_feat_ver,
+		.set_version = feat_entry->set_feat_ver,
+		.effects = le16_to_cpu(feat_entry->effects),
+		.num_media_frus = num_media_frus,
+		.cxlmd = cxlmd,
+	};
+
+	ras_feature->ft_type = RAS_FEAT_ECS;
+	ras_feature->ecs_ops = &cxl_ecs_ops;
+	ras_feature->ctx = cxl_ecs_ctx;
+	ras_feature->ecs_info.num_media_frus = num_media_frus;
+
+	return 0;
+}
+
 int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr)
 {
 	struct edac_dev_feature ras_features[CXL_DEV_NUM_RAS_FEATURES];
@@ -373,19 +724,34 @@ int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr)
 
 	rc = cxl_memdev_scrub_init(cxlmd, cxlr, &ras_features[num_ras_features],
 				   scrub_inst);
+	if (rc == -EOPNOTSUPP)
+		goto feat_scrub_done;
 	if (rc < 0)
 		return rc;
 
 	scrub_inst++;
 	num_ras_features++;
 
-	if (cxlr)
+feat_scrub_done:
+	if (cxlr) {
 		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
 			 "cxl_region%d", cxlr->id);
-	else
+		goto feat_register;
+	} else {
 		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
 			 "%s_%s", "cxl", dev_name(&cxlmd->dev));
+	}
+
+	rc = cxl_memdev_ecs_init(cxlmd, &ras_features[num_ras_features]);
+	if (rc == -EOPNOTSUPP)
+		goto feat_ecs_done;
+	if (rc < 0)
+		return rc;
+
+	num_ras_features++;
 
+feat_ecs_done:
+feat_register:
 	return edac_dev_register(&cxlmd->dev, cxl_dev_name, NULL,
 				 num_ras_features, ras_features);
 }
-- 
2.43.0



* [PATCH v18 17/19] cxl/mbox: Add support for PERFORM_MAINTENANCE mailbox command
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)

From: Shiju Jose <shiju.jose@huawei.com>

Add support for PERFORM_MAINTENANCE mailbox command.

CXL spec 3.1 section 8.2.9.7.1 describes the Perform Maintenance command.
This command requests the device to execute the maintenance operation
specified by the maintenance operation class and the maintenance operation
subclass.
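The shape of that input payload can be illustrated from userspace: a 2-byte
header carrying the operation class and subclass, followed by the
maintenance-specific data. A minimal sketch (the byte values are
illustrative, not taken from the spec):

```shell
# Build an example Perform Maintenance input payload in a file: a 2-byte
# header (op class, op subclass) followed by operation-specific data.
# Octal escapes are used for portability; all values are illustrative.
payload=$(mktemp)
printf '\001\000' > "$payload"            # op_class=0x01, op_subclass=0x00
printf '\252\273\314\335' >> "$payload"   # operation-specific input data

# size_in passed to the mailbox command = header (2) + data (4) bytes
wc -c < "$payload"
```

This mirrors how cxl_do_maintenance() computes size_in as the header size
plus data_in_size before issuing the mailbox command.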

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 drivers/cxl/core/mbox.c | 34 ++++++++++++++++++++++++++++++++++
 drivers/cxl/cxlmem.h    | 17 +++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 0b4946205910..bdea003172d8 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -832,6 +832,40 @@ static const uuid_t log_uuid[] = {
 	[VENDOR_DEBUG_UUID] = DEFINE_CXL_VENDOR_DEBUG_UUID,
 };
 
+int cxl_do_maintenance(struct cxl_mailbox *cxl_mbox,
+		       u8 class, u8 subclass,
+		       void *data_in, size_t data_in_size)
+{
+	struct cxl_memdev_maintenance_pi {
+		struct cxl_mbox_do_maintenance_hdr hdr;
+		u8 data[];
+	}  __packed;
+	struct cxl_mbox_cmd mbox_cmd;
+	size_t hdr_size;
+
+	struct cxl_memdev_maintenance_pi *pi __free(kfree) =
+					kmalloc(cxl_mbox->payload_size, GFP_KERNEL);
+	if (!pi)
+		return -ENOMEM;
+
+	pi->hdr.op_class = class;
+	pi->hdr.op_subclass = subclass;
+	hdr_size = sizeof(pi->hdr);
+	/*
+	 * Check minimum mbox payload size is available for
+	 * the maintenance data transfer.
+	 */
+	if (hdr_size + data_in_size > cxl_mbox->payload_size)
+		return -ENOMEM;
+
+	memcpy(pi->data, data_in, data_in_size);
+	mbox_cmd = (struct cxl_mbox_cmd) {
+		.opcode = CXL_MBOX_OP_DO_MAINTENANCE,
+		.size_in = hdr_size + data_in_size,
+		.payload_in = pi,
+	};
+
+	return cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_do_maintenance, "CXL");
+
 /**
  * cxl_enumerate_cmds() - Enumerate commands for a device.
  * @mds: The driver data for the operation
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 2b02e47cd7e7..7ad75ab739e5 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -493,6 +493,7 @@ enum cxl_opcode {
 	CXL_MBOX_OP_GET_SUPPORTED_FEATURES	= 0x0500,
 	CXL_MBOX_OP_GET_FEATURE		= 0x0501,
 	CXL_MBOX_OP_SET_FEATURE		= 0x0502,
+	CXL_MBOX_OP_DO_MAINTENANCE	= 0x0600,
 	CXL_MBOX_OP_IDENTIFY		= 0x4000,
 	CXL_MBOX_OP_GET_PARTITION_INFO	= 0x4100,
 	CXL_MBOX_OP_SET_PARTITION_INFO	= 0x4101,
@@ -776,6 +777,19 @@ enum {
 	CXL_PMEM_SEC_PASS_USER,
 };
 
+/*
+ * Perform Maintenance CXL 3.1 Spec 8.2.9.7.1
+ */
+
+/*
+ * Perform Maintenance input payload
+ * CXL rev 3.1 section 8.2.9.7.1 Table 8-102
+ */
+struct cxl_mbox_do_maintenance_hdr {
+	u8 op_class;
+	u8 op_subclass;
+}  __packed;
+
 int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
 			  struct cxl_mbox_cmd *cmd);
 int cxl_dev_state_identify(struct cxl_memdev_state *mds);
@@ -842,4 +856,7 @@ struct cxl_hdm {
 struct seq_file;
 struct dentry *cxl_debugfs_create_dir(const char *dir);
 void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds);
+int cxl_do_maintenance(struct cxl_mailbox *cxl_mbox,
+		       u8 class, u8 subclass,
+		       void *data_in, size_t data_in_size);
 #endif /* __CXL_MEM_H__ */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v18 18/19] cxl/memfeature: Add CXL memory device soft PPR control feature
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (16 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 17/19] cxl/mbox: Add support for PERFORM_MAINTENANCE mailbox command shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-06 12:10 ` [PATCH v18 19/19] cxl/memfeature: Add CXL memory device memory sparing " shiju.jose
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Post Package Repair (PPR) maintenance operations may be supported by CXL
devices that implement CXL.mem protocol. A PPR maintenance operation
requests the CXL device to perform a repair operation on its media.
For example, a CXL device with DRAM components that support PPR features
may implement PPR Maintenance operations. DRAM components may support two
types of PPR: hard PPR (hPPR) for a permanent row repair, and soft PPR
(sPPR) for a temporary row repair. Soft PPR is much faster than hPPR,
but the repair is lost with a power cycle.

During the execution of a PPR Maintenance operation, a CXL memory device:
- May or may not retain data
- May or may not be able to process CXL.mem requests correctly, including
the ones that target the DPA involved in the repair.
These CXL Memory Device capabilities are specified by Restriction Flags
in the sPPR Feature and hPPR Feature.

A soft PPR maintenance operation may be executed at runtime, if data is
retained and CXL.mem requests are processed correctly. For CXL devices
with DRAM components, an hPPR maintenance operation may be executed only
at boot time because data would not be retained.
When a CXL device identifies an error on a memory component, the device
may inform the host about the need for a PPR maintenance operation by using
an Event Record, where the Maintenance Needed flag is set. The Event Record
specifies the DPA that should be repaired. A CXL device may not keep track
of the requests that have already been sent, and the information on which
DPA should be repaired may be lost upon power cycle.
The userspace tool requests a maintenance operation if the number of
corrected errors reported on CXL.mem media exceeds an error threshold.
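That userspace flow amounts to plain sysfs reads and writes against the
interface added by this patch. A minimal sketch, using a temporary
directory to stand in for the real
/sys/bus/edac/devices/cxl_mem0/mem_repair0 path so it can be exercised
without hardware (the DPA value is illustrative):

```shell
# Mock of /sys/bus/edac/devices/cxl_mem0/mem_repair0; on a real system
# these files are created by the EDAC memory repair driver.
repair_dir=$(mktemp -d)/mem_repair0
mkdir -p "$repair_dir"
echo 1 > "$repair_dir/dpa_support"

# 1. confirm DPA-based repair is supported
[ "$(cat "$repair_dir/dpa_support")" = "1" ] || exit 1

# 2. program the DPA from the Event Record, then trigger the repair
echo 0x300000 > "$repair_dir/dpa"
echo 1 > "$repair_dir/repair"

cat "$repair_dir/dpa"
```

On real hardware the write to the repair attribute returns an error if
resources are unavailable for the repair operation.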

CXL spec 3.1 section 8.2.9.7.1.2 describes the device's sPPR (soft PPR)
maintenance operation and section 8.2.9.7.1.3 describes the device's
hPPR (hard PPR) maintenance operation feature.

CXL spec 3.1 section 8.2.9.7.2.1 describes the sPPR feature discovery and
configuration.

CXL spec 3.1 section 8.2.9.7.2.2 describes the hPPR feature discovery and
configuration.

Add support for controlling the CXL memory device sPPR feature.
Register with the EDAC driver, which gets the memory repair attr
descriptors from the EDAC memory repair driver and exposes sysfs repair
control attributes for PPR to userspace. For example, CXL PPR control for
the CXL mem0 device is exposed in
/sys/bus/edac/devices/cxl_mem0/mem_repairX/

Tested with QEMU patch for CXL PPR feature.
https://lore.kernel.org/all/20240730045722.71482-1-dave@stgolabs.net/

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 Documentation/edac/memory_repair.rst |  58 ++++
 drivers/cxl/core/memfeature.c        | 387 ++++++++++++++++++++++++++-
 2 files changed, 444 insertions(+), 1 deletion(-)

diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst
index 2787a8a2d6ba..f40a34de32a4 100644
--- a/Documentation/edac/memory_repair.rst
+++ b/Documentation/edac/memory_repair.rst
@@ -99,3 +99,61 @@ sysfs
 Sysfs files are documented in
 
 `Documentation/ABI/testing/sysfs-edac-memory-repair`.
+
+Example
+-------
+
+The usage takes the form shown in this example:
+
+1. CXL memory device Soft Post Package Repair (Soft PPR)
+
+# read capabilities
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair0/dpa_support
+
+1
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair0/nibble_mask
+
+0x0
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair0/persist_mode
+
+0
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair0/repair_function
+
+0
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair0/repair_safe_when_in_use
+
+1
+
+# set and readback attributes
+
+root@localhost:~# echo 0x8a2d > /sys/bus/edac/devices/cxl_mem0/mem_repair0/nibble_mask
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair0/min_dpa
+
+0x0
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair0/max_dpa
+
+0xfffffff
+
+root@localhost:~# echo 0x300000 >  /sys/bus/edac/devices/cxl_mem0/mem_repair0/dpa
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair0/dpa
+
+0x300000
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair0/nibble_mask
+
+0x8a2d
+
+# issue repair operations
+
+# repair returns an error if unsupported or resources are not available
+# for the repair operation.
+
+root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_mem0/mem_repair0/repair
diff --git a/drivers/cxl/core/memfeature.c b/drivers/cxl/core/memfeature.c
index eed9d57aa691..5437df549bf1 100644
--- a/drivers/cxl/core/memfeature.c
+++ b/drivers/cxl/core/memfeature.c
@@ -14,11 +14,13 @@
 #include <linux/cleanup.h>
 #include <linux/edac.h>
 #include <linux/limits.h>
+#include <linux/unaligned.h>
 #include <cxl/features.h>
 #include <cxl.h>
 #include <cxlmem.h>
+#include "core.h"
 
-#define CXL_DEV_NUM_RAS_FEATURES	2
+#define CXL_DEV_NUM_RAS_FEATURES	3
 #define CXL_DEV_HOUR_IN_SECS	3600
 
 #define CXL_DEV_NAME_LEN	128
@@ -605,6 +607,334 @@ static const struct edac_ecs_ops cxl_ecs_ops = {
 	.set_threshold = cxl_ecs_set_threshold,
 };
 
+/* CXL memory soft PPR & hard PPR control definitions */
+struct cxl_ppr_context {
+	uuid_t repair_uuid;
+	u8 instance;
+	u16 get_feat_size;
+	u16 set_feat_size;
+	u8 get_version;
+	u8 set_version;
+	u16 effects;
+	struct cxl_memdev *cxlmd;
+	enum edac_mem_repair_function repair_function;
+	enum edac_mem_repair_persist_mode persist_mode;
+	u64 dpa;
+	u32 nibble_mask;
+};
+
+/**
+ * struct cxl_memdev_ppr_params - CXL memory PPR parameter data structure.
+ * @op_class: PPR operation class.
+ * @op_subclass: PPR operation subclass.
+ * @dpa_support: device physical address for PPR support.
+ * @media_accessible: memory media is accessible or not during PPR operation.
+ * @data_retained: data is retained or not during PPR operation.
+ * @dpa: device physical address.
+ */
+struct cxl_memdev_ppr_params {
+	u8 op_class;
+	u8 op_subclass;
+	bool dpa_support;
+	bool media_accessible;
+	bool data_retained;
+	u64 dpa;
+};
+
+enum cxl_ppr_param {
+	CXL_PPR_PARAM_DO_QUERY,
+	CXL_PPR_PARAM_DO_PPR,
+};
+
+/* See CXL rev 3.1 @8.2.9.7.2.1 Table 8-113 sPPR Feature Readable Attributes */
+/* See CXL rev 3.1 @8.2.9.7.2.2 Table 8-116 hPPR Feature Readable Attributes */
+#define CXL_MEMDEV_PPR_QUERY_RESOURCE_FLAG	BIT(0)
+
+#define CXL_MEMDEV_PPR_DEVICE_INITIATED_MASK	BIT(0)
+#define CXL_MEMDEV_PPR_FLAG_DPA_SUPPORT_MASK	BIT(0)
+#define CXL_MEMDEV_PPR_FLAG_NIBBLE_SUPPORT_MASK	BIT(1)
+#define CXL_MEMDEV_PPR_FLAG_MEM_SPARING_EV_REC_SUPPORT_MASK	BIT(2)
+
+#define CXL_MEMDEV_PPR_RESTRICTION_FLAG_MEDIA_ACCESSIBLE_MASK	BIT(0)
+#define CXL_MEMDEV_PPR_RESTRICTION_FLAG_DATA_RETAINED_MASK	BIT(2)
+
+#define CXL_MEMDEV_PPR_SPARING_EV_REC_EN_MASK	BIT(0)
+
+struct cxl_memdev_repair_rd_attrs_hdr {
+	u8 max_op_latency;
+	__le16 op_cap;
+	__le16 op_mode;
+	u8 op_class;
+	u8 op_subclass;
+	u8 rsvd[9];
+}  __packed;
+
+struct cxl_memdev_ppr_rd_attrs {
+	struct cxl_memdev_repair_rd_attrs_hdr hdr;
+	u8 ppr_flags;
+	__le16 restriction_flags;
+	u8 ppr_op_mode;
+}  __packed;
+
+/* See CXL rev 3.1 @8.2.9.7.1.2 Table 8-103 sPPR Maintenance Input Payload */
+/* See CXL rev 3.1 @8.2.9.7.1.3 Table 8-104 hPPR Maintenance Input Payload */
+struct cxl_memdev_ppr_maintenance_attrs {
+	u8 flags;
+	__le64 dpa;
+	u8 nibble_mask[3];
+}  __packed;
+
+static int cxl_mem_ppr_get_attrs(struct cxl_ppr_context *cxl_ppr_ctx,
+				 struct cxl_memdev_ppr_params *params)
+{
+	size_t rd_data_size = sizeof(struct cxl_memdev_ppr_rd_attrs);
+	struct cxl_memdev *cxlmd = cxl_ppr_ctx->cxlmd;
+	struct cxl_mailbox *cxl_mbox = &cxlmd->cxlds->cxl_mbox;
+	u16 restriction_flags;
+	size_t data_size;
+	u16 return_code;
+
+	struct cxl_memdev_ppr_rd_attrs *rd_attrs __free(kfree) =
+				kmalloc(rd_data_size, GFP_KERNEL);
+	if (!rd_attrs)
+		return -ENOMEM;
+
+	data_size = cxl_get_feature(cxl_mbox->features, cxl_ppr_ctx->repair_uuid,
+				    CXL_GET_FEAT_SEL_CURRENT_VALUE,
+				    rd_attrs, rd_data_size, 0, &return_code);
+	if (!data_size || return_code != CXL_MBOX_CMD_RC_SUCCESS)
+		return -EIO;
+
+	params->op_class = rd_attrs->hdr.op_class;
+	params->op_subclass = rd_attrs->hdr.op_subclass;
+	params->dpa_support = FIELD_GET(CXL_MEMDEV_PPR_FLAG_DPA_SUPPORT_MASK,
+					rd_attrs->ppr_flags);
+	restriction_flags = le16_to_cpu(rd_attrs->restriction_flags);
+	params->media_accessible = FIELD_GET(CXL_MEMDEV_PPR_RESTRICTION_FLAG_MEDIA_ACCESSIBLE_MASK,
+					     restriction_flags) ^ 1;
+	params->data_retained = FIELD_GET(CXL_MEMDEV_PPR_RESTRICTION_FLAG_DATA_RETAINED_MASK,
+					  restriction_flags) ^ 1;
+
+	return 0;
+}
+
+static int cxl_mem_do_ppr_op(struct device *dev,
+			     struct cxl_ppr_context *cxl_ppr_ctx,
+			     struct cxl_memdev_ppr_params *rd_params,
+			     enum cxl_ppr_param param_type)
+{
+	struct cxl_memdev_ppr_maintenance_attrs maintenance_attrs;
+	struct cxl_memdev *cxlmd = cxl_ppr_ctx->cxlmd;
+	int ret;
+
+	if (!rd_params->media_accessible || !rd_params->data_retained) {
+		/* Check if DPA is mapped */
+		if (cxl_dpa_to_region(cxlmd, cxl_ppr_ctx->dpa)) {
+			dev_err(dev, "CXL can't do PPR as DPA is mapped\n");
+			return -EBUSY;
+		}
+	}
+	memset(&maintenance_attrs, 0, sizeof(maintenance_attrs));
+	if (param_type == CXL_PPR_PARAM_DO_QUERY)
+		maintenance_attrs.flags = CXL_MEMDEV_PPR_QUERY_RESOURCE_FLAG;
+	else
+		maintenance_attrs.flags = 0;
+	maintenance_attrs.dpa = cpu_to_le64(cxl_ppr_ctx->dpa);
+	put_unaligned_le24(cxl_ppr_ctx->nibble_mask, maintenance_attrs.nibble_mask);
+	ret = cxl_do_maintenance(&cxlmd->cxlds->cxl_mbox, rd_params->op_class,
+				 rd_params->op_subclass, &maintenance_attrs,
+				 sizeof(maintenance_attrs));
+	if (ret) {
+		dev_err(dev, "CXL do PPR failed ret=%d\n", ret);
+		cxl_ppr_ctx->nibble_mask = 0;
+		cxl_ppr_ctx->dpa = 0;
+		return ret;
+	}
+
+	return 0;
+}
+
+static int cxl_mem_ppr_set_attrs(struct device *dev,
+				 struct cxl_ppr_context *cxl_ppr_ctx,
+				 enum cxl_ppr_param param_type)
+{
+	struct cxl_memdev_ppr_params rd_params;
+	int ret;
+
+	ret = cxl_mem_ppr_get_attrs(cxl_ppr_ctx, &rd_params);
+	if (ret) {
+		dev_err(dev, "Get cxlmemdev PPR params failed ret=%d\n",
+			ret);
+		return ret;
+	}
+
+	switch (param_type) {
+	case CXL_PPR_PARAM_DO_QUERY:
+	case CXL_PPR_PARAM_DO_PPR:
+		ret = down_read_interruptible(&cxl_region_rwsem);
+		if (ret)
+			return ret;
+		ret = down_read_interruptible(&cxl_dpa_rwsem);
+		if (ret) {
+			up_read(&cxl_region_rwsem);
+			return ret;
+		}
+		ret = cxl_mem_do_ppr_op(dev, cxl_ppr_ctx, &rd_params, param_type);
+		up_read(&cxl_dpa_rwsem);
+		up_read(&cxl_region_rwsem);
+		return ret;
+	default:
+		return -EINVAL;
+	}
+}
+
+static int cxl_ppr_get_repair_function(struct device *dev, void *drv_data,
+				       u32 *repair_function)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+
+	*repair_function = cxl_ppr_ctx->repair_function;
+
+	return 0;
+}
+
+static int cxl_ppr_get_persist_mode(struct device *dev, void *drv_data,
+				    u32 *persist_mode)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+
+	*persist_mode = cxl_ppr_ctx->persist_mode;
+
+	return 0;
+}
+
+static int cxl_ppr_get_dpa_support(struct device *dev, void *drv_data,
+				   u32 *dpa_support)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+	struct cxl_memdev_ppr_params params;
+	int ret;
+
+	ret = cxl_mem_ppr_get_attrs(cxl_ppr_ctx, &params);
+	if (ret)
+		return ret;
+
+	*dpa_support = params.dpa_support;
+
+	return 0;
+}
+
+static int cxl_get_ppr_safe_when_in_use(struct device *dev, void *drv_data,
+					u32 *safe)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+	struct cxl_memdev_ppr_params params;
+	int ret;
+
+	ret = cxl_mem_ppr_get_attrs(cxl_ppr_ctx, &params);
+	if (ret)
+		return ret;
+
+	*safe = params.media_accessible & params.data_retained;
+
+	return 0;
+}
+
+static int cxl_ppr_get_min_dpa(struct device *dev, void *drv_data,
+			       u64 *min_dpa)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+	struct cxl_memdev *cxlmd = cxl_ppr_ctx->cxlmd;
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+	*min_dpa = cxlds->dpa_res.start;
+
+	return 0;
+}
+
+static int cxl_ppr_get_max_dpa(struct device *dev, void *drv_data,
+			       u64 *max_dpa)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+	struct cxl_memdev *cxlmd = cxl_ppr_ctx->cxlmd;
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+	*max_dpa = cxlds->dpa_res.end;
+
+	return 0;
+}
+
+static int cxl_ppr_get_dpa(struct device *dev, void *drv_data,
+			   u64 *dpa)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+
+	*dpa = cxl_ppr_ctx->dpa;
+
+	return 0;
+}
+
+static int cxl_ppr_set_dpa(struct device *dev, void *drv_data, u64 dpa)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+	struct cxl_memdev *cxlmd = cxl_ppr_ctx->cxlmd;
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+	if (!dpa || dpa < cxlds->dpa_res.start || dpa > cxlds->dpa_res.end)
+		return -EINVAL;
+
+	cxl_ppr_ctx->dpa = dpa;
+
+	return 0;
+}
+
+static int cxl_ppr_get_nibble_mask(struct device *dev, void *drv_data,
+				   u64 *nibble_mask)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+
+	*nibble_mask = cxl_ppr_ctx->nibble_mask;
+
+	return 0;
+}
+
+static int cxl_ppr_set_nibble_mask(struct device *dev, void *drv_data, u64 nibble_mask)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+
+	cxl_ppr_ctx->nibble_mask = nibble_mask;
+
+	return 0;
+}
+
+static int cxl_do_ppr(struct device *dev, void *drv_data, u32 val)
+{
+	struct cxl_ppr_context *cxl_ppr_ctx = drv_data;
+	int ret;
+
+	if (!cxl_ppr_ctx->dpa || val != EDAC_DO_MEM_REPAIR)
+		return -EINVAL;
+
+	ret = cxl_mem_ppr_set_attrs(dev, cxl_ppr_ctx, CXL_PPR_PARAM_DO_PPR);
+
+	return ret;
+}
+
+static const struct edac_mem_repair_ops cxl_sppr_ops = {
+	.get_repair_function = cxl_ppr_get_repair_function,
+	.get_persist_mode = cxl_ppr_get_persist_mode,
+	.get_dpa_support = cxl_ppr_get_dpa_support,
+	.get_repair_safe_when_in_use = cxl_get_ppr_safe_when_in_use,
+	.get_min_dpa = cxl_ppr_get_min_dpa,
+	.get_max_dpa = cxl_ppr_get_max_dpa,
+	.get_dpa = cxl_ppr_get_dpa,
+	.set_dpa = cxl_ppr_set_dpa,
+	.get_nibble_mask = cxl_ppr_get_nibble_mask,
+	.set_nibble_mask = cxl_ppr_set_nibble_mask,
+	.do_repair = cxl_do_ppr,
+};
+
 static int cxl_memdev_scrub_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr,
 				 struct edac_dev_feature *ras_feature, u8 scrub_inst)
 {
@@ -714,11 +1044,55 @@ static int cxl_memdev_ecs_init(struct cxl_memdev *cxlmd,
 	return 0;
 }
 
+static int cxl_memdev_soft_ppr_init(struct cxl_memdev *cxlmd,
+				    struct edac_dev_feature *ras_feature,
+				    u8 repair_inst)
+{
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	struct cxl_mailbox *cxl_mbox = &cxlds->cxl_mbox;
+	struct cxl_ppr_context *cxl_sppr_ctx;
+	struct cxl_feat_entry *feat_entry;
+
+	feat_entry = cxl_get_supported_feature_entry(cxl_mbox->features,
+						     &CXL_FEAT_SPPR_UUID);
+	if (IS_ERR(feat_entry))
+		return -EOPNOTSUPP;
+
+	if (!(le32_to_cpu(feat_entry->flags) & CXL_FEAT_ENTRY_FLAG_CHANGABLE))
+		return -EOPNOTSUPP;
+
+	cxl_sppr_ctx = devm_kzalloc(&cxlmd->dev, sizeof(*cxl_sppr_ctx),
+				    GFP_KERNEL);
+	if (!cxl_sppr_ctx)
+		return -ENOMEM;
+
+	*cxl_sppr_ctx = (struct cxl_ppr_context) {
+		.repair_uuid = CXL_FEAT_SPPR_UUID,
+		.get_feat_size = le16_to_cpu(feat_entry->get_feat_size),
+		.set_feat_size = le16_to_cpu(feat_entry->set_feat_size),
+		.get_version = feat_entry->get_feat_ver,
+		.set_version = feat_entry->set_feat_ver,
+		.effects = le16_to_cpu(feat_entry->effects),
+		.cxlmd = cxlmd,
+		.repair_function = EDAC_SOFT_PPR,
+		.persist_mode = EDAC_MEM_REPAIR_SOFT,
+		.instance = repair_inst,
+	};
+
+	ras_feature->ft_type = RAS_FEAT_MEM_REPAIR;
+	ras_feature->instance = cxl_sppr_ctx->instance;
+	ras_feature->mem_repair_ops = &cxl_sppr_ops;
+	ras_feature->ctx = cxl_sppr_ctx;
+
+	return 0;
+}
+
 int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr)
 {
 	struct edac_dev_feature ras_features[CXL_DEV_NUM_RAS_FEATURES];
 	char cxl_dev_name[CXL_DEV_NAME_LEN];
 	int num_ras_features = 0;
+	u8 repair_inst = 0;
 	u8 scrub_inst = 0;
 	int rc;
 
@@ -751,6 +1125,17 @@ int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr)
 	num_ras_features++;
 
 feat_ecs_done:
+	rc = cxl_memdev_soft_ppr_init(cxlmd, &ras_features[num_ras_features],
+				      repair_inst);
+	if (rc == -EOPNOTSUPP)
+		goto feat_soft_ppr_done;
+	if (rc < 0)
+		return rc;
+
+	repair_inst++;
+	num_ras_features++;
+
+feat_soft_ppr_done:
 feat_register:
 	return edac_dev_register(&cxlmd->dev, cxl_dev_name, NULL,
 				 num_ras_features, ras_features);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v18 19/19] cxl/memfeature: Add CXL memory device memory sparing control feature
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (17 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 18/19] cxl/memfeature: Add CXL memory device soft PPR control feature shiju.jose
@ 2025-01-06 12:10 ` shiju.jose
  2025-01-13 14:46 ` [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers Mauro Carvalho Chehab
  2025-01-30 19:18 ` Daniel Ferguson
  20 siblings, 0 replies; 87+ messages in thread
From: shiju.jose @ 2025-01-06 12:10 UTC (permalink / raw)
  To: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	yazen.ghannam, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Memory sparing is defined as a repair function that replaces a portion of
memory with a portion of functional memory at that same DPA. The subclasses
for this operation vary in terms of the scope of the sparing being
performed. The cacheline sparing subclass refers to a sparing action that
can replace a full cacheline. Row sparing is provided as an alternative to
PPR sparing functions and its scope is that of a single DDR row. Bank
sparing allows an entire bank to be replaced. Rank sparing is defined as
an operation in which an entire DDR rank is replaced.

Memory sparing maintenance operations may be supported by CXL devices
that implement CXL.mem protocol. A sparing maintenance operation requests
the CXL device to perform a repair operation on its media.
For example, a CXL device with DRAM components that support memory sparing
features may implement sparing maintenance operations.

The host may issue a query command by setting the query resources flag in
the input payload (CXL spec 3.1 Table 8-105) to determine the availability
of sparing resources for a given address. In response to a query request,
the device shall report the resource availability by producing the memory
sparing event record (CXL spec 3.1 Table 8-48) in which the Channel, Rank,
Nibble Mask, Bank Group, Bank, Row, Column, Sub-Channel fields are a copy
of the values specified in the request.

During the execution of a sparing maintenance operation, a CXL memory
device:
- may not retain data
- may not be able to process CXL.mem requests correctly.
These CXL memory device capabilities are specified by restriction flags
in the memory sparing feature readable attributes.

When a CXL device identifies an error on a memory component, the device
may inform the host about the need for a memory sparing maintenance
operation by using an Event Record, where the maintenance needed flag may
be set. The event record specifies some of the DPA, Channel, Rank, Nibble
Mask, Bank Group, Bank, Row, Column, Sub-Channel fields that should be
repaired. The userspace tool requests a maintenance operation if the
number of corrected errors reported on CXL.mem media exceeds an error
threshold.

CXL spec 3.1 section 8.2.9.7.1.4 describes the device's memory sparing
maintenance operation feature.

CXL spec 3.1 section 8.2.9.7.2.3 describes the memory sparing feature
discovery and configuration.

Add support for controlling the CXL memory device memory sparing feature.
Register with the EDAC driver, which gets the memory repair attr
descriptors from the EDAC memory repair driver and exposes sysfs repair
control attributes for memory sparing to userspace. For example, CXL
memory sparing control for the CXL mem0 device is exposed in
/sys/bus/edac/devices/cxl_mem0/mem_repairX/

Use case
========
1. The CXL device identifies a failure in a memory component and reports
it to userspace in a CXL generic or DRAM trace event, with the DPA and
other attributes of the memory to repair, such as channel, rank, nibble
mask, bank group, bank, row, column, sub-channel.
2. Rasdaemon processes the trace event and issues a query request in sysfs
to check the resources available for memory sparing if either of the
following conditions is met:
 - the number of corrected errors reported on CXL.mem media exceeds an
error threshold
 - the maintenance needed flag is set in the event record.
3. The CXL device shall report the resource availability by producing the
memory sparing event record, in which the channel, rank, nibble mask, bank
group, bank, row, column, sub-channel fields are a copy of the values
specified in the request. The query resource command shall return an error
(invalid input) if the controller does not support reporting that a
resource is available.
4. Rasdaemon processes the memory sparing trace event and issues a repair
request for memory sparing.

The kernel CXL driver shall report the memory sparing event record to
userspace with the resource availability so that rasdaemon can process the
event record and issue a repair request in sysfs for the memory sparing
operation on the CXL device.
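Steps 2 and 4 of the use case reduce to sysfs writes of the attributes
carried in the sparing event record. A minimal sketch of the repair
request, using a temporary directory in place of the real
/sys/bus/edac/devices/cxl_mem0/mem_repair1 path and illustrative
attribute values:

```shell
# Mock of /sys/bus/edac/devices/cxl_mem0/mem_repair1 so the sparing
# request sequence can be exercised without hardware.
repair_dir=$(mktemp -d)/mem_repair1
mkdir -p "$repair_dir"

# attributes copied from the memory sparing event record (illustrative)
for kv in dpa=0x700000 channel=2 rank=7 bank_group=1 bank=3 \
	  row=0x240a column=11 sub_channel=5 nibble_mask=0x85c2; do
	echo "${kv#*=}" > "$repair_dir/${kv%%=*}"
done

echo 1 > "$repair_dir/repair"   # issue the sparing operation
```

On real hardware the final write returns an error if the device reports
that no sparing resources are available for the programmed address.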

Tested for memory sparing control feature with
   "hw/cxl: Add memory sparing control feature"
   Repository: "https://gitlab.com/shiju.jose/qemu.git"
   Branch: cxl-ras-features-2024-10-24

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 Documentation/edac/memory_repair.rst |  90 +++++
 drivers/cxl/core/memfeature.c        | 492 ++++++++++++++++++++++++++-
 2 files changed, 580 insertions(+), 2 deletions(-)

diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst
index f40a34de32a4..abafaa6c80e1 100644
--- a/Documentation/edac/memory_repair.rst
+++ b/Documentation/edac/memory_repair.rst
@@ -157,3 +157,93 @@ root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair0/nibble_mask
 # for the repair operation.
 
 root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_mem0/mem_repair0/repair
+
+1.2. CXL memory sparing
+
+# read capabilities
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/repair_function
+
+2
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/dpa_support
+
+1
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/persist_mode
+
+0
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/repair_safe_when_in_use
+
+1
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/min_dpa
+
+0x0
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/max_dpa
+
+0xfffffff
+
+# set and readback attributes
+
+root@localhost:~# echo 0x700000 > /sys/bus/edac/devices/cxl_mem0/mem_repair1/dpa
+
+root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_mem0/mem_repair1/bank_group
+
+root@localhost:~# echo 3 > /sys/bus/edac/devices/cxl_mem0/mem_repair1/bank
+
+root@localhost:~# echo 2 > /sys/bus/edac/devices/cxl_mem0/mem_repair1/channel
+
+root@localhost:~# echo  7 > /sys/bus/edac/devices/cxl_mem0/mem_repair1/rank
+
+root@localhost:~# echo 0x240a > /sys/bus/edac/devices/cxl_mem0/mem_repair1/row
+
+root@localhost:~# echo 5 > /sys/bus/edac/devices/cxl_mem0/mem_repair1/sub_channel
+
+root@localhost:~# echo 11 > /sys/bus/edac/devices/cxl_mem0/mem_repair1/column
+
+root@localhost:~# echo 0x85c2 > /sys/bus/edac/devices/cxl_mem0/mem_repair1/nibble_mask
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/bank_group
+
+1
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/bank
+
+3
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/channel
+
+2
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/rank
+
+7
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/row
+
+0x240a
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/sub_channel
+
+5
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/column
+
+11
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/nibble_mask
+
+0x85c2
+
+root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/dpa
+
+0x700000
+
+# issue repair operation
+# repair returns error if unsupported or resources are not available for the
+# repair operation.
+
+root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_mem0/mem_repair1/repair
diff --git a/drivers/cxl/core/memfeature.c b/drivers/cxl/core/memfeature.c
index 5437df549bf1..c9be563bb5b0 100644
--- a/drivers/cxl/core/memfeature.c
+++ b/drivers/cxl/core/memfeature.c
@@ -20,7 +20,7 @@
 #include <cxlmem.h>
 #include "core.h"
 
-#define CXL_DEV_NUM_RAS_FEATURES	3
+#define CXL_DEV_NUM_RAS_FEATURES	7
 #define CXL_DEV_HOUR_IN_SECS	3600
 
 #define CXL_DEV_NAME_LEN	128
@@ -935,6 +935,437 @@ static const struct edac_mem_repair_ops cxl_sppr_ops = {
 	.do_repair = cxl_do_ppr,
 };
 
+/* CXL memory sparing control definitions */
+enum cxl_mem_sparing_granularity {
+	CXL_MEM_SPARING_CACHELINE,
+	CXL_MEM_SPARING_ROW,
+	CXL_MEM_SPARING_BANK,
+	CXL_MEM_SPARING_RANK,
+	CXL_MEM_SPARING_MAX
+};
+
+struct cxl_mem_sparing_context {
+	uuid_t repair_uuid;
+	u8 instance;
+	u16 get_feat_size;
+	u16 set_feat_size;
+	u8 get_version;
+	u8 set_version;
+	u16 effects;
+	struct cxl_memdev *cxlmd;
+	enum edac_mem_repair_function repair_function;
+	enum edac_mem_repair_persist_mode persist_mode;
+	enum cxl_mem_sparing_granularity granularity;
+	bool dpa_support;
+	u64 dpa;
+	u8 channel;
+	u8 rank;
+	u32 nibble_mask;
+	u8 bank_group;
+	u8 bank;
+	u32 row;
+	u16 column;
+	u8 sub_channel;
+};
+
+struct cxl_memdev_sparing_params {
+	u8 op_class;
+	u8 op_subclass;
+	bool cap_safe_when_in_use;
+	bool cap_hard_sparing;
+	bool cap_soft_sparing;
+};
+
+enum cxl_mem_sparing_param_type {
+	CXL_MEM_SPARING_PARAM_DO_QUERY,
+	CXL_MEM_SPARING_PARAM_DO_REPAIR,
+};
+
+#define CXL_MEMDEV_SPARING_RD_CAP_SAFE_IN_USE_MASK	BIT(0)
+#define CXL_MEMDEV_SPARING_RD_CAP_HARD_SPARING_MASK	BIT(1)
+#define CXL_MEMDEV_SPARING_RD_CAP_SOFT_SPARING_MASK	BIT(2)
+
+#define CXL_MEMDEV_SPARING_WR_DEVICE_INITIATED_MASK	BIT(0)
+
+#define CXL_MEMDEV_SPARING_QUERY_RESOURCE_FLAG	BIT(0)
+#define CXL_MEMDEV_SET_HARD_SPARING_FLAG	BIT(1)
+#define CXL_MEMDEV_SPARING_SUB_CHANNEL_VALID_FLAG	BIT(2)
+#define CXL_MEMDEV_SPARING_NIB_MASK_VALID_FLAG	BIT(3)
+
+/* See CXL rev 3.1 @8.2.9.7.2.3 Table 8-119 Memory Sparing Feature Readable Attributes */
+struct cxl_memdev_sparing_rd_attrs {
+	struct cxl_memdev_repair_rd_attrs_hdr hdr;
+	u8 rsvd;
+	__le16 restriction_flags;
+} __packed;
+
+/* See CXL rev 3.1 sec 8.2.9.7.1.4 Table 8-105 Memory Sparing Input Payload */
+struct cxl_memdev_sparing_in_payload {
+	u8 flags;
+	u8 channel;
+	u8 rank;
+	u8 nibble_mask[3];
+	u8 bank_group;
+	u8 bank;
+	u8 row[3];
+	__le16 column;
+	u8 sub_channel;
+} __packed;
+
+static int cxl_mem_sparing_get_attrs(struct cxl_mem_sparing_context *cxl_sparing_ctx,
+				     struct cxl_memdev_sparing_params *params)
+{
+	size_t rd_data_size = sizeof(struct cxl_memdev_sparing_rd_attrs);
+	struct cxl_memdev *cxlmd = cxl_sparing_ctx->cxlmd;
+	struct cxl_mailbox *cxl_mbox = &cxlmd->cxlds->cxl_mbox;
+	u16 restriction_flags;
+	size_t data_size;
+	u16 return_code;
+	struct cxl_memdev_sparing_rd_attrs *rd_attrs __free(kfree) =
+				kmalloc(rd_data_size, GFP_KERNEL);
+	if (!rd_attrs)
+		return -ENOMEM;
+
+	data_size = cxl_get_feature(cxl_mbox->features,
+				    cxl_sparing_ctx->repair_uuid,
+				    CXL_GET_FEAT_SEL_CURRENT_VALUE,
+				    rd_attrs, rd_data_size, 0, &return_code);
+	if (!data_size || return_code != CXL_MBOX_CMD_RC_SUCCESS)
+		return -EIO;
+
+	params->op_class = rd_attrs->hdr.op_class;
+	params->op_subclass = rd_attrs->hdr.op_subclass;
+	restriction_flags = le16_to_cpu(rd_attrs->restriction_flags);
+	params->cap_safe_when_in_use = !FIELD_GET(CXL_MEMDEV_SPARING_RD_CAP_SAFE_IN_USE_MASK,
+						  restriction_flags);
+	params->cap_hard_sparing = FIELD_GET(CXL_MEMDEV_SPARING_RD_CAP_HARD_SPARING_MASK,
+					     restriction_flags);
+	params->cap_soft_sparing = FIELD_GET(CXL_MEMDEV_SPARING_RD_CAP_SOFT_SPARING_MASK,
+					     restriction_flags);
+
+	return 0;
+}
+
+static int cxl_mem_do_sparing_op(struct device *dev,
+				 struct cxl_mem_sparing_context *cxl_sparing_ctx,
+				 struct cxl_memdev_sparing_params *rd_params,
+				 enum cxl_mem_sparing_param_type param_type)
+{
+	struct cxl_memdev *cxlmd = cxl_sparing_ctx->cxlmd;
+	struct cxl_memdev_sparing_in_payload sparing_pi;
+	int ret;
+
+	if (!rd_params->cap_safe_when_in_use && cxl_sparing_ctx->dpa) {
+		/* Check if DPA is mapped */
+		if (cxl_dpa_to_region(cxlmd, cxl_sparing_ctx->dpa)) {
+			dev_err(dev, "CXL can't do sparing as DPA is mapped\n");
+			return -EBUSY;
+		}
+	}
+	memset(&sparing_pi, 0, sizeof(sparing_pi));
+	if (param_type == CXL_MEM_SPARING_PARAM_DO_QUERY) {
+		sparing_pi.flags = CXL_MEMDEV_SPARING_QUERY_RESOURCE_FLAG;
+	} else {
+		sparing_pi.flags =
+			FIELD_PREP(CXL_MEMDEV_SPARING_QUERY_RESOURCE_FLAG, 0);
+		/* Should the hard sparing, sub-channel and nibble mask flags also be set for a query? */
+		if (cxl_sparing_ctx->persist_mode == EDAC_MEM_REPAIR_HARD)
+			sparing_pi.flags |=
+				FIELD_PREP(CXL_MEMDEV_SET_HARD_SPARING_FLAG, 1);
+		if (cxl_sparing_ctx->sub_channel)
+			sparing_pi.flags |=
+				FIELD_PREP(CXL_MEMDEV_SPARING_SUB_CHANNEL_VALID_FLAG, 1);
+		if (cxl_sparing_ctx->nibble_mask)
+			sparing_pi.flags |=
+				FIELD_PREP(CXL_MEMDEV_SPARING_NIB_MASK_VALID_FLAG, 1);
+	}
+	/* Common attributes for all memory sparing types */
+	sparing_pi.channel = cxl_sparing_ctx->channel;
+	sparing_pi.rank = cxl_sparing_ctx->rank;
+	put_unaligned_le24(cxl_sparing_ctx->nibble_mask, sparing_pi.nibble_mask);
+
+	if (cxl_sparing_ctx->repair_function == EDAC_CACHELINE_MEM_SPARING ||
+	    cxl_sparing_ctx->repair_function == EDAC_ROW_MEM_SPARING ||
+	    cxl_sparing_ctx->repair_function == EDAC_BANK_MEM_SPARING) {
+		sparing_pi.bank_group = cxl_sparing_ctx->bank_group;
+		sparing_pi.bank = cxl_sparing_ctx->bank;
+	}
+	if (cxl_sparing_ctx->repair_function == EDAC_CACHELINE_MEM_SPARING ||
+	    cxl_sparing_ctx->repair_function == EDAC_ROW_MEM_SPARING)
+		put_unaligned_le24(cxl_sparing_ctx->row, sparing_pi.row);
+	if (cxl_sparing_ctx->repair_function == EDAC_CACHELINE_MEM_SPARING) {
+		sparing_pi.column = cpu_to_le16(cxl_sparing_ctx->column);
+		sparing_pi.sub_channel = cxl_sparing_ctx->sub_channel;
+	}
+
+	ret = cxl_do_maintenance(&cxlmd->cxlds->cxl_mbox, rd_params->op_class,
+				 rd_params->op_subclass,
+				 &sparing_pi, sizeof(sparing_pi));
+	if (ret) {
+		dev_err(dev, "CXL do mem sparing failed ret=%d\n", ret);
+		cxl_sparing_ctx->dpa = 0;
+		cxl_sparing_ctx->nibble_mask = 0;
+		cxl_sparing_ctx->bank_group = 0;
+		cxl_sparing_ctx->bank = 0;
+		cxl_sparing_ctx->rank = 0;
+		cxl_sparing_ctx->row = 0;
+		cxl_sparing_ctx->column = 0;
+		cxl_sparing_ctx->channel = 0;
+		cxl_sparing_ctx->sub_channel = 0;
+		return ret;
+	}
+
+	return 0;
+}
+
+static int cxl_mem_sparing_set_attrs(struct device *dev,
+				     struct cxl_mem_sparing_context *ctx,
+				     enum cxl_mem_sparing_param_type param_type)
+{
+	struct cxl_memdev_sparing_params rd_params;
+	int ret;
+
+	ret = cxl_mem_sparing_get_attrs(ctx, &rd_params);
+	if (ret) {
+		dev_err(dev, "Get cxlmemdev sparing params failed ret=%d\n",
+			ret);
+		return ret;
+	}
+
+	switch (param_type) {
+	case CXL_MEM_SPARING_PARAM_DO_QUERY:
+	case CXL_MEM_SPARING_PARAM_DO_REPAIR:
+		ret = down_read_interruptible(&cxl_region_rwsem);
+		if (ret)
+			return ret;
+		ret = down_read_interruptible(&cxl_dpa_rwsem);
+		if (ret) {
+			up_read(&cxl_region_rwsem);
+			return ret;
+		}
+		ret = cxl_mem_do_sparing_op(dev, ctx, &rd_params, param_type);
+		up_read(&cxl_dpa_rwsem);
+		up_read(&cxl_region_rwsem);
+		return ret;
+	default:
+		return -EINVAL;
+	}
+}
+
+#define CXL_SPARING_GET_ATTR(attrib, data_type)					\
+static int cxl_mem_sparing_get_##attrib(struct device *dev, void *drv_data,	\
+					data_type *val)				\
+{										\
+	struct cxl_mem_sparing_context *ctx = drv_data;				\
+										\
+	*val = ctx->attrib;							\
+										\
+	return 0;								\
+}
+CXL_SPARING_GET_ATTR(repair_function, u32)
+CXL_SPARING_GET_ATTR(persist_mode, u32)
+CXL_SPARING_GET_ATTR(dpa_support, u32)
+CXL_SPARING_GET_ATTR(dpa, u64)
+CXL_SPARING_GET_ATTR(nibble_mask, u64)
+CXL_SPARING_GET_ATTR(bank_group, u32)
+CXL_SPARING_GET_ATTR(bank, u32)
+CXL_SPARING_GET_ATTR(rank, u32)
+CXL_SPARING_GET_ATTR(row, u64)
+CXL_SPARING_GET_ATTR(column, u32)
+CXL_SPARING_GET_ATTR(channel, u32)
+CXL_SPARING_GET_ATTR(sub_channel, u32)
+
+#define CXL_SPARING_SET_ATTR(attrib, data_type)					\
+static int cxl_mem_sparing_set_##attrib(struct device *dev, void *drv_data,	\
+					data_type val)				\
+{										\
+	struct cxl_mem_sparing_context *ctx = drv_data;				\
+										\
+	ctx->attrib = val;							\
+										\
+	return 0;								\
+}
+CXL_SPARING_SET_ATTR(nibble_mask, u64)
+CXL_SPARING_SET_ATTR(bank_group, u32)
+CXL_SPARING_SET_ATTR(bank, u32)
+CXL_SPARING_SET_ATTR(rank, u32)
+CXL_SPARING_SET_ATTR(row, u64)
+CXL_SPARING_SET_ATTR(column, u32)
+CXL_SPARING_SET_ATTR(channel, u32)
+CXL_SPARING_SET_ATTR(sub_channel, u32)
+
+static int cxl_mem_sparing_set_persist_mode(struct device *dev, void *drv_data, u32 persist_mode)
+{
+	struct cxl_mem_sparing_context *ctx = drv_data;
+
+	switch (persist_mode) {
+	case EDAC_MEM_REPAIR_SOFT:
+		ctx->persist_mode = EDAC_MEM_REPAIR_SOFT;
+		return 0;
+	case EDAC_MEM_REPAIR_HARD:
+		ctx->persist_mode = EDAC_MEM_REPAIR_HARD;
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
+static int cxl_get_mem_sparing_safe_when_in_use(struct device *dev, void *drv_data,
+						u32 *safe)
+{
+	struct cxl_mem_sparing_context *ctx = drv_data;
+	struct cxl_memdev_sparing_params params;
+	int ret;
+
+	ret = cxl_mem_sparing_get_attrs(ctx, &params);
+	if (ret)
+		return ret;
+
+	*safe = params.cap_safe_when_in_use;
+
+	return 0;
+}
+
+static int cxl_mem_sparing_get_min_dpa(struct device *dev, void *drv_data,
+				       u64 *min_dpa)
+{
+	struct cxl_mem_sparing_context *ctx = drv_data;
+	struct cxl_memdev *cxlmd = ctx->cxlmd;
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+	*min_dpa = cxlds->dpa_res.start;
+
+	return 0;
+}
+
+static int cxl_mem_sparing_get_max_dpa(struct device *dev, void *drv_data,
+				       u64 *max_dpa)
+{
+	struct cxl_mem_sparing_context *ctx = drv_data;
+	struct cxl_memdev *cxlmd = ctx->cxlmd;
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+	*max_dpa = cxlds->dpa_res.end;
+
+	return 0;
+}
+
+static int cxl_mem_sparing_set_dpa(struct device *dev, void *drv_data, u64 dpa)
+{
+	struct cxl_mem_sparing_context *ctx = drv_data;
+	struct cxl_memdev *cxlmd = ctx->cxlmd;
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+	if (!dpa || dpa < cxlds->dpa_res.start || dpa > cxlds->dpa_res.end)
+		return -EINVAL;
+
+	ctx->dpa = dpa;
+
+	return 0;
+}
+
+static int cxl_do_mem_sparing(struct device *dev, void *drv_data, u32 val)
+{
+	struct cxl_mem_sparing_context *ctx = drv_data;
+
+	if (val != EDAC_DO_MEM_REPAIR)
+		return -EINVAL;
+
+	return cxl_mem_sparing_set_attrs(dev, ctx, CXL_MEM_SPARING_PARAM_DO_REPAIR);
+}
+
+#define RANK_OPS \
+	.get_repair_function = cxl_mem_sparing_get_repair_function, \
+	.get_persist_mode = cxl_mem_sparing_get_persist_mode, \
+	.set_persist_mode = cxl_mem_sparing_set_persist_mode, \
+	.get_repair_safe_when_in_use = cxl_get_mem_sparing_safe_when_in_use, \
+	.get_dpa_support = cxl_mem_sparing_get_dpa_support, \
+	.get_min_dpa = cxl_mem_sparing_get_min_dpa, \
+	.get_max_dpa = cxl_mem_sparing_get_max_dpa, \
+	.get_dpa = cxl_mem_sparing_get_dpa, \
+	.set_dpa = cxl_mem_sparing_set_dpa, \
+	.get_nibble_mask = cxl_mem_sparing_get_nibble_mask, \
+	.set_nibble_mask = cxl_mem_sparing_set_nibble_mask, \
+	.get_rank = cxl_mem_sparing_get_rank, \
+	.set_rank = cxl_mem_sparing_set_rank, \
+	.get_channel = cxl_mem_sparing_get_channel, \
+	.set_channel = cxl_mem_sparing_set_channel, \
+	.do_repair = cxl_do_mem_sparing
+
+#define BANK_OPS \
+	RANK_OPS, \
+	.get_bank_group = cxl_mem_sparing_get_bank_group, \
+	.set_bank_group = cxl_mem_sparing_set_bank_group, \
+	.get_bank = cxl_mem_sparing_get_bank, \
+	.set_bank = cxl_mem_sparing_set_bank
+
+#define ROW_OPS \
+	BANK_OPS, \
+	.get_row = cxl_mem_sparing_get_row, \
+	.set_row = cxl_mem_sparing_set_row
+
+#define CACHELINE_OPS \
+	ROW_OPS, \
+	.get_column = cxl_mem_sparing_get_column, \
+	.set_column = cxl_mem_sparing_set_column, \
+	.get_sub_channel = cxl_mem_sparing_get_sub_channel, \
+	.set_sub_channel = cxl_mem_sparing_set_sub_channel
+
+static const struct edac_mem_repair_ops cxl_rank_sparing_ops = {
+	RANK_OPS,
+};
+
+static const struct edac_mem_repair_ops cxl_bank_sparing_ops = {
+	BANK_OPS,
+};
+
+static const struct edac_mem_repair_ops cxl_row_sparing_ops = {
+	ROW_OPS,
+};
+
+static const struct edac_mem_repair_ops cxl_cacheline_sparing_ops = {
+	CACHELINE_OPS,
+};
+
+struct cxl_mem_sparing_desc {
+	const uuid_t repair_uuid;
+	enum edac_mem_repair_function repair_function;
+	enum edac_mem_repair_persist_mode persist_mode;
+	enum cxl_mem_sparing_granularity granularity;
+	const struct edac_mem_repair_ops *repair_ops;
+};
+
+static const struct cxl_mem_sparing_desc mem_sparing_desc[] = {
+	{
+		.repair_uuid = CXL_FEAT_CACHELINE_SPARING_UUID,
+		.repair_function = EDAC_CACHELINE_MEM_SPARING,
+		.persist_mode = EDAC_MEM_REPAIR_SOFT,
+		.granularity = CXL_MEM_SPARING_CACHELINE,
+		.repair_ops = &cxl_cacheline_sparing_ops,
+	},
+	{
+		.repair_uuid = CXL_FEAT_ROW_SPARING_UUID,
+		.repair_function = EDAC_ROW_MEM_SPARING,
+		.persist_mode = EDAC_MEM_REPAIR_SOFT,
+		.granularity = CXL_MEM_SPARING_ROW,
+		.repair_ops = &cxl_row_sparing_ops,
+	},
+	{
+		.repair_uuid = CXL_FEAT_BANK_SPARING_UUID,
+		.repair_function = EDAC_BANK_MEM_SPARING,
+		.persist_mode = EDAC_MEM_REPAIR_SOFT,
+		.granularity = CXL_MEM_SPARING_BANK,
+		.repair_ops = &cxl_bank_sparing_ops,
+	},
+	{
+		.repair_uuid = CXL_FEAT_RANK_SPARING_UUID,
+		.repair_function = EDAC_RANK_MEM_SPARING,
+		.persist_mode = EDAC_MEM_REPAIR_SOFT,
+		.granularity = CXL_MEM_SPARING_RANK,
+		.repair_ops = &cxl_rank_sparing_ops,
+	},
+};
+
 static int cxl_memdev_scrub_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr,
 				 struct edac_dev_feature *ras_feature, u8 scrub_inst)
 {
@@ -1087,6 +1518,51 @@ static int cxl_memdev_soft_ppr_init(struct cxl_memdev *cxlmd,
 	return 0;
 }
 
+static int cxl_memdev_sparing_init(struct cxl_memdev *cxlmd,
+				   struct edac_dev_feature *ras_feature,
+				   const struct cxl_mem_sparing_desc *desc,
+				   u8 repair_inst)
+{
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	struct cxl_mailbox *cxl_mbox = &cxlds->cxl_mbox;
+	struct cxl_mem_sparing_context *cxl_sparing_ctx;
+	struct cxl_feat_entry *feat_entry;
+
+	feat_entry = cxl_get_supported_feature_entry(cxl_mbox->features,
+						     &desc->repair_uuid);
+	if (IS_ERR(feat_entry))
+		return -EOPNOTSUPP;
+
+	if (!(le32_to_cpu(feat_entry->flags) & CXL_FEAT_ENTRY_FLAG_CHANGABLE))
+		return -EOPNOTSUPP;
+
+	cxl_sparing_ctx = devm_kzalloc(&cxlmd->dev, sizeof(*cxl_sparing_ctx),
+				       GFP_KERNEL);
+	if (!cxl_sparing_ctx)
+		return -ENOMEM;
+
+	*cxl_sparing_ctx = (struct cxl_mem_sparing_context) {
+		.repair_uuid = desc->repair_uuid,
+		.get_feat_size = le16_to_cpu(feat_entry->get_feat_size),
+		.set_feat_size = le16_to_cpu(feat_entry->set_feat_size),
+		.get_version = feat_entry->get_feat_ver,
+		.set_version = feat_entry->set_feat_ver,
+		.effects = le16_to_cpu(feat_entry->effects),
+		.cxlmd = cxlmd,
+		.repair_function = desc->repair_function,
+		.persist_mode = desc->persist_mode,
+		.granularity = desc->granularity,
+		.dpa_support = true,
+		.instance = repair_inst,
+	};
+	ras_feature->ft_type = RAS_FEAT_MEM_REPAIR;
+	ras_feature->instance = cxl_sparing_ctx->instance;
+	ras_feature->mem_repair_ops = desc->repair_ops;
+	ras_feature->ctx = cxl_sparing_ctx;
+
+	return 0;
+}
+
 int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr)
 {
 	struct edac_dev_feature ras_features[CXL_DEV_NUM_RAS_FEATURES];
@@ -1094,7 +1570,7 @@ int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr)
 	int num_ras_features = 0;
 	u8 repair_inst = 0;
 	u8 scrub_inst = 0;
-	int rc;
+	int rc, i;
 
 	rc = cxl_memdev_scrub_init(cxlmd, cxlr, &ras_features[num_ras_features],
 				   scrub_inst);
@@ -1136,6 +1612,18 @@ int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr)
 	num_ras_features++;
 
 feat_soft_ppr_done:
+	for (i = 0; i < ARRAY_SIZE(mem_sparing_desc); i++) {
+		rc = cxl_memdev_sparing_init(cxlmd, &ras_features[num_ras_features],
+					     &mem_sparing_desc[i], repair_inst);
+		if (rc == -EOPNOTSUPP)
+			continue;
+		if (rc < 0)
+			return rc;
+
+		repair_inst++;
+		num_ras_features++;
+	}
+
 feat_register:
 	return edac_dev_register(&cxlmd->dev, cxl_dev_name, NULL,
 				 num_ras_features, ras_features);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 01/19] EDAC: Add support for EDAC device features control
  2025-01-06 12:09 ` [PATCH v18 01/19] EDAC: Add support for EDAC device features control shiju.jose
@ 2025-01-06 13:37   ` Borislav Petkov
  2025-01-06 14:48     ` Shiju Jose
  2025-01-13 15:06   ` Mauro Carvalho Chehab
  2025-01-30 19:18   ` Daniel Ferguson
  2 siblings, 1 reply; 87+ messages in thread
From: Borislav Petkov @ 2025-01-06 13:37 UTC (permalink / raw)
  To: shiju.jose
  Cc: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel,
	tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

On Mon, Jan 06, 2025 at 12:09:57PM +0000, shiju.jose@huawei.com wrote:
> +int edac_dev_register(struct device *parent, char *name,
> +		      void *private, int num_features,
> +		      const struct edac_dev_feature *ras_features)
> +{
> +	const struct attribute_group **ras_attr_groups;
> +	struct edac_dev_feat_ctx *ctx;
> +	int attr_gcnt = 0;
> +	int ret, feat;
> +
> +	if (!parent || !name || !num_features || !ras_features)
> +		return -EINVAL;
> +
> +	/* Double parse to make space for attributes */
> +	for (feat = 0; feat < num_features; feat++) {
> +		switch (ras_features[feat].ft_type) {
> +		/* Add feature specific code */
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx)
> +		return -ENOMEM;
> +
> +	ras_attr_groups = kcalloc(attr_gcnt + 1, sizeof(*ras_attr_groups), GFP_KERNEL);
> +	if (!ras_attr_groups) {
> +		ret = -ENOMEM;
> +		goto ctx_free;
> +	}
> +
> +	attr_gcnt = 0;
> +	for (feat = 0; feat < num_features; feat++, ras_features++) {
> +		switch (ras_features->ft_type) {
> +		/* Add feature specific code */
> +		default:
> +			ret = -EINVAL;
> +			goto groups_free;
> +		}
> +	}
> +
> +	ctx->dev.parent = parent;
> +	ctx->dev.bus = edac_get_sysfs_subsys();
> +	ctx->dev.type = &edac_dev_type;
> +	ctx->dev.groups = ras_attr_groups;
> +	ctx->private = private;
> +	dev_set_drvdata(&ctx->dev, ctx);
> +
> +	ret = dev_set_name(&ctx->dev, name);
> +	if (ret)
> +		goto groups_free;
> +
> +	ret = device_register(&ctx->dev);
> +	if (ret) {
> +		put_device(&ctx->dev);
> +		return ret;

Who is freeing ctx and ras_attr_groups when you return here?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v18 01/19] EDAC: Add support for EDAC device features control
  2025-01-06 13:37   ` Borislav Petkov
@ 2025-01-06 14:48     ` Shiju Jose
  0 siblings, 0 replies; 87+ messages in thread
From: Shiju Jose @ 2025-01-06 14:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm


>-----Original Message-----
>From: Borislav Petkov <bp@alien8.de>
>Sent: 06 January 2025 13:38
>To: Shiju Jose <shiju.jose@huawei.com>
>Subject: Re: [PATCH v18 01/19] EDAC: Add support for EDAC device features
>control
>
>On Mon, Jan 06, 2025 at 12:09:57PM +0000, shiju.jose@huawei.com wrote:
>> +int edac_dev_register(struct device *parent, char *name,
>> +		      void *private, int num_features,
>> +		      const struct edac_dev_feature *ras_features) {
>> +	const struct attribute_group **ras_attr_groups;
>> +	struct edac_dev_feat_ctx *ctx;
>> +	int attr_gcnt = 0;
>> +	int ret, feat;
>> +
>> +	if (!parent || !name || !num_features || !ras_features)
>> +		return -EINVAL;
>> +
>> +	/* Double parse to make space for attributes */
>> +	for (feat = 0; feat < num_features; feat++) {
>> +		switch (ras_features[feat].ft_type) {
>> +		/* Add feature specific code */
>> +		default:
>> +			return -EINVAL;
>> +		}
>> +	}
>> +
>> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>> +	if (!ctx)
>> +		return -ENOMEM;
>> +
>> +	ras_attr_groups = kcalloc(attr_gcnt + 1, sizeof(*ras_attr_groups),
>GFP_KERNEL);
>> +	if (!ras_attr_groups) {
>> +		ret = -ENOMEM;
>> +		goto ctx_free;
>> +	}
>> +
>> +	attr_gcnt = 0;
>> +	for (feat = 0; feat < num_features; feat++, ras_features++) {
>> +		switch (ras_features->ft_type) {
>> +		/* Add feature specific code */
>> +		default:
>> +			ret = -EINVAL;
>> +			goto groups_free;
>> +		}
>> +	}
>> +
>> +	ctx->dev.parent = parent;
>> +	ctx->dev.bus = edac_get_sysfs_subsys();
>> +	ctx->dev.type = &edac_dev_type;
>> +	ctx->dev.groups = ras_attr_groups;
>> +	ctx->private = private;
>> +	dev_set_drvdata(&ctx->dev, ctx);
>> +
>> +	ret = dev_set_name(&ctx->dev, name);
>> +	if (ret)
>> +		goto groups_free;
>> +
>> +	ret = device_register(&ctx->dev);
>> +	if (ret) {
>> +		put_device(&ctx->dev);
>> +		return ret;
>
>Who is freeing ctx and ras_attr_groups when you return here?

Hi Boris,

ctx and ras_attr_groups are freed in the release callback,
edac_dev_release(struct device *dev).

Thanks,
Shiju

>
>--
>Regards/Gruss,
>    Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 02/19] EDAC: Add scrub control feature
  2025-01-06 12:09 ` [PATCH v18 02/19] EDAC: Add scrub control feature shiju.jose
@ 2025-01-06 15:57   ` Borislav Petkov
  2025-01-06 19:34     ` Shiju Jose
  2025-01-13 15:50   ` Mauro Carvalho Chehab
  2025-01-30 19:18   ` Daniel Ferguson
  2 siblings, 1 reply; 87+ messages in thread
From: Borislav Petkov @ 2025-01-06 15:57 UTC (permalink / raw)
  To: shiju.jose
  Cc: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel,
	tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

On Mon, Jan 06, 2025 at 12:09:58PM +0000, shiju.jose@huawei.com wrote:
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index f9cf19d8d13d..a162726cc6b9 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_EDAC)			:= edac_core.o
>  
>  edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
>  edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
> +edac_core-y	+= scrub.o

You're not being serious here - this scrub gunk is enabled by default on
*everything*?

So the main user of this is going to be CXL, AFAICT, so the scrubbing gunk
should depend at least on it or so. Definitely not unconditionally enabled on
every build.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v18 02/19] EDAC: Add scrub control feature
  2025-01-06 15:57   ` Borislav Petkov
@ 2025-01-06 19:34     ` Shiju Jose
  2025-01-07  7:32       ` Borislav Petkov
  0 siblings, 1 reply; 87+ messages in thread
From: Shiju Jose @ 2025-01-06 19:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

>-----Original Message-----
>From: Borislav Petkov <bp@alien8.de>
>Sent: 06 January 2025 15:58
>To: Shiju Jose <shiju.jose@huawei.com>
>Subject: Re: [PATCH v18 02/19] EDAC: Add scrub control feature
>
>On Mon, Jan 06, 2025 at 12:09:58PM +0000, shiju.jose@huawei.com wrote:
>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile index
>> f9cf19d8d13d..a162726cc6b9 100644
>> --- a/drivers/edac/Makefile
>> +++ b/drivers/edac/Makefile
>> @@ -10,6 +10,7 @@ obj-$(CONFIG_EDAC)			:= edac_core.o
>>
>>  edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
>>  edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
>> +edac_core-y	+= scrub.o
>
>You're not being serious here - this scrub gunk is enabled by default on
>*everything*?
>
>So the main user of this is going to be CXL, AFAICT, so the scrubbing gunk should
>depend at least on it or so. Definitely not unconditionally enabled on every build.
Thanks for the comment.
Do I understand correctly that you meant the following changes (diff against
this patch) for scrub, and similarly for the other features?
Please let me know if any corrections are needed.
    
========================================
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 6078f02e883b..7886097f998f 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -158,7 +158,7 @@ config CXL_RAS_FEATURES
 	tristate "CXL: Memory RAS features"
 	depends on CXL_PCI
 	depends on CXL_MEM
-	depends on EDAC
+	select EDAC_FEAT_SCRUB
 	help
 	  The CXL memory RAS feature control is optional and allows host to
 	  control the RAS features configurations of CXL Type 3 devices.
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 06f7b43a6f78..709bd7ad8015 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -9,6 +9,14 @@ config EDAC_ATOMIC_SCRUB
 config EDAC_SUPPORT
 	bool
 
+config EDAC_FEAT_SCRUB
+	bool
+	help
+	  The EDAC scrub feature is optional and is designed to control the
+	  memory scrubbers in the system. The common sysfs scrub interface
+	  abstracts the control of various arbitrary scrubbing functionalities
+	  into a unified set of functions.
+
 menuconfig EDAC
 	tristate "EDAC (Error Detection And Correction) reporting"
 	depends on HAS_IOMEM && EDAC_SUPPORT && RAS
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index 1de9fe66ac6b..71a522272215 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -10,7 +10,9 @@ obj-$(CONFIG_EDAC)			:= edac_core.o
 
 edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
 edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
-edac_core-y	+= scrub.o ecs.o mem_repair.o
+
+edac_core-$(CONFIG_EDAC_FEAT_SCRUB)	+= scrub.o

 edac_core-$(CONFIG_EDAC_DEBUG)		+= debugfs.o
 
diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
index b77790bdc73a..870f3466c2f7 100644
--- a/drivers/ras/Kconfig
+++ b/drivers/ras/Kconfig
@@ -49,7 +49,7 @@ config RAS_FMPM
 config MEM_ACPI_RAS2
 	tristate "Memory ACPI RAS2 driver"
 	depends on ACPI_RAS2
-	depends on EDAC
+	select EDAC_FEAT_SCRUB
 	help
 	  The driver binds to the platform device added by the ACPI RAS2
 	  table parser. Use a PCC channel subspace for communicating with
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 5d07192bf1a7..0f6c7f3582c3 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -698,9 +698,16 @@ struct edac_scrub_ops {
 	int (*set_cycle_duration)(struct device *dev, void *drv_data, u32 cycle);
 };
 
+#if IS_ENABLED(CONFIG_EDAC_FEAT_SCRUB)
 int edac_scrub_get_desc(struct device *scrub_dev,
 			const struct attribute_group **attr_groups,
 			u8 instance);
+#else
+static inline int edac_scrub_get_desc(struct device *scrub_dev,
+				      const struct attribute_group **attr_groups,
+				      u8 instance)
+{ return -EOPNOTSUPP; }
+#endif
 
 ...
=============================================
>
>--
>Regards/Gruss,
>    Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette
>

Thanks,
Shiju

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 02/19] EDAC: Add scrub control feature
  2025-01-06 19:34     ` Shiju Jose
@ 2025-01-07  7:32       ` Borislav Petkov
  2025-01-07  9:23         ` Shiju Jose
  2025-01-08 15:47         ` Shiju Jose
  0 siblings, 2 replies; 87+ messages in thread
From: Borislav Petkov @ 2025-01-07  7:32 UTC (permalink / raw)
  To: Shiju Jose
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Mon, Jan 06, 2025 at 07:34:41PM +0000, Shiju Jose wrote:
> My understanding is that you meant the following changes (diff to this
> patch), for scrub?  (and similar for other features).  Please let me know if
> you need any corrections.

Yes, something like that except "select" is evil and should be used only when
the items it selects do not pull in more stuff. And since scrub is all
optional, it should all be depends.

> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index 06f7b43a6f78..709bd7ad8015 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -9,6 +9,14 @@ config EDAC_ATOMIC_SCRUB
>  config EDAC_SUPPORT
>  	bool
>  
> +config EDAC_FEAT_SCRUB

EDAC_SCRUB is perfectly fine.

> +	bool
> +	help
> +	  The EDAC scrub feature is optional and is designed to control the
> +	  memory scrubbers in the system. The common sysfs scrub interface
> +	  abstracts the control of various arbitrary scrubbing functionalities
> +	  into a unified set of functions.

This should come...

> +
>  menuconfig EDAC
>  	tristate "EDAC (Error Detection And Correction) reporting"
>  	depends on HAS_IOMEM && EDAC_SUPPORT && RAS

... in here as it is part of EDAC.
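Putting both points together, the resulting arrangement would look roughly like this (a sketch only: the if EDAC/endif wrapper reflects the usual drivers/edac/Kconfig structure, and the help text is kept as posted):

```kconfig
menuconfig EDAC
	tristate "EDAC (Error Detection And Correction) reporting"
	depends on HAS_IOMEM && EDAC_SUPPORT && RAS

if EDAC

# Renamed from EDAC_FEAT_SCRUB per the review; moved inside the EDAC menu.
config EDAC_SCRUB
	bool
	help
	  The EDAC scrub feature is optional and is designed to control the
	  memory scrubbers in the system. The common sysfs scrub interface
	  abstracts the control of various arbitrary scrubbing functionalities
	  into a unified set of functions.

endif # EDAC
```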

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v18 02/19] EDAC: Add scrub control feature
  2025-01-07  7:32       ` Borislav Petkov
@ 2025-01-07  9:23         ` Shiju Jose
  2025-01-08 15:47         ` Shiju Jose
  1 sibling, 0 replies; 87+ messages in thread
From: Shiju Jose @ 2025-01-07  9:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

>-----Original Message-----
>From: Borislav Petkov <bp@alien8.de>
>Sent: 07 January 2025 07:32
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org; linux-
>acpi@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org;
>tony.luck@intel.com; rafael@kernel.org; lenb@kernel.org;
>mchehab@kernel.org; dan.j.williams@intel.com; dave@stgolabs.net; Jonathan
>Cameron <jonathan.cameron@huawei.com>; dave.jiang@intel.com;
>alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
>david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com;
>Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com;
>Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
>naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com;
>somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com;
>duenwen@google.com; gthelen@google.com;
>wschwartz@amperecomputing.com; dferguson@amperecomputing.com;
>wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei
><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto
>Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com;
>wanghuiqiang <wanghuiqiang@huawei.com>; Linuxarm
><linuxarm@huawei.com>
>Subject: Re: [PATCH v18 02/19] EDAC: Add scrub control feature
>
>On Mon, Jan 06, 2025 at 07:34:41PM +0000, Shiju Jose wrote:
>> My understanding is that you meant the following changes (diff to this
>> patch), for scrub?  (and similar for other features).  Please let me
>> know if you need any corrections.
>
>Yes, something like that except "select" is evil and should be used only when the
>items it selects do not pull in more stuff. And since scrub is all optional, it should
>all be depends.

Sure.
>
>> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig index
>> 06f7b43a6f78..709bd7ad8015 100644
>> --- a/drivers/edac/Kconfig
>> +++ b/drivers/edac/Kconfig
>> @@ -9,6 +9,14 @@ config EDAC_ATOMIC_SCRUB  config EDAC_SUPPORT
>>  	bool
>>
>> +config EDAC_FEAT_SCRUB
>
>EDAC_SCRUB is perfectly fine.

Sure. I will change.
>
>> +	bool
>> +	help
>> +	  The EDAC scrub feature is optional and is designed to control the
>> +	  memory scrubbers in the system. The common sysfs scrub interface
>> +	  abstracts the control of various arbitrary scrubbing functionalities
>> +	  into a unified set of functions.
>
>This should come...
>
>> +
>>  menuconfig EDAC
>>  	tristate "EDAC (Error Detection And Correction) reporting"
>>  	depends on HAS_IOMEM && EDAC_SUPPORT && RAS
>
>... in here as it is part of EDAC.

I will move in here. 
>
>Thx.
>
>--
>Regards/Gruss,
>    Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

Thanks,
Shiju


^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v18 02/19] EDAC: Add scrub control feature
  2025-01-07  7:32       ` Borislav Petkov
  2025-01-07  9:23         ` Shiju Jose
@ 2025-01-08 15:47         ` Shiju Jose
  1 sibling, 0 replies; 87+ messages in thread
From: Shiju Jose @ 2025-01-08 15:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

>-----Original Message-----
>From: Borislav Petkov <bp@alien8.de>
[...]
>Subject: Re: [PATCH v18 02/19] EDAC: Add scrub control feature
>
>On Mon, Jan 06, 2025 at 07:34:41PM +0000, Shiju Jose wrote:
>> My understanding is that you meant the following changes (diff to this
>> patch), for scrub?  (and similar for other features).  Please let me
>> know if you need any corrections.
>
>Yes, something like that except "select" is evil and should be used only when the
>items it selects do not pull in more stuff. And since scrub is all optional, it should
>all be depends.
>
>> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig index
>> 06f7b43a6f78..709bd7ad8015 100644
>> --- a/drivers/edac/Kconfig
>> +++ b/drivers/edac/Kconfig
>> @@ -9,6 +9,14 @@ config EDAC_ATOMIC_SCRUB  config EDAC_SUPPORT
>>  	bool
>>
>> +config EDAC_FEAT_SCRUB
>
>EDAC_SCRUB is perfectly fine.
>
>> +	bool
>> +	help
>> +	  The EDAC scrub feature is optional and is designed to control the
>> +	  memory scrubbers in the system. The common sysfs scrub interface
>> +	  abstracts the control of various arbitrary scrubbing functionalities
>> +	  into a unified set of functions.
>
>This should come...
>
>> +
>>  menuconfig EDAC
>>  	tristate "EDAC (Error Detection And Correction) reporting"
>>  	depends on HAS_IOMEM && EDAC_SUPPORT && RAS
>
>... in here as it is part of EDAC.
>
>Thx.
>
Hi Boris,

I have incorporated your suggestions for the next version.
However, before sending the next version, I am wondering: are you planning
further review of this v18 series, or do you have any other feedback?

Thanks,
Shiju

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-06 12:10 ` [PATCH v18 04/19] EDAC: Add memory repair " shiju.jose
@ 2025-01-09  9:19   ` Borislav Petkov
  2025-01-09 11:00     ` Shiju Jose
  2025-01-14 11:47   ` Mauro Carvalho Chehab
  2025-01-14 13:47   ` Mauro Carvalho Chehab
  2 siblings, 1 reply; 87+ messages in thread
From: Borislav Petkov @ 2025-01-09  9:19 UTC (permalink / raw)
  To: shiju.jose
  Cc: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel,
	tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

On Mon, Jan 06, 2025 at 12:10:00PM +0000, shiju.jose@huawei.com wrote:
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel

So this is new. I don't remember seeing that when I looked at your patches
the last time.

Looks like you have all those attributes and now you've decided to add a min
and max for each one, in addition. And UI-wise it is a madness as there are
gazillion single-value files now.

"Attributes should be ASCII text files, preferably with only one value per
file. It is noted that it may not be efficient to contain only one value per
file, so it is socially acceptable to express an array of values of the same
type."

So you don't need those - you can simply express each attribute as a range:

echo "1:2" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank

or if you wanna scrub only one bank:

echo "1:1" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank

What is the use case of that thing?

Someone might find it useful so let's add it preemptively?

Pfff.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-09  9:19   ` Borislav Petkov
@ 2025-01-09 11:00     ` Shiju Jose
  2025-01-09 12:32       ` Borislav Petkov
  0 siblings, 1 reply; 87+ messages in thread
From: Shiju Jose @ 2025-01-09 11:00 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

>-----Original Message-----
>From: Borislav Petkov <bp@alien8.de>
[...]
>Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
>
>On Mon, Jan 06, 2025 at 12:10:00PM +0000, shiju.jose@huawei.com wrote:
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel
>
>So this is new. I don't remember seeing that when I looked at your patches the
>last time.
>
>Looks like you have all those attributes and now you've decided to add a min and
>max for each one, in addition. And UI-wise it is a madness as there are gazillion
>single-value files now.
>

Thanks for the feedback.

The min_ and max_ attributes of the control attributes were added in response to your
feedback on v15, to expose the supported ranges of these control attributes to the
user; see the following links.
However, these min_ and max_ attributes are 'RO' rather than 'RW' as specified in
the doc; that will be fixed in the doc.
https://lore.kernel.org/lkml/20241114133249.GEZzX8ATNyc_Xw1L52@fat_crate.local/
https://lore.kernel.org/lkml/fa5d6bdd08104cf1a09c4960a0f9bc46@huawei.com/
https://lore.kernel.org/lkml/20241119123657.GCZzyGaZIExvUHPLKL@fat_crate.local/

>"Attributes should be ASCII text files, preferably with only one value per file. It is
>noted that it may not be efficient to contain only one value per file, so it is
>socially acceptable to express an array of values of the same type."
>
>So you don't need those - you can simply express each attribute as a range:
>
>echo "1:2" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
>
>or if you wanna scrub only one bank:

After internal discussion, we think this is the source of the confusion.
This is not scrub, where a range would indeed make sense; it is repair.
We are not aware of a failure mechanism where a set of memory banks
would fail together but not the whole of the next level up in the memory topology.

In theory we might get a stupid device design that reports coarse-grained
errors but can only repair at a finer granularity, where a range might be
appropriate. We are not sure that makes sense in practice, and with a range
interface we would get a mess such as running out of repair resources halfway
through a list with no visibility of what has been repaired.

However, given that the repair flow is driven by userspace receiving error
records, which provide the only possible values to repair, we think these
bounds on what can be repaired are a nice-to-have rather than a necessity,
so we propose not adding max_ and min_ for now and seeing how the use cases
evolve.
>
>echo "1:1" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
>
>What is the use case of that thing?
>
>Someone might find it useful so let's add it preemptively?
>
>Pfff.
>
>--
>Regards/Gruss,
>    Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

Thanks,
Shiju

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-09 11:00     ` Shiju Jose
@ 2025-01-09 12:32       ` Borislav Petkov
  2025-01-09 14:24         ` Jonathan Cameron
  0 siblings, 1 reply; 87+ messages in thread
From: Borislav Petkov @ 2025-01-09 12:32 UTC (permalink / raw)
  To: Shiju Jose
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote:
> The min_ and max_ attributes of the control attributes were added in response to your
> feedback on v15, to expose the supported ranges of these control attributes to the user,
> in the following links.

Sure, but you can make that differently:

cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
[x:y]

which is the allowed range.

echo ... 

then writes in the bank.
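A toy sketch of that single-attribute scheme, using a scratch file to stand in for the sysfs attribute (the path, the "[0:31]" range and the bank value are all made up; in a real driver the range would come from a show() callback and the write would go to a store() callback):

```shell
# Mock of the suggested UI: reading the attribute reports the allowed
# range, writing it picks a value within that range.
dir=$(mktemp -d)
bank="$dir/bank"

printf '[0:31]\n' > "$bank"    # what a show() callback might emit
range=$(cat "$bank")           # userspace discovers the allowed range

echo 3 > "$bank"               # the store() step: select a bank in range
val=$(cat "$bank")

rm -rf "$dir"
```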

> ... so we would propose we do not add max_ and min_ for now and see how the
> use cases evolve.

Yes, you should apply that same methodology to the rest of the new features
you're adding: only add functionality for the stuff that is actually being
used now. You can always extend it later.

Changing an already user-visible API is a whole different story and a lot lot
harder, even impossible.

So I'd suggest you prune the EDAC patches from all the hypothetical usage and
then send only what remains so that I can try to queue them.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-09 12:32       ` Borislav Petkov
@ 2025-01-09 14:24         ` Jonathan Cameron
  2025-01-09 15:18           ` Borislav Petkov
  2025-01-14 12:38           ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 87+ messages in thread
From: Jonathan Cameron @ 2025-01-09 14:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Thu, 9 Jan 2025 13:32:22 +0100
Borislav Petkov <bp@alien8.de> wrote:

Hi Boris,

> On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote:
> > The min_ and max_ attributes of the control attributes were added in response to your
> > feedback on v15, to expose the supported ranges of these control attributes to the user,
> > in the following links.
> 
> Sure, but you can make that differently:
> 
> cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
> [x:y]
> 
> which is the allowed range.

To my thinking that would fail the test of being an intuitive interface.
To issue a repair command requires that multiple attributes be configured
before triggering the actual repair.

Think of it as setting the coordinates of the repair in a high dimensional
space.

In the extreme case of fine grained repair (Cacheline), to identify the
relevant subunit of memory (obtained from the error record that we are
basing the decision to repair on) we need to specify all of:

Channel, sub-channel, rank,  bank group, row, column and nibble mask.
For coarser-granularity repair only a subset of these applies, and
only the relevant controls are exposed to userspace.

They are broken out as specific attributes to enable each to be set before
triggering the action with a write to the repair attribute.

There are several possible alternatives:

Option 1

"A:B:C:D:E:F:G:H:I:J" opaque single write to trigger the repair where
each number provides one of those coordinates and where a readback
lets us know what each number is.

That single attribute interface is very hard to extend in an intuitive way.

History tells us more levels will be introduced in the middle, not just
at the finest granularity, making such an interface hard to extend in
a backwards compatible way.

Another alternative, a key-value list, would make for a nasty sysfs
interface.

Option 2 
There are sysfs interfaces that use a selection type presentation.

Write: "C", Read: "A, B, [C], D" but that only works well for discrete sets
of options and is a pain to parse if read back is necessary.

So in conclusion, I think the proposed multiple sysfs attribute style
with them reading back the most recent value written is the least bad
solution to a complex control interface.
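As a concrete illustration of that multi-attribute style, the flow would look roughly like this (the attribute names mirror the patch, but the mock directory merely stands in for /sys/bus/edac/devices/<dev-name>/mem_repairX/):

```shell
# Set each coordinate from the error record, read any of them back,
# then trigger the repair with a final write.
d=$(mktemp -d)
for f in channel rank bank_group row column repair; do : > "$d/$f"; done

echo 1    > "$d/channel"     # coordinates taken from the error record
echo 0    > "$d/rank"
echo 2    > "$d/bank_group"
echo 4096 > "$d/row"

# Unlike an opaque "A:B:C:..." write, each coordinate can be queried
# before committing:
row=$(cat "$d/row")

echo 1 > "$d/repair"         # the driver's store() would start the repair
trigger=$(cat "$d/repair")

rm -rf "$d"
```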

> 
> echo ... 
> 
> then writes in the bank.
> 
> > ... so we would propose we do not add max_ and min_ for now and see how the
> > use cases evolve.  
> 
> Yes, you should apply that same methodology to the rest of the new features
> you're adding: only add functionality for the stuff that is actually being
> used now. You can always extend it later.
> 
> Changing an already user-visible API is a whole different story and a lot lot
> harder, even impossible.
> 
> So I'd suggest you prune the EDAC patches from all the hypothetical usage and
> then send only what remains so that I can try to queue them.

Sure. In this case the addition of min/max was perhaps a wrong response to
your request for a way to discover those ranges, rather than just rejecting
a write of something out of range as earlier versions did.

We can revisit in future if range discovery becomes necessary.  Personally
I don't think it is, given we are only taking these actions in response to
error records that give us precisely what to write and hence are always in range.

Jonathan

> 
> Thx.
> 


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-09 14:24         ` Jonathan Cameron
@ 2025-01-09 15:18           ` Borislav Petkov
  2025-01-09 16:01             ` Jonathan Cameron
  2025-01-14 12:38           ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 87+ messages in thread
From: Borislav Petkov @ 2025-01-09 15:18 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Thu, Jan 09, 2025 at 02:24:33PM +0000, Jonathan Cameron wrote:
> To my thinking that would fail the test of being an intuitive interface.
> To issue a repair command requires that multiple attributes be configured
> before triggering the actual repair.
> 
> Think of it as setting the coordinates of the repair in a high dimensional
> space.

Why?

You can write every attribute in its separate file and have a "commit" or
"start" file which does that.

Or you can designate a file which starts the process. This is how I'm
injecting errors on x86:

see readme_msg here: arch/x86/kernel/cpu/mce/inject.c

More specifically:

"flags:\t Injection type to be performed. Writing to this file will trigger a\n"
"\t real machine check, an APIC interrupt or invoke the error decoder routines\n"
"\t for AMD processors.\n"

So you set everything else, and as the last step you set the injection type
*and* you also trigger it with this one write.

> Sure. In this case the addition of min/max was perhaps a wrong response to
> your request for a way to discover those ranges, rather than just rejecting
> a write of something out of range as earlier versions did.
> 
> We can revisit in future if range discovery becomes necessary.  Personally
> I don't think it is, given we are only taking these actions in response to
> error records that give us precisely what to write and hence are always in range.

My goal here was to make this user-friendly, because you need some way of
knowing what the valid ranges are in order to trigger the repair, if it needs
to happen for a range.

Or, you can teach the repair logic to ignore invalid ranges and "clamp" things
to whatever makes sense.

Again, I'm looking at it from the usability perspective. I haven't actually
needed this scrub+repair functionality yet to know whether the UI makes sense.
So yeah, collecting some feedback from real-life use cases would probably give
you a lot better understanding of how that UI should be designed... perhaps
you won't ever need the ranges, who knows.

So yes, preemptively designing stuff like that "in the dark" is kinda hard.
:-)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-09 15:18           ` Borislav Petkov
@ 2025-01-09 16:01             ` Jonathan Cameron
  2025-01-09 16:19               ` Borislav Petkov
  2025-01-14 12:57               ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 87+ messages in thread
From: Jonathan Cameron @ 2025-01-09 16:01 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Thu, 9 Jan 2025 16:18:54 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Thu, Jan 09, 2025 at 02:24:33PM +0000, Jonathan Cameron wrote:
> > To my thinking that would fail the test of being an intuitive interface.
> > To issue a repair command requires that multiple attributes be configured
> > before triggering the actual repair.
> > 
> > Think of it as setting the coordinates of the repair in a high dimensional
> > space.  
> 
> Why?

Ok. To me the fact it's not a single write was relevant; seemingly not
in your mental model of how this works.  For me a single write
that you cannot query back is fine; setting lots of parameters and
being unable to query any of them, less so.  I guess you disagree.
In the interests of progress I'm not going to argue further. No one is
going to use this interface by hand anyway, so the loss of usability
I'm seeing doesn't matter a lot.

> 
> You can write every attribute in its separate file and have a "commit" or
> "start" file which does that.

That's what we have.

> 
> Or you can designate a file which starts the process. This is how I'm
> injecting errors on x86:
> 
> see readme_msg here: arch/x86/kernel/cpu/mce/inject.c
> 
> More specifically:
> 
> "flags:\t Injection type to be performed. Writing to this file will trigger a\n"
> "\t real machine check, an APIC interrupt or invoke the error decoder routines\n"
> "\t for AMD processors.\n"
> 
> So you set everything else, and as the last step you set the injection type
> *and* you also trigger it with this one write.

Agreed. I'm not sure of the relevance though. This is how it works and
there is no proposal to change that.  

What I was trying to argue for was an interface that lets you set all the
coordinates and read back what they were before hitting go.

> 
> > Sure. In this case the addition of min/max was perhaps a wrong response to
> > your request for a way to discover those ranges, rather than just rejecting
> > a write of something out of range as earlier versions did.
> > 
> > We can revisit in future if range discovery becomes necessary.  Personally
> > I don't think it is, given we are only taking these actions in response to
> > error records that give us precisely what to write and hence are always in range.
> 
> My goal here was to make this user-friendly, because you need some way of
> knowing what the valid ranges are in order to trigger the repair, if it needs
> to happen for a range.

In at least the CXL case I'm fairly sure most of them are not discoverable.
Until you see errors you have no idea what the memory topology is.

> 
> Or, you can teach the repair logic to ignore invalid ranges and "clamp" things
> to whatever makes sense.

For that you'd need to have a path to read back what happened.

> 
> Again, I'm looking at it from the usability perspective. I haven't actually
> needed this scrub+repair functionality yet to know whether the UI makes sense.
> So yeah, collecting some feedback from real-life use cases would probably give
> you a lot better understanding of how that UI should be designed... perhaps
> you won't ever need the ranges, who knows.
> 
> So yes, preemptively designing stuff like that "in the dark" is kinda hard.
> :-)

The discoverability is unnecessary for any known usecase.

Ok. Then can we just drop the range discoverability entirely, or do we go with
your suggestion and not support read back of what has been
requested, but instead have the reads return a range if known, or "" /
return -EOPNOTSUPP if simply not known?


I can live with that though to me we are heading in the direction of
a less intuitive interface to save a small number of additional files.

Jonathan

> 


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-09 16:01             ` Jonathan Cameron
@ 2025-01-09 16:19               ` Borislav Petkov
  2025-01-09 18:34                 ` Jonathan Cameron
  2025-01-14 12:57               ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 87+ messages in thread
From: Borislav Petkov @ 2025-01-09 16:19 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Thu, Jan 09, 2025 at 04:01:59PM +0000, Jonathan Cameron wrote:
> Ok. To me the fact it's not a single write was relevant. Seems not
> in your mental model of how this works.  For me a single write
> that you cannot query back is fine, setting lots of parameters and
> being unable to query any of them less so.  I guess you disagree.

Why can't you query it back?

grep -r . /sysfs/dir/

All files' values have been previously set and should still be there on
a read, I'd strongly hope. Your ->read routines should give the values back.

> In interests of progress I'm not going to argue further. No one is
> going to use this interface by hand anyway so the loss of usability
> I'm seeing doesn't matter a lot.

I had the suspicion that this user interface is not really going to be used by
a user but by a tool. But then if you don't have a tool, you're lost.

This is one of the reasons why you can control ftrace directly on the shell
too - without a tool. This is very useful in certain cases where you cannot
run some userspace tools.

> In at least the CXL case I'm fairly sure most of them are not discoverable.
> Until you see errors you have no idea what the memory topology is.

Ok.

> For that you'd need to have a path to read back what happened.

So how is this scrubbing going to work? You get an error, you parse it for all
the attributes and you go and write those attributes into the scrub interface
and it starts scrubbing?

But then why do you even need the interface at all?

Why can't the kernel automatically collect all those attributes and start the
scrubbing automatically - no need for any user interaction...?

So why do you *actually* even need user interaction here and why can't the
kernel be smart enough to start the scrub automatically?

> Ok. Then can we just drop the range discoverability entirely, or do we go with
> your suggestion and not support read back of what has been
> requested, but instead have the reads return a range if known, or "" /
> return -EOPNOTSUPP if simply not known?

Probably.

> I can live with that though to me we are heading in the direction of
> a less intuitive interface to save a small number of additional files.

This is not the point. I already alluded to this earlier - we're talking about
a user visible interface which, once it goes out, it is cast in stone forever.

So those files better have a good reason to exist...

And if we're not sure yet, we can upstream only those which are fine now and
then continue discussing the rest.

HTH.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-09 16:19               ` Borislav Petkov
@ 2025-01-09 18:34                 ` Jonathan Cameron
  2025-01-09 23:51                   ` Dan Williams
                                     ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Jonathan Cameron @ 2025-01-09 18:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Thu, 9 Jan 2025 17:19:02 +0100
Borislav Petkov <bp@alien8.de> wrote:
	
> On Thu, Jan 09, 2025 at 04:01:59PM +0000, Jonathan Cameron wrote:
> > Ok. To me the fact it's not a single write was relevant. Seems not
> > in your mental model of how this works.  For me a single write
> > that you cannot query back is fine, setting lots of parameters and
> > being unable to query any of them less so.  I guess you disagree.  
> 
> Why can't you query it back?
> 
> grep -r . /sysfs/dir/
> 
> All files' values have been previously set and should still be there on
> a read, I'd strongly hope. Your ->read routines should give the values back.

Today you can.  Seems we are talking cross purposes.

I'm confused. I thought your proposal was for "bank" attribute to present an
allowed range on read.
"bank" attribute is currently written to and read back as the value of the bank on which
to conduct a repair.  Maybe this disconnect is down to the fact max_ and min_
attributes should have been marked as RO in the docs. They aren't controls,
just presentation of limits to userspace.

Was the intent a separate bank_range type attribute rather than max_bank, min_bank?
One of those would be absolutely fine (similar to the _available attributes
in IIO - I added those years ago to meet a similar need and we've never had
any issues with those).

> 
> > In interests of progress I'm not going to argue further. No one is
> > going to use this interface by hand anyway so the loss of usability
> > I'm seeing doesn't matter a lot.  
> 
> I had the suspicion that this user interface is not really going to be used by
> a user but by a tool. But then if you don't have a tool, you're lost.
> 
> This is one of the reasons why you can control ftrace directly on the shell
> too - without a tool. This is very useful in certain cases where you cannot
> run some userspace tools.

I fully agree. What I was saying was in response to me thinking you wanted it
to not be possible to read back the user set values (overlapping uses of
single bank attribute which wasn't what you meant). That is useful for a user
wanting to do the cat /sys/... that you mention above, but not vital if they are
directly reading the tracepoints for the error records and poking the
sysfs interface.

Given it seems I misunderstood that suggestion, ignore my reply to that
as irrelevant.
 
> 
> > In at least the CXL case I'm fairly sure most of them are not discoverable.
> > Until you see errors you have no idea what the memory topology is.  
> 
> Ok.
> 
> > For that you'd need to have a path to read back what happened.  
> 
> So how is this scrubbing going to work? You get an error, you parse it for all
> the attributes and you go and write those attributes into the scrub interface
> and it starts scrubbing?

Repair, not scrubbing. They are different things we should keep separate:
scrub corrects the value, if it can, but doesn't switch the underlying memory to
new memory cells to avoid repeated errors. Replacing scrub with repair
(which I think was the intent here)...

You get error records that describe the error seen in hardware, write back the
values into this interface and tell it to repair the memory.  This is not
necessarily a synchronous or immediate thing - instead typically based on
trend analysis.

As an example, the decision might be that a bit of RAM threw up 3 errors
over a month including multiple system reboots (for other reasons) and
that is over some threshold so we use a spare memory line to replace it.
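
The write-back flow described above can be sketched roughly as below. This is a
hedged illustration only: the sysfs path and attribute names are assumptions
based on the error record fields discussed in this thread, not a confirmed ABI,
and the `write` hook exists so the sketch can be exercised without real sysfs
files.

```python
import os

REPAIR_DIR = "/sys/bus/edac/devices/cxl_mem0/mem_repair0"  # assumed path

def request_repair(record, repair_dir=REPAIR_DIR, write=None):
    """Write each coordinate from a logged error record into its (assumed)
    sysfs attribute, then trigger the repair with the final write."""
    if write is None:
        def write(attr, value):
            with open(os.path.join(repair_dir, attr), "w") as f:
                f.write(str(value))
    # Values come straight from the error record trace point fields.
    for attr in ("channel", "rank", "bank", "row", "column"):
        write(attr, record[attr])
    # The last write both sets the request and triggers the operation.
    write("repair", 1)
```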

> 
> But then why do you even need the interface at all?
> 
> Why can't the kernel automatically collect all those attributes and start the
> scrubbing automatically - no need for any user interaction...?
> 
> So why do you *actually* even need user interaction here and why can't the
> kernel be smart enough to start the scrub automatically?

Short answer, it needs to be very smart and there isn't a case of one size
fits all - hence the suggested approach of making it a userspace problem.

There are hardware autonomous solutions and ones handled by host firmware.
That is how repair is done in many servers - at most software sees a slight
latency spike as the memory is repaired under the hood. Some CXL devices
will do this as well. Those CXL devices may provide an additional repair
interface for the less clear cut decisions that need more data processing
/ analysis than the device firmware is doing. Other CXL devices will take
the view the OS is best placed to make all the decisions - those sometimes
will give a 'maintenance needed' indication in the error records but that
is still a hint the host may or may not take any notice of.

Given that, in the systems being considered here, software is triggering the repair,
we want to allow for policy in the decision. In simple cases we could push
that policy into the kernel e.g. just repair the moment we see an error record.

These repair resources are very limited in number, so immediately repairing
may be a bad idea. We want to build up a history of errors before making
such a decision.  That can be done in kernel. 

The decision to repair memory is heavily influenced by policy and time considerations
against device resource constraints.

Some options that are hard to do in the kernel:

1. Typical asynchronous error report for a corrected error.

   Tells us memory had an error (perhaps from a scrubbing engine on the device
   running checks). No need to take action immediately. Instead build up more data
   over time and if lots of errors occur, make the decision to repair, as now we are
   sure it is worth doing rather than a single random event. We may tune scrubbing engines
   to check this memory more frequently and adjust our data analysis to take that
   into account for setting thresholds etc.
   When an admin considers it a good time to take action, offline the memory and
   repair before bringing it back into use (sometimes by rebooting the machine).
   Sometimes repair can be triggered in a software transparent way, sometimes not.
   This also applies to uncorrectable errors though in that case you can't necessarily
   repair it without ever seeing a synchronous poison with all the impacts that has.

2. Soft repair across boots.  We are actually storing the error records, then only
   applying the fix on reboot before using the memory - so maintaining a list
   of bad memory and saving it to a file to read back on boot. We could provide
   another kernel interface to get this info and reinject it after reboot instead
   of doing it in userspace but that is another ABI to design.

3. Complex policy across fleets.  A lot of work is going on around prediction techniques
   that may change the local policy on each node depending on the overall reliability
   patterns of a particular batch of devices, local characteristics, service guarantees
   etc. If it is hard repair, then once you've run out you need schedule an engineer
   out to replace the DIMM. All complex inputs to the decision.
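
Case 2 above could be sketched roughly as follows; the state file location and
record format are made up purely for illustration, and in practice this would
live in a userspace tool such as rasdaemon rather than the kernel:

```python
import json, os

STATE = "/var/lib/rasdaemon/bad_mem.json"  # illustrative location

def record_bad_line(rec, state=STATE):
    """Append an error record's repair coordinates to the persistent
    bad-memory list (soft repair across boots, usecase 2 above)."""
    entries = load_bad_lines(state)
    if rec not in entries:
        entries.append(rec)
        tmp = state + ".tmp"
        with open(tmp, "w") as f:
            json.dump(entries, f)
        os.replace(tmp, state)  # atomic, so a crash can't corrupt the list

def load_bad_lines(state=STATE):
    """At boot, read the list back so each entry can be replayed into the
    repair interface before the memory is onlined."""
    try:
        with open(state) as f:
            return json.load(f)
    except FileNotFoundError:
        return []
```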

Similar cases like CPU offlining on repeated errors are done in userspace (e.g.
RAS Daemon) for similar reasons of long term data gathering and potentially
complex algorithms.
  
> 
> > Ok. Then can we just drop the range discoverability entirely, or do we go with
> > your suggestion and not support read back of what has been
> > requested, but instead have the reads return a range if known, or "" /
> > return -EOPNOTSUPP if simply not known?  
> 
> Probably.

Too many options in the above paragraph so just to check...  Probably to which?
If it's a separate attribute from the one we write the control to, then
we do what is already done here and don't present the interface at all if
the range isn't discoverable.

> 
> > I can live with that though to me we are heading in the direction of
> > a less intuitive interface to save a small number of additional files.  
> 
> This is not the point. I already alluded to this earlier - we're talking about
> a user visible interface which, once it goes out, it is cast in stone forever.
> 
> So those files better have a good reason to exist...
> 
> And if we're not sure yet, we can upstream only those which are fine now and
> then continue discussing the rest.

Ok. Best path is drop the available range support then (so no min_ max_ or
anything to replace them for now).

Added bonus is we don't have to rush this conversation and can make sure we
come to the right solution driven by use cases.

Jonathan

> HTH.
> 



* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-09 18:34                 ` Jonathan Cameron
@ 2025-01-09 23:51                   ` Dan Williams
  2025-01-10 11:01                     ` Jonathan Cameron
  2025-01-11 17:12                   ` Borislav Petkov
  2025-01-14 13:10                   ` Mauro Carvalho Chehab
  2 siblings, 1 reply; 87+ messages in thread
From: Dan Williams @ 2025-01-09 23:51 UTC (permalink / raw)
  To: Jonathan Cameron, Borislav Petkov
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

Jonathan Cameron wrote:
> Ok. Best path is drop the available range support then (so no min_ max_ or
> anything to replace them for now).

I think less is more in this case.

The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too
wide for userspace to have a chance at writing a competent tool. At
least I am struggling with where to even begin with those ABIs if I was
asked to write a tool. Does a tool already exist for those?

Some questions that read on those ABIs are:

1/ What if the platform has translation between HPA (CXL decode) and SPA
(physical addresses reported in trace points that PIO and DMA see)?

2/ What if memory is interleaved across repair domains? 

3/ What if the device does not use DDR terminology / topology terms for
repair?

I expect the flow rasdaemon would want is that the current PFA (leaky
bucket Pre-Failure Analysis) decides that the number of soft-offlines it
has performed exceeds some threshold and it wants to attempt to repair
memory.
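
A toy version of that leaky-bucket decision, with made-up rates and thresholds
purely for illustration (rasdaemon's actual PFA logic is more involved):

```python
import time

class LeakyBucket:
    """Errors fill the bucket, elapsed time drains it; repair (or
    soft-offline) is only considered once the level crosses the
    threshold, filtering out isolated random events."""
    def __init__(self, threshold=5.0, leak_per_hour=1.0, now=None):
        self.threshold = threshold
        self.leak_per_hour = leak_per_hour
        self.level = 0.0
        self.last = time.monotonic() if now is None else now

    def record_error(self, now=None):
        now = time.monotonic() if now is None else now
        hours = (now - self.last) / 3600.0
        # Drain for the elapsed time, then add this error.
        self.level = max(0.0, self.level - hours * self.leak_per_hour)
        self.last = now
        self.level += 1.0
        return self.level >= self.threshold  # True => consider repair
```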

However, what is missing today for volatile memory is that some failures
can be repaired with in-band writes and some failures need heavier
hammers like Post-Package-Repair to actively swap in whole new banks of
memory. So don't we need something like "soft-offline-undo" on the way
to PPR?

So, yes, +1 to simpler for now where software effectively just needs to
deal with a handful of "region repair" buttons and the semantics of
those are coarse and sub-optimal. Wait for a future where a tool author
says, "we have had good success getting bulk offlined pages back into
service, but now we need this specific finer grained kernel interface to
avoid wasting spare banks prematurely".

Anything more complex than a set of /sys/devices/system/memory/
devices has a /sys/bus/edac/devices/devX/repair button, feels like a
generation ahead of where the initial sophistication needs to lie.

That said, I do not closely follow ras tooling to say whether someone
has already identified the critical need for a fine grained repair ABI?


* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-09 23:51                   ` Dan Williams
@ 2025-01-10 11:01                     ` Jonathan Cameron
  2025-01-10 22:49                       ` Dan Williams
  0 siblings, 1 reply; 87+ messages in thread
From: Jonathan Cameron @ 2025-01-10 11:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Borislav Petkov, Shiju Jose, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	tony.luck@intel.com, rafael@kernel.org, lenb@kernel.org,
	mchehab@kernel.org, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Thu, 9 Jan 2025 15:51:39 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > Ok. Best path is drop the available range support then (so no min_ max_ or
> > anything to replace them for now).  
> 
> I think less is more in this case.

A few notes before I get to specific questions.

Key in the discussion that follows is that the 'repair' is separate
from the 'decision to repair'.  They mostly need different information,
all of which is in the error trace points. A lot of the questions
are about the 'decision to repair' part not the repair itself.

A critical point is that on some devices (all CXL ones) there may
be no direct way to discover the mapping from HPA/SPA/DPA to bank, row, etc.
other than the error record. The memory mapping functions are an internal
detail not exposed on any interfaces. Odd though it may seem, those
mapping functions are considered confidential enough that manufacturers
don't always publish them (though I believe they are fairly easy to reverse
engineer) - I know a team whose job involves designing those.

Anyhow, short of the kernel or RAS Daemon carrying a look up table
of all known devices (no support for new ones until they are added) we
can't do a reverse map from DPA etc to bank. There are complex ways
round this like storing the mappings when issuing an error record,
to build up the necessary reverse map, but that would have to be
preserved across boot. These errors tend not to be frequent, so
cross-reboot / kexec cases need to be incorporated.

PPR on CXL does use DPA, but memory sparing commands are meant to
supersede that interface (the reason for that is perhaps bordering on
consortium confidential, but let's say it doesn't work well for some
cases). Memory sparing does not use DPA.

I'd advise mostly ignoring PPR and looking at memory sparing in
the CXL spec if you want to consider an example. For PPR DPA is used
(there is an HPA option that might be available). DPA is still needed
for on-boot soft repair (or we could delay that until regions are configured,
but then we'd need to do DPA to HPA mapping, as that will be different
on a new config, and then write the HPA for the kernel to map it back to DPA).

> 
> The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too
> wide for userspace to have a chance at writing a competent tool. At
> least I am struggling with where to even begin with those ABIs if I was
> asked to write a tool. Does a tool already exist for those?

There is little choice on that - those are the controls for this type
of repair. If we do something like a magic 'key' based on a concatenation
of those values we need to define that description to replace a clean
self describing interface. I'm not 100% against that but I think it would
be a terrible interface design and I don't think anyone is really in favor of it.

All a userspace tool does is read the error record fields of
exactly those names.  From that it will log data (already happening under
those names in RAS daemon alongside HPA / DPA).  Then, in the simplest case,
a threshold is passed and we write those values to the repair interface. 

There is zero need in that simple case for these to be understood at all.
You can think of them as a complex key but divided into well defined fields. 

For more complex decision algorithms, that structure info may be needed
to make the decision. As a dumb example, maybe certain banks are more
error prone on a batch of devices so we need a lower threshold before repairing.

Simplest case is maybe 20-30 lines of code looping over the result of an SQL
query on the RASDaemon DB and writing the values to the files.
Not the most challenging userspace tool.  The complexity is in
the analysis of the errors, not this part. I don't think we bothered
doing this one yet in rasdaemon because we considered it obvious enough
an example wasn't needed. (Mauro / Shiju, does that estimate sound reasonable?)
We would need a couple of variants but those map 1:1 with the variants of
error record parsing and logging RAS Daemon already has.
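
That simple case might look roughly like this; the table and column names are
illustrative stand-ins for rasdaemon's actual schema, and each returned dict
would then be written attribute-by-attribute to the repair interface:

```python
import sqlite3

THRESHOLD = 3  # repair once a location has logged this many errors (made up)

def repair_candidates(db_path):
    """Group logged error records by their repair coordinates and return
    the field/value pairs for any location whose error count crossed the
    threshold.  Schema names here are illustrative, not rasdaemon's."""
    con = sqlite3.connect(db_path)
    cur = con.execute(
        'SELECT channel, rank, bank, "row", "column", COUNT(*) AS n '
        'FROM cxl_general_media_event '
        'GROUP BY channel, rank, bank, "row", "column" '
        'HAVING n >= ?', (THRESHOLD,))
    fields = ("channel", "rank", "bank", "row", "column")
    result = [dict(zip(fields, r[:5])) for r in cur]
    con.close()
    return result
```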

> 
> Some questions that read on those ABIs are:
> 
> 1/ What if the platform has translation between HPA (CXL decode) and SPA
> (physical addresses reported in trace points that PIO and DMA see)?

See later for discussion of other interfaces. This is assuming the
repair key is not HPA (it is on some systems / situations) - if it's
the repair key then that is easily dealt with.

HPA / SPA more or less irrelevant for repair itself, they are relevant
for the decision to repair. In the 'on reboot' soft repair case they may
not even exist at the time of repair as it would be expected to happen
before we've brought up a region (to get the RAM into a good state at boot).

For cases where the memory decoders are configured and so there is an HPA
to DPA mapping:
The trace reports provide both all these somewhat magic values and
the HPA.  Thus we can do the HPA-aware stuff on that before looking
up the other bits of the appropriate error records to get the bank, row, etc.

> 
> 2/ What if memory is interleaved across repair domains? 

Also not relevant to a repair control and only a little relevant to the
decision to repair.  The domains would be handled separately but if
we have to offline a chunk of memory to do it (needed for repair
on some devices) that may be a bigger chunk if fine grained interleave
in use and that may affect the decision.

> 
> 3/ What if the device does not use DDR terminology / topology terms for
> repair?

Then we provide the additional interfaces, assuming they correspond to
well-known terms.  If someone is using a magic key then we can get grumpy
with them, but that can also be supported.

Mostly I'd expect a new technology to overlap a lot of the existing
interface and maybe add one or two more; which layer in the stack for
HBM for instance.

The main alternative is where the device takes an HPA / SPA / DPA. We have one
driver that does that queued up behind this series that uses HPA. PPR uses
DPA.  In that case userspace first tries to see if it can repair by HPA then
DPA and if not moves on to see if it can use the fuller description.
We will see devices supporting HPA / DPA (which to use depends on when you
are doing the operation and what has been configured) but mostly I'd expect
either HPA/DPA or fine grained on a given repair instance.

HPA only works if the address decoders are always configured (so not on CXL).
What is actually happening in that case is typically that firmware is
involved that can look up address decoders etc, and map the control HPA
to Bank / row etc to issue the actual low level commands.  This keeps
the memory mapping completely secret rather than exposing it in error
records.

> 
> I expect the flow rasdaemon would want is that the current PFA (leaky
> bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> has performed exceeds some threshold and it wants to attempt to repair
> memory.

Sparing may happen prior to the point where we'd have done a soft offline
if non-disruptive (whether it is can be read from another bit of the
ABI).  Memory repair might be much less disruptive than soft-offline!
I rather hope memory manufacturers build that, but I'm aware of at least
one case where they didn't and the memory must be offline.

> 
> However, what is missing today for volatile memory is that some failures
> can be repaired with in-band writes and some failures need heavier
> hammers like Post-Package-Repair to actively swap in whole new banks of
> memory. So don't we need something like "soft-offline-undo" on the way
> to PPR?

Ultimately we may do. That discussion was in one of the earlier threads
on the more heavyweight case of recovery from poison (unfortunately I can't
find the thread) - the ask was for example code so that the complexity
could be weighed against the demand for this sort of live repair or a lesser
version where repair can only be done once a region is offline (and parts
poisoned).

However, there are other usecases where this isn't needed which is why
that isn't a precursor for this series.

Initial enablement targets two situations:
1) Repair can be done in non disruptive way - no need to soft offline at all.
2) Repair can be done at boot before memory is onlined or on admin
   action to take the whole region offline, then repair specific chunks of
   memory before bringing it back online.

> 
> So, yes, +1 to simpler for now where software effectively just needs to
> deal with a handful of "region repair" buttons and the semantics of
> those are coarse and sub-optimal. Wait for a future where a tool author
> says, "we have had good success getting bulk offlined pages back into
> service, but now we need this specific finer grained kernel interface to
> avoid wasting spare banks prematurely".

Depends on where you think that interface is.  I can absolutely see that
as a control to RAS Daemon.  Option 2 above, region is offline, repair
all dodgy looking fine grained buckets.

Note though that a suboptimal repair may mean permanent use of very rare
resources.  So there needs to be a control at the finest granularity as well.
Which order those get added to userspace tools doesn't matter to me.

If you mean that interface in the kernel, it brings some non-trivial requirements.
The kernel would need all of:
1) Tracking interface for all error records so the reverse map from region
   to specific bank / row etc is available for a subset of entries.  The
   kernel would need to know which of those are important (soft offline
   might help in that use case, otherwise that means decision algorithms
   are in kernel or we have fine grained queue for region repair in parallel
   with soft-offline).
2) A way to inject the reverse map information from a userspace store
  (to deal with reboot etc).

That sounds a lot harder to deal with than relying on the userspace program
that already does the tracking across boots.

> 
> Anything more complex than a set of /sys/devices/system/memory/
> devices has a /sys/bus/edac/devices/devX/repair button, feels like a
> generation ahead of where the initial sophistication needs to lie.
> 
> That said, I do not closely follow ras tooling to say whether someone
> has already identified the critical need for a fine grained repair ABI?

It's not that we necessarily want to repair at fine grain, it's that
the control interface to hardware is fine grained and the reverse mapping
often unknown except for specific error records.

I'm fully on board with simple interfaces for common cases like repair
the bad memory in this region.  I'm just strongly against moving the
complexity of doing that into the kernel.

Jonathan

> 



* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-10 11:01                     ` Jonathan Cameron
@ 2025-01-10 22:49                       ` Dan Williams
  2025-01-13 11:40                         ` Jonathan Cameron
  0 siblings, 1 reply; 87+ messages in thread
From: Dan Williams @ 2025-01-10 22:49 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams
  Cc: Borislav Petkov, Shiju Jose, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	tony.luck@intel.com, rafael@kernel.org, lenb@kernel.org,
	mchehab@kernel.org, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

Jonathan Cameron wrote:
> On Thu, 9 Jan 2025 15:51:39 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Jonathan Cameron wrote:
> > > Ok. Best path is drop the available range support then (so no min_ max_ or
> > > anything to replace them for now).  
> > 
> > I think less is more in this case.
> 
> A few notes before I get to specific questions.
> 
> Key in the discussion that follows is that the 'repair' is separate
> from the 'decision to repair'.  They mostly need different information,
> all of which is in the error trace points. A lot of the questions
> are about the 'decision to repair' part, not the repair itself.
> 
[snipped the parts I agree with]
> I'd advise mostly ignoring PPR and looking at memory sparing in
> the CXL spec if you want to consider an example. For PPR DPA is used
> (there is an HPA option that might be available). DPA is still needed
> for on-boot soft repair (or we could delay that until regions are configured,
> but then we'd need to do DPA to HPA mapping, as that will be different
> on a new config, and then write HPA for the kernel to map back to DPA).

This is helpful because I was indeed getting lost in what kind of
"repair" was being discussed in the thread. Ok, lets focus on sparing
commands.

> 
> > 
> > The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too
> > wide for userspace to have a chance at writing a competent tool. At
> > least I am struggling with where to even begin with those ABIs if I was
> > asked to write a tool. Does a tool already exist for those?
> 
> There is little choice on that - those are the controls for this type
> of repair. If we do something like a magic 'key' based on a concatenation
> of those values we need to define that description to replace a clean
> self describing interface. I'm not 100% against that but I think it would
> be a terrible interface design and I don't think anyone is really in favor of it.
> 
> All a userspace tool does is read the error record fields of
> exactly those names.  From that it will log data (already happening under
> those names in RAS daemon alongside HPA/DPA).  Then, in the simplest case,
> a threshold is passed and we write those values to the repair interface. 
> 
> There is zero need in that simple case for these to be understood at all.

This is where you lose me. The error record is a point in time snapshot
of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security
model for memory operations is based on coordinating with the kernel's
understanding of how that SPA is *currently* being used.

The kernel cannot just take userspace's word for it that potentially
data-changing, or temporary loss-of-use, operations are safe to execute
just because userspace once saw an error record that implicated a given
SPA in the past, especially across reboots.

The SPA:HPA:DPA:DIMM tuple is invalidated on reconfiguration and reboot
events. It follows that the kernel has a security/integrity interest in
declining to act on invalidated tuples. This is solvable within a single
boot as the kernel can cache the error records and userspace can ask to
"trigger sparing based on cached record X".
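The cached-record scheme can be modeled in a few lines (names and behavior are hypothetical; no such kernel ABI exists today): repair requests reference a record the kernel itself cached this boot, and a reboot invalidates everything.

```python
# Hypothetical model of "trigger sparing based on cached record X": the
# kernel caches error records as they arrive; userspace may only request
# a repair by referencing a cached record, never by supplying raw
# topology values, and a reboot flushes the cache.

class RecordCache:
    def __init__(self):
        self.boot_generation = 0
        self.records = {}          # record_id -> (spa, topology)
        self.next_id = 0

    def log_error(self, spa, topology):
        """Kernel side: cache an error record as it is reported."""
        rid = self.next_id
        self.records[rid] = (spa, topology)
        self.next_id += 1
        return rid

    def reboot(self):
        """SPA:HPA:DPA:DIMM tuples are invalidated; flush the cache."""
        self.boot_generation += 1
        self.records.clear()

    def request_repair(self, rid):
        """Userspace side: repair may only reference a cached record."""
        if rid not in self.records:
            raise PermissionError("record not from this boot; refusing repair")
        return self.records[rid]

cache = RecordCache()
rid = cache.log_error(spa=0x1000, topology=("rank0", "bank2", "row7"))
assert cache.request_repair(rid)[0] == 0x1000   # valid within this boot
cache.reboot()
try:
    cache.request_repair(rid)                   # stale record is refused
    ok = False
except PermissionError:
    ok = True
assert ok
```

The point of the model is only that validity is decided by the kernel's own cache, not by whatever topology values userspace chooses to write back.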

For the reboot case when the error record cache is flushed the kernel
needs a reliable way to refill that cache, not an ABI for userspace to
say "take my word for it, this *should* be safe".

[snipped the explanation of replaying the old trace record parameters
data back through sysfs, because that is precisely the hang up I have
with the proposal]

> 
> > 
> > Some questions that read on those ABIs are:
> > 
> > 1/ What if the platform has translation between HPA (CXL decode) and SPA
> > (physical addresses reported in trace points that PIO and DMA see)?
> 
> See later for discussion of other interfaces.. This is assuming the
> repair key is not HPA (it is on some systems / situations) - if it's
> the repair key then that is easily dealt with.
> 
> HPA / SPA are more or less irrelevant for the repair itself; they are relevant
> for the decision to repair. In the 'on reboot' soft repair case they may
> not even exist at the time of repair as it would be expected to happen
> before we've brought up a region (to get the RAM into a good state at boot).
> 
> For cases where the memory decoders are configured and so there is an HPA
> to DPA mapping:
> The trace reports provide both all these somewhat magic values and
> the HPA.  Thus we can do the HPA aware stuff on that before then looking
> up the other bit of the appropriate error reports to get the bank row etc.
> 
> > 
> > 2/ What if memory is interleaved across repair domains? 
> 
> Also not relevant to a repair control and only a little relevant to the
> decision to repair.  The domains would be handled separately, but if
> we have to offline a chunk of memory to do it (needed for repair
> on some devices) that may be a bigger chunk if fine-grained interleave is
> in use, and that may affect the decision.

Again, the repair control assumes that the kernel can just trust
userspace to get it right. When the kernel knows the SPA implications it
can add safety like "you are going to issue sparing on deviceA that will
temporarily take deviceA offline. CXL subsystem tells me deviceA is
interleaved with deviceB in SPA so the whole SPA range needs to be
offline before this operation proceeds". That is not something that
userspace can reliably coordinate.
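The interleave safety check described here might look roughly like this (device and region names are invented; this is a model of the policy, not CXL subsystem code):

```python
# Sketch of the kernel-side safety check: before a disruptive sparing
# command on one device, check whether that device is interleaved with
# others, and require the whole interleaved SPA range to be offline
# first -- coordination userspace cannot reliably do on its own.

regions = {
    # region name -> (interleaved devices, SPA range currently online?)
    "region0": (["deviceA", "deviceB"], True),
}

def sparing_allowed(device, regions):
    for name, (devices, online) in regions.items():
        if device in devices and online:
            return (False,
                    f"{name}: {devices} interleaved; offline whole SPA range first")
    return (True, "ok")

allowed, why = sparing_allowed("deviceA", regions)
assert not allowed            # deviceA is interleaved in an online region

regions["region0"] = (["deviceA", "deviceB"], False)   # region taken offline
allowed, _ = sparing_allowed("deviceA", regions)
assert allowed                # safe to spare once the range is offline
```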

> > 3/ What if the device does not use DDR terminology / topology terms for
> > repair?
> 
> Then we provide the additional interfaces, assuming they correspond to
> well-known terms.  If someone is using a magic key then we can get grumpy
> with them, but that can also be supported.
> 
> Mostly I'd expect a new technology to overlap a lot of the existing
> interface and maybe add one or two more; which layer in the stack for
> HBM for instance.

The concern is the assertion that sysfs needs to care about all these
parameters vs an ABI that says "repair errorX". If persistence and
validity of error records is the concern, let's build an ABI for that and
not depend upon trust in userspace to properly coordinate memory
integrity concerns.

> 
> The main alternative is where the device takes an HPA / SPA / DPA. We have one
> driver that does that queued up behind this series that uses HPA. PPR uses
> DPA.  In that case userspace first tries to see if it can repair by HPA, then
> DPA, and if not moves on to see if it can use the fuller description.
> We will see devices supporting HPA / DPA (which to use depends on when you
> are doing the operation and what has been configured) but mostly I'd expect
> either HPA/DPA or fine grained on a given repair instance.
> 
> HPA only works if the address decoders are always configured (so not on CXL)
> What is actually happening in that case is typically that a firmware is
> involved that can look up address decoders etc, and map the control HPA
> to Bank / row etc to issue the actual low level commands.  This keeps
> the memory mapping completely secret rather than exposing it in error
> records.
> 
> > 
> > I expect the flow rasdaemon would want is that the current PFA (leaky
> > bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> > has performed exceeds some threshold and it wants to attempt to repair
> > memory.
> 
> Sparing may happen prior to the point where we'd have done a soft offline
> if non-disruptive (whether it is can be read from another bit of the
> ABI).  Memory repair might be much less disruptive than soft-offline!
> I rather hope memory manufacturers build that, but I'm aware of at least
> one case where they didn't and the memory must be offline.

That's a good point, spare before offline makes sense.

[..]
> However, there are other use cases where this isn't needed, which is why
> that isn't a precursor for this series.
> 
> Initial enablement targets two situations:
> 1) Repair can be done in non disruptive way - no need to soft offline at all.

Modulo needing to quiesce access over the sparing event?

> 2) Repair can be done at boot before memory is onlined or on admin
>    action to take the whole region offline, then repair specific chunks of
>    memory before bringing it back online.

Which is userspace racing the kernel to online memory?

> > So, yes, +1 to simpler for now where software effectively just needs to
> > deal with a handful of "region repair" buttons and the semantics of
> > those are coarse and sub-optimal. Wait for a future where a tool author
> > says, "we have had good success getting bulk offlined pages back into
> > service, but now we need this specific finer grained kernel interface to
> > avoid wasting spare banks prematurely".
> 
> Depends on where you think that interface is.  I can absolutely see that
> as a control to RAS Daemon.  Option 2 above: region is offline, repair
> all dodgy-looking fine-grained buckets.
> 
> Note though that a suboptimal repair may mean permanent use of very rare
> resources.  So there needs to be a control at the finest granularity as well.
> Which order those get added to userspace tools doesn't matter to me.
> 
> If you mean that interface in kernel it brings some non trivial requirements.
> The kernel would need all of:
> 1) Tracking interface for all error records so the reverse map from region
>    to specific bank / row etc is available for a subset of entries.  The
>    kernel would need to know which of those are important (soft offline
>    might help in that use case, otherwise that means decision algorithms
>    are in kernel or we have a fine-grained queue for region repair in parallel
>    with soft-offline).
> 2) A way to inject the reverse map information from a userspace store
>   (to deal with reboot etc).

Not a way to inject the reverse map information, a way to inject the
error records and assert that memory topology changes have not
invalidated those records.

> That sounds a lot harder to deal with than relying on the userspace program
> that already does the tracking across boots.

I am stuck behind the barrier of: userspace must not assume it knows
better than the kernel about the SPA impact of a DIMM sparing
event. The kernel needs evidence: either live records from within the
same kernel boot, or validated records from a previous boot.

...devices could also help us out here with a way to replay DIMM error
events. That would allow for refreshing error records even if the
memory topology changed, because the new record would generate a refreshed
SPA:HPA:DPA:DIMM tuple.

> > Anything more complex than a set of /sys/devices/system/memory/
> > devices has a /sys/bus/edac/devices/devX/repair button, feels like a
> > generation ahead of where the initial sophistication needs to lie.
> > 
> > That said, I do not closely follow ras tooling to say whether someone
> > has already identified the critical need for a fine grained repair ABI?
> 
> It's not that we necessarily want to repair at fine grain, it's that
> the control interface to hardware is fine grained and the reverse mapping
> often unknown except for specific error records.
> 
> I'm fully on board with simple interfaces for common cases like repair
> the bad memory in this region.  I'm just strongly against moving the
> complexity of doing that into the kernel.

Yes, we are just caught up on where that "...but no simpler" line is
drawn.


* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-09 18:34                 ` Jonathan Cameron
  2025-01-09 23:51                   ` Dan Williams
@ 2025-01-11 17:12                   ` Borislav Petkov
  2025-01-13 11:07                     ` Jonathan Cameron
  2025-01-14 13:10                   ` Mauro Carvalho Chehab
  2 siblings, 1 reply; 87+ messages in thread
From: Borislav Petkov @ 2025-01-11 17:12 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Thu, Jan 09, 2025 at 06:34:48PM +0000, Jonathan Cameron wrote:
> Today you can.  Seems we are talking at cross purposes.
> 
> I'm confused. I thought your proposal was for "bank" attribute to present an
> allowed range on read.
> "bank" attribute is currently written to and read back as the value of the bank on which
> to conduct a repair.  Maybe this disconnect is down to the fact max_ and min_
> attributes should have been marked as RO in the docs. They aren't controls,
> just presentation of limits to userspace.
> 
> Was the intent a separate bank_range type attribute rather than max_bank, min_bank?

I don't know - I'm just throwing ideas out there. You could do:

cat /sys/.../bank

and that gives you

[<low> <current_value> <high>]

So you have all the needed information. Dunno if this would be abusing sysfs
rules too much tho.

> > 
> > > In at least the CXL case I'm fairly sure most of them are not discoverable.
> > > Until you see errors you have no idea what the memory topology is.  
> > 
> > Ok.
> > 
> > > For that you'd need to have a path to read back what happened.  
> > 
> > So how is this scrubbing going to work? You get an error, you parse it for all
> > the attributes and you go and write those attributes into the scrub interface
> > and it starts scrubbing?
> 
> Repair, not scrubbing. They are different things we should keep separate:
> scrub corrects the value, if it can, but doesn't swap the underlying memory
> for new memory cells to avoid repeated errors. Replacing scrub with repair
> (which I think was the intent here)...

Really?

So how is scrubbing defined for CXL? You read memory, do ECC check on it,
report any potential errors but write back the *original* wrong value?!

I thought the point of scrubbing is to repair it while at it too...

> You get error records that describe the error seen in hardware, write back the
> values into this interface and tell it to repair the memory.  This is not
> necessarily a synchronous or immediate thing - instead typically based on
> trend analysis.

This is just silly: I'm scrubbing, I found an error, I should simply fix it
while at it. Why would I need an additional command to repair it?!

> As an example, the decision might be that bit of ram threw up 3 errors
> over a month including multiple system reboots (for other reasons) and
> that is over some threshold so we use a spare memory line to replace it.

Right.

> Short answer, it needs to be very smart and there isn't a case of one size
> fits all - hence the suggested approach of making it a userspace problem.

Making it a userspace problem is pretty much always a sign that the hw design
failed.

> Given that, in the systems being considered here, software is triggering the
> repair, we want to allow for policy in the decision.

Right, you can leave a high-level decision to userspace: repair only when
idle, repair only non-correctable errors, blabla but exposing every single
aspect of every single error... meh.

> In simple cases we could push that policy into the kernel e.g. just repair
> the moment we see an error record.
> 
> These repair resources are very limited in number, so immediately repairing
> may a bad idea. We want to build up a history of errors before making
> such a decision.  That can be done in kernel. 

Yap, we are doing this now:

drivers/ras/cec.c

Userspace involvement is minimal, if at all. It is mostly controlling the
parameters of the leaky bucket.

> The decision to repair memory is heavily influenced by policy and time considerations
> against device resource constraints.
> 
> Some options that are hard to do in kernel.
> 
> 1. Typical asynchronous error report for a corrected error.
> 
>    Tells us memory had an error (perhaps from a scrubbing engine on the device
>    running checks). No need to take action immediately. Instead build up more data
>    over time and if lots of errors occur make the decision to repair, as now we are sure it
>    is worth doing rather than a single random event. We may tune scrubbing engines
>    to check this memory more frequently and adjust our data analysis to take that
>    into account for setting thresholds etc.

See above.

What happens when your daemon dies and loses all that collected data?

> 2. Soft repair across boots.  We are actually storing the error records, then only
>    applying the fix on reboot before using the memory - so maintaining a list
>    of bad memory and saving it to a file to read back on boot. We could provide
>    another kernel interface to get this info and reinject it after reboot instead
>    of doing it in userspace but that is another ABI to design.

We did something similar recently: drivers/ras/amd/fmpm.c. It basically
"replays" errors from persistent storage as that memory cannot be replaced.

> 3. Complex policy across fleets.  A lot of work is going on around prediction techniques
>    that may change the local policy on each node dependent on the overall reliability
>    patterns of a particular batch of devices and local characteristics, service guarantees
>    etc. If it is hard repair, then once you've run out you need to schedule an engineer
>    out to replace the DIMM. All complex inputs to the decision.

You probably could say here: "repair or report when this and that." or
"offline page and report error" and similar high-level decisions by leaving
the details to the kernel instead of looking at every possible error in
userspace and returning back to the kernel to state your decision.

> Similar cases like CPU offlining on repeated errors are done in userspace (e.g.
> RAS Daemon) for similar reasons of long term data gathering and potentially
> complex algorithms.
>   
> > 
> > > Ok. Then can we just drop the range discoverability entirely or we go with
> > > your suggestion and do not support read back of what has been
> > > requested but instead have the reads return a range if known or "" /
> > > return -EONOTSUPP if simply not known?  
> > 
> > Probably.
> 
> Too many options in the above paragraph so just to check...  Probably to which?
> If it's a separate attribute from the one we write the control to, then
> we do what is already done here and don't present the interface at all if
> the range isn't discoverable.

Probably means I still don't get a warm and fuzzy feeling about this design.
As I've noted above.

> Ok. Best path is drop the available range support then (so no min_ max_ or
> anything to replace them for now).
> 
> Added bonus is we don't have to rush this conversation and can make sure we
> come to the right solution driven by use cases.

Yap, that sounds like a prudent idea.

What I'm trying to say, basically, is, this interface through sysfs is
a *lot* of attributes, there's no clear cut use case where we can judge how
useful it is and as I alluded to above, I really think that you should leave
the high-level decisions to userspace and let the kernel do its job.

This'll make your interface a lot simpler.

And if you really need to control every single aspect of scrubbing in
userspace, then you can always come later with proper design and use case.

But again, I really think you should keep as much recovery logic in the kernel
and as automatic as possible. Only when you really really need user input,
only then should you allow it...

I hope I'm making sense here...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-11 17:12                   ` Borislav Petkov
@ 2025-01-13 11:07                     ` Jonathan Cameron
  2025-01-21 16:16                       ` Borislav Petkov
  0 siblings, 1 reply; 87+ messages in thread
From: Jonathan Cameron @ 2025-01-13 11:07 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Sat, 11 Jan 2025 18:12:43 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Thu, Jan 09, 2025 at 06:34:48PM +0000, Jonathan Cameron wrote:
> > Today you can.  Seems we are talking at cross purposes.
> > 
> > I'm confused. I thought your proposal was for "bank" attribute to present an
> > allowed range on read.
> > "bank" attribute is currently written to and read back as the value of the bank on which
> > to conduct a repair.  Maybe this disconnect is down to the fact max_ and min_
> > attributes should have been marked as RO in the docs. They aren't controls,
> > just presentation of limits to userspace.
> > 
> > Was the intent a separate bank_range type attribute rather than max_bank, min_bank?
> 
> I don't know - I'm just throwing ideas out there. You could do:
> 
> cat /sys/.../bank
> 
> and that gives you
> 
> [<low> <current_value> <high>]
> 
> So you have all the needed information. Dunno if this would be abusing sysfs
> rules too much tho.

We could do that, a trade-off between file count and parsing complexity, but
given I don't expect anyone to parse the reads anyway, sure.

If we do go that way I guess we'd have things like
<0> 33 <123>
33

To allow for cases where the range is known vs not known? Or
<?> 33 <?> maybe?

We can do that if you prefer.  I'm not that fussed how this is handled
because, for tooling at least, I don't see why we'd ever read it.
It's for human parsing only and the above is fine.
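Either form would be trivial to parse should tooling ever need to (the format below is only the proposal floated in this thread, not an existing ABI):

```python
# Parser for the range-annotated attribute format proposed above (purely
# hypothetical): a read returns "<low> value <high>" when the range is
# known, "<?> value <?>" when it is not, or a bare "value".

def parse_attr(text):
    parts = text.split()
    if len(parts) == 1:
        return {"value": int(parts[0]), "low": None, "high": None}
    low, value, high = parts
    def bound(tok):
        tok = tok.strip("<>")
        return None if tok == "?" else int(tok)
    return {"value": int(value), "low": bound(low), "high": bound(high)}

assert parse_attr("<0> 33 <123>") == {"value": 33, "low": 0, "high": 123}
assert parse_attr("33") == {"value": 33, "low": None, "high": None}
assert parse_attr("<?> 33 <?>") == {"value": 33, "low": None, "high": None}
```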

> 
> > >   
> > > > In at least the CXL case I'm fairly sure most of them are not discoverable.
> > > > Until you see errors you have no idea what the memory topology is.    
> > > 
> > > Ok.
> > >   
> > > > For that you'd need to have a path to read back what happened.    
> > > 
> > > So how is this scrubbing going to work? You get an error, you parse it for all
> > > the attributes and you go and write those attributes into the scrub interface
> > > and it starts scrubbing?  
> > 
> > Repair, not scrubbing. They are different things we should keep separate:
> > scrub corrects the value, if it can, but doesn't swap the underlying memory
> > for new memory cells to avoid repeated errors. Replacing scrub with repair
> > (which I think was the intent here)...
> 
> Really?
> 
> So how is scrubbing defined for CXL? You read memory, do ECC check on it,
> report any potential errors but write back the *original* wrong value?!

What you describe is correction, not repair. Of course it corrects!
These two terms mean different things in the CXL world and in DDR specs etc.
We should have made that distinction clearer as perhaps that is the root
of this struggle to reach agreement.

> 
> I thought the point of scrubbing is to repair it while at it too...

No. There is a major difference between correction (which scrubbing does)
and repair.

The point of scrubbing is to correct errors, if it can, by using (typically)
the single error correction of SECDED ECC codes (single correct, double detect).
In many cases it reports the error and uncorrectable errors it runs into.
This is common for all scrub implementations not just CXL ones.

What happens with those error reports differs between systems. This is
not CXL specific. I'm aware of other systems that make repair an OS managed
thing - we have another driver queued up behind this set that does exactly
that.

In many existing servers (I know some of ours do this for example), the
firmware / BMC etc keep track of those errors and make the decision to
repair the memory. It does not do this on detection of one correctable
error, it does it after a repeated pattern.

Repair can be a feature of the DIMMs themselves or it can be a feature
of the memory controller. It is basically replacing the failing cells with spare
memory from somewhere else (usually elsewhere on the same DIMMs, which have
a bit of spare capacity for this).  A bit like a hot spare in a RAID setup.

In some other systems the OS gets the errors and is responsible for making
the decision. Sticking to the corrected error case (uncorrected handling
is going to require a lot more work given we've lost data, Dan asked about that
in the other branch of the thread), the OS as a whole (kernel + userspace)
gets the error records and makes the policy decision to repair based on
assessment of risk vs resource availability to make a repair.

Two reasons for this:
1) Hardware isn't necessarily capable of repairing autonomously as
   other actions may be needed (memory traffic to some granularity of
   memory may need to be stopped to avoid timeouts). Note there are many
   gradations of this, from A) can do it live with no visible effect, through
   B) offline a page, to C) offlining the whole device.
2) Policy can be a lot more sophisticated than a BMC can do.

> 
> > You get error records that describe the error seen in hardware, write back the
> > values into this interface and tell it to repair the memory.  This is not
> > necessarily a synchronous or immediate thing - instead typically based on
> > trend analysis.  
> 
> This is just silly: I'm scrubbing, I found an error, I should simply fix it
> while at it. Why would I need an additional command to repair it?!

This is confusing correction with repair.  They are different operations.
Correction of course happens (if we can - normally single-bit errors only).
No OS involvement in scrub-based correction.
Repair is normally only for repeated errors.   If the memory is fine and
we get a cosmic ray or similar flipping a bit there is no long term
increased likelihood of seeing another error.  If we get one every X
hours then it is highly likely the issue is something wrong with the memory.

> 
> > As an example, the decision might be that bit of ram threw up 3 errors
> > over a month including multiple system reboots (for other reasons) and
> > that is over some threshold so we use a spare memory line to replace it.  
> 
> Right.
> 
> > Short answer, it needs to be very smart and there isn't a case of one size
> > fits all - hence the suggested approach of making it a userspace problem.
> 
> Making it a userspace problem is pretty much always a sign that the hw design
> failed.

In some cases perhaps, but another very strong driver is that policy is involved.

We can either try to put a complex design in firmware and poke it with N opaque
parameters from a userspace tool or via some out of band method or we can put
the algorithm in userspace where it can be designed to incorporate lessons learnt
over time.  We will start simple and see what is appropriate as this starts
to get used in large fleets.  This stuff is a reasonable target for AI type
algorithms etc that we aren't going to put in the kernel.

Doing this at all is a reliability optimization; normally it isn't required for
correct operation.

> 
> > Given that, in the systems being considered here, software is triggering the
> > repair, we want to allow for policy in the decision.
> 
> Right, you can leave a high-level decision to userspace: repair only when
> idle, repair only non-correctable errors, blabla but exposing every single
> aspect of every single error... meh.

Those aspects identify the error location. As I put in my reply to Dan, a device
does not necessarily have a fixed mapping to a device physical address over
reboots etc.  There are solutions that scramble that mapping on reboot
or hotplug for various reasons (security and others).

For CXL at least these are all in userspace already (RAS-daemon has supported
them for quite a while).  The same data is coming in a CPER record for other
memory errors though I'm not sure it is currently surfaced to RAS Daemon.


> 
> > In simple cases we could push that policy into the kernel e.g. just repair
> > the moment we see an error record.
> > 
> > These repair resources are very limited in number, so immediately repairing
> > may be a bad idea. We want to build up a history of errors before making
> > such a decision.  That can be done in kernel.   
> 
> Yap, we are doing this now:
> 
> drivers/ras/cec.c
> 
> Userspace involvement is minimal, if at all. It is mostly controlling the
> parameters of the leaky bucket.

Offline has no permanent cost and no limit on the number of times you can
do it. Repair is definitely a limited resource and may permanently use
up that resource (which is discoverable, as a policy wants to know that too!).
In some cases once you run out of repair resources you have to send an
engineer to replace the memory before you can do it again.

So to me, these are different situations requiring different solutions.

> 
> > The decision to repair memory is heavily influenced by policy and time considerations
> > against device resource constraints.
> > 
> > Some options that are hard to do in kernel.
> > 
> > 1. Typical asynchronous error report for a corrected error.
> > 
> >    Tells us memory had an error (perhaps from a scrubbing engine on the device
> >    running checks). No need to take action immediately. Instead build up more data
> >    over time and if lots of errors occur make the decision to repair, as now we are sure it
> >    is worth doing rather than a single random event. We may tune scrubbing engines
> >    to check this memory more frequently and adjust our data analysis to take that
> >    into account for setting thresholds etc.  
> 
> See above.
> 
> What happens when your daemon dies and loses all that collected data?

It is always logged - in rasdaemon, either to a local sqlite DB, or people are carrying
patches to do fleet-scale logging.  This works across reboots, so a daemon dying is
part of the design rather than just an error case.
If the daemon is able to corrupt your error logs then we have bigger problems.

> 
> > 2. Soft repair across boots.  We are actually storing the error records, then only
> >    applying the fix on reboot before using the memory - so maintaining a list
> >    of bad memory and saving it to a file to read back on boot. We could provide
> >    another kernel interface to get this info and reinject it after reboot instead
> >    of doing it in userspace but that is another ABI to design.  
> 
> We did something similar recently: drivers/ras/amd/fmpm.c. It basically
> "replays" errors from persistent storage as that memory cannot be replaced.

Ok. I guess it is an option (I wasn't aware of that work).

I was thinking that was far more complex to deal with than just doing it in
userspace tooling. From a quick look, that solution seems to rely on ACPI ERST
infrastructure to provide the persistence that we won't generally have, but
I suppose we can read it from the filesystem or other persistent stores.
We'd need to be a lot more general about that, as we can't make the system
assumptions that can be made in AMD-specific code.

So it could be done. I don't think it is a good idea in this case, but that
example does suggest it is possible.
 
> > 3. Complex policy across fleets.  A lot of work is going on around prediction techniques
> >    that may change the local policy on each node dependent on the overall reliability
> >    patterns of a particular batch of devices and local characteristics, service guarantees
> >    etc. If it is hard repair, then once you've run out you need schedule an engineer
> >    out to replace the DIMM. All complex inputs to the decision.  
> 
> You probably could say here: "repair or report when this and that." or
> "offline page and report error" and similar high-level decisions by leaving
> the details to the kernel instead of looking at every possible error in
> userspace and returning back to the kernel to state your decision.

In the approach we are targeting, there is no round-trip situation.  We let the kernel
deal with any synchronous error just fine and run its existing logic
to offline problematic memory.  That needs to be timely and to carry on operating
exactly as it always has.

In parallel with that, we gather the error reports that we will already be
gathering and run analysis on those.  From that we decide whether a memory location
is likely to fail again and perform a sparing operation if appropriate.
Effectively this is 'free'. All the information is already there in userspace
and already understood by tools like rasdaemon; we are not expanding that
reporting interface at all.

Or we decide not to, and the memory continues to be used as before.  Sure, there
may be a higher chance of error, but there isn't one right now.  (Or there is, and
we have offlined the memory anyway via the normal mechanisms, so no problem.)

There is an interesting future exploration to be done on recovery from poison
(uncorrectable error) situations, but that is not in scope of this initial
code and that would likely be an asynchronous best effort process as well.
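
The 'gather and decide' loop described above can be sketched as a simple
leaky-bucket policy, purely to illustrate the shape of the userspace decision
logic. The threshold and leak rate here are made-up values, not from any real
deployment, and the location key mirrors the topology fields in the trace records:

```python
class LeakyBucket:
    """Per-memory-location corrected-error counter that leaks over
    time; a repair is only suggested once sustained error activity
    pushes the level over the threshold."""
    def __init__(self, threshold=8, leak_per_hour=1.0):
        self.threshold = threshold
        self.leak_per_hour = leak_per_hour
        self.level = 0.0
        self.last = None

    def record_error(self, now):
        if self.last is not None:
            hours = (now - self.last) / 3600.0
            self.level = max(0.0, self.level - hours * self.leak_per_hour)
        self.last = now
        self.level += 1
        return self.level > self.threshold   # True => worth sparing

buckets = {}   # keyed by (rank, bank, row) from the trace record

def on_corrected_error(location, now):
    bucket = buckets.setdefault(location, LeakyBucket())
    return bucket.record_error(now)

# Demo: nine errors in quick succession on one row trip the policy.
loc = ("rank0", "bank2", "row0x1a3f")
decisions = [on_corrected_error(loc, now=1000.0 + i) for i in range(9)]
print(decisions[-1])  # True
```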


> 
> > Similar cases like CPU offlining on repeated errors are done in userspace (e.g.
> > RAS Daemon) for similar reasons of long term data gathering and potentially
> > complex algorithms.
> >     
> > >   
> > > > Ok. Then can we just drop the range discoverability entirely or we go with
> > > > your suggestion and do not support read back of what has been
> > > > requested but instead have the reads return a range if known or "" /
> > > > return -EONOTSUPP if simply not known?    
> > > 
> > > Probably.  
> > 
> > Too many options in the above paragraph so just to check...  Probably to which?
> > If it's a separate attribute from the one we write the control so then
> > we do what is already done here and don't present the interface at all if
> > the range isn't discoverable.  
> 
> Probably means I still don't get a warm and fuzzy feeling about this design.
> As I've noted above.

Ok. I was confused, as the paragraph asked an (A) or (B) question.

> 
> > Ok. Best path is drop the available range support then (so no min_ max_ or
> > anything to replace them for now).
> > 
> > Added bonus is we don't have to rush this conversation and can make sure we
> > come to the right solution driven by use cases.  
> 
> Yap, that sounds like a prudent idea.
> 
> What I'm trying to say, basically, is, this interface through sysfs is
> a *lot* of attributes, there's no clear cut use case where we can judge how
> useful it is and as I alluded to above, I really think that you should leave
> the high-level decisions to userspace and let the kernel do its job.
> 
> This'll make your interface a lot simpler.

Ok.  It seems you correlate the number of files with complexity.
I correlate the difficulty of understanding those files with complexity.
Every one of the files is clearly defined and aligned with the long history
of how to describe DRAM (see how long CPER records have used these
fields, for example - they go back to the beginning).

1) It is only 8 files (if we drop the range info).
   I don't consider that a wide enough interface for complexity to be a concern.
2) Moving the control into the kernel will require a control interface, as
   there will be multiple tuneables for those decisions. Like most such
   algorithm-parameter questions, we may spend years arguing about what those
   look like.  Sure, that problem also occurs in userspace, but we aren't
   obliging people to use what we put in rasdaemon.
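
For illustration, driving one repair through such a file-per-field interface
would look roughly like this. The directory and attribute names (`channel`,
`rank`, `bank`, `row`, `column`, plus a `repair` trigger) are assumptions
mirroring the DRAM-topology fields in the error records, not the final ABI:

```python
import tempfile
from pathlib import Path

def request_repair(repair_dir, record):
    """Write topology fields copied from an error trace record into
    per-field attributes, then poke the trigger file.  On a real
    system repair_dir would be something like
    /sys/bus/edac/devices/<dev>/mem_repairX (hypothetical path)."""
    d = Path(repair_dir)
    for attr in ("channel", "rank", "bank", "row", "column"):
        (d / attr).write_text(str(record[attr]))
    (d / "repair").write_text("1")   # trigger the actual operation

# Demo against a scratch directory standing in for sysfs.
tmp = tempfile.mkdtemp()
for name in ("channel", "rank", "bank", "row", "column", "repair"):
    (Path(tmp) / name).touch()
request_repair(tmp, {"channel": 0, "rank": 1, "bank": 2,
                     "row": 0x1a3f, "column": 7})
row_written = (Path(tmp) / "row").read_text()
print(row_written)  # 6719
```

Note that userspace only copies fields verbatim from the trace record; it needs
no understanding of what they mean, which is the point made above.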

> 
> And if you really need to control every single aspect of scrubbing in
> userspace, then you can always come later with proper design and use case.

I'm all in favor of building an interface up by providing the minimum first
and then adding to it, but what is proposed here is the minimum for basic
functionality, and the alternative of doing the whole thing in the kernel both
puts complexity in the wrong place and restricts what is possible.
See below for a proposal that may let us make some progress.

> 
> But again, I really think you should keep as much recovery logic in the kernel
> and as automatic as possible. Only when you really really need user input,
> only then you should allow it...

This is missing the key consideration that resource availability is
limited, and it does not address the main use case here. This is not, and has
never been, about the logic to recover from an error. It is about reducing
the possibility of repeated errors.

> 
> I hope I'm making sense here...

To some degree, but I think there is a major mismatch in what we think
this is for.

What I've asked Shiju to look at is splitting the repair infrastructure
into two cases so that maybe we can make partial progress:

1) Systems that support repair by Physical Address
   - Covers Post Package Repair for CXL

2) Systems that support repair by description of the underlying hardware
   - Covers Memory Sparing interfaces for CXL.

We need both longer term anyway, but maybe 1 is less controversial, simply
on the basis that it has fewer control parameters.

This still fundamentally puts the policy in userspace where I
believe it belongs.

Jonathan



> 
> Thx.
> 


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-10 22:49                       ` Dan Williams
@ 2025-01-13 11:40                         ` Jonathan Cameron
  2025-01-14 19:35                           ` Dan Williams
  0 siblings, 1 reply; 87+ messages in thread
From: Jonathan Cameron @ 2025-01-13 11:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: Borislav Petkov, Shiju Jose, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	tony.luck@intel.com, rafael@kernel.org, lenb@kernel.org,
	mchehab@kernel.org, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Fri, 10 Jan 2025 14:49:03 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > On Thu, 9 Jan 2025 15:51:39 -0800
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >   
> > > Jonathan Cameron wrote:  
> > > > Ok. Best path is drop the available range support then (so no min_ max_ or
> > > > anything to replace them for now).    
> > > 
> > > I think less is more in this case.  
> > 
> > A few notes before I get to specific questions.
> > 
> > Key in the discussion that follows is that the 'repair' is a separate
> > from the 'decision to repair'.  They mostly need different information
> > all of which is in the error trace points. A lot of the questions
> > are about the 'decision to repair' part not the repair itself.
> >   
> [snipped the parts I agree with]
> > I'd advise mostly ignoring PPR and looking at memory sparing in
> > the CXL spec if you want to consider an example. For PPR DPA is used
> > (there is an HPA option that might be available). DPA is still needed
> > for on boot soft repair (or we could delay that until regions configured,
> > but then we'd need to do DPA to HPA mapping as that will be different
> > on a new config, and then write HPA for the kernel to map it back to DPA.  
> 
> This is helpful because I was indeed getting lost in what kind of
> "repair" was being discussed in the thread. Ok, lets focus on sparing
> commands.
> 
> >   
> > > 
> > > The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too
> > > wide for userspace to have a chance at writing a competent tool. At
> > > least I am struggling with where to even begin with those ABIs if I was
> > > asked to write a tool. Does a tool already exist for those?  
> > 
> > There is little choice on that - those are the controls for this type
> > of repair. If we do something like a magic 'key' based on a concatenation
> > of those values we need to define that description to replace a clean
> > self describing interface. I'm not 100% against that but I think it would
> > be a terrible interface design and I don't think anyone is really in favor of it.
> > 
> > All a userspace tool does is read the error record fields of
> > exactly those names.  From that it will log data (already happening under
> > those names in RAS daemon alongside HPA/ DPA).  Then, in simplest case,
> > a threshold is passed and we write those values to the repair interface. 
> > 
> > There is zero need in that simple case for these to be understood at all.  
> 
> This is where you lose me. The error record is a point in time snapshot
> of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security
> model for memory operations is based on coordinating with the kernel's
> understanding of how that SPA is *currently* being used.

Whilst it is being used, I agree.  The key is to only do disruptive /
data-changing actions when it is not being used.

> 
> The kernel can not just take userspace's word for it that potentially
> data changing, or temporary loss-of-use operations are safe to execute
> just because once upon a time userspace saw an error record that
> implicated a given SPA in the past, especially over reboot. 

There are two cases (discoverable from hardware)

1) Non-disruptive.  No security concern, as the device guarantees
   not to interrupt traffic, and the memory contents are copied to the new
   location. Basically software never knows it happened.
2) Disruptive.  We only allow this if the memory is offline. In the CXL case
   the CXL specific code must check no memory on the device is online so
   we aren't disrupting anything.  The other implementation we have code
   for (will post after this lands) has finer granularity constraints and only
   the page needs to be offline.
   As it is offline the content is not preserved anyway. We may need to add extra
   constraints along with future support for temporal persistence / sharing but
   we can do that as part of adding that support in general.
   (Personally I think in those cases memory repair is a job for the out of
    band management anyway).

In neither case am I seeing a security concern.  Am I missing something?
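
The rule in the two cases above boils down to a small predicate, sketched here.
The scope argument reflects the granularity difference noted above (whole device
for CXL sparing, single page for the finer-grained implementation); the names
are illustrative only, not a proposed kernel API:

```python
def repair_allowed(non_disruptive, scope_online):
    """Decide whether a repair request may proceed.

    non_disruptive: the device guarantees traffic is not interrupted
                    and contents are preserved (discoverable from HW).
    scope_online:   True if any memory in the repair scope is online -
                    the whole device for CXL sparing, or a single page
                    for a finer-grained implementation.
    """
    if non_disruptive:
        return True          # software never notices; always safe
    return not scope_online  # disruptive repair needs offline memory

print(repair_allowed(False, True))   # False: would disrupt live memory
```

Under this rule the kernel never has to trust userspace's claim about which
memory is bad; it only has to verify the current online state.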

> 
> The SPA:HPA:DPA:DIMM tuple is invalidated on reconfiguration and reboot
> events. It follows that the kernel has a security/integrity interest in
> declining to act on invalidated tuples. This is solvable within a single
> boot as the kernel can cache the error records and userspace can ask to
> "trigger sparing based on cached record X".

The above rules remove this complexity.  Either it is always safe by
device design, or the memory is offline and we'll zero fill it anyway
so no security concern.

> 
> For the reboot case when the error record cache is flushed the kernel
> needs a reliable way to refill that cache, not an ABI for userspace to
> say "take my word for it, this *should* be safe".

It is safe because of 1 and 2 above we are not editing data in use
except in a fashion that the device guarantees is safe.

If you don't trust the device on this you have bigger problems.

> 
> [snipped the explanation of replaying the old trace record parameters
> data back through sysfs, because that is precisely the hang up I have
> with the proposal]
> 
> >   
> > > 
> > > Some questions that read on those ABIs are:
> > > 
> > > 1/ What if the platform has translation between HPA (CXL decode) and SPA
> > > (physical addresses reported in trace points that PIO and DMA see)?  
> > 
> > See later for discussion of other interfaces.. This is assuming the
> > repair key is not HPA (it is on some systems / situations) - if it's
> > the repair key then that is easily dealt with.
> > 
> > HPA / SPA more or less irrelevant for repair itself, they are relevant
> > for the decision to repair. In the 'on reboot' soft repair case they may
> > not even exist at the time of repair as it would be expected to happen
> > before we've brought up a region (to get the RAM into a good state at boot).
> > 
> > For cases where the memory decoders are configured and so there is an HPA
> > to DPA mapping:
> > The trace reports provide both all these somewhat magic values and
> > the HPA.  Thus we can do the HPA aware stuff on that before then looking
> > up the other bit of the appropriate error reports to get the bank row etc.
> >   
> > > 
> > > 2/ What if memory is interleaved across repair domains?   
> > 
> > Also not relevant to a repair control and only a little relevant to the
> > decision to repair.  The domains would be handled separately but if
> > we are have to offline a chunk of memory to do it (needed for repair
> > on some devices) that may be a bigger chunk if fine grained interleave
> > in use and that may affect the decision.  
> 
> Again, the repair control assumes that the kernel can just trust
> userspace to get it right. When the kernel knows the SPA implications it
> can add safety like "you are going to issue sparing on deviceA that will
> temporarily take deviceA offline. CXL subsystem tells me deviceA is
> interleaved with deviceB in SPA so the whole SPA range needs to be
> offline before this operation proceeds". That is not someting that
> userspace can reliably coordinate.

Absolutely, the kernel has to enforce this - the same way we protect against
poison injection in some cases.  Right now the enforcement is slightly
wrong (Shiju is looking at it again), as we were enforcing at the wrong
granularity (specific DPA, not device). Identifying that hole is a good
outcome of this discussion, making us take another look.

Enforcing this is one of the key jobs of the CXL specific driver.
We considered doing it in the core, but the granularity differences
between our initial few examples meant we decided on specific driver
implementations of the checks for now.

> 
> > > 3/ What if the device does not use DDR terminology / topology terms for
> > > repair?  
> > 
> > Then we provide the additional interfaces assuming the correspond to well
> > known terms.  If someone is using a magic key then we can get grumpy
> > with them, but that can also be supported.
> > 
> > Mostly I'd expect a new technology to overlap a lot of the existing
> > interface and maybe add one or two more; which layer in the stack for
> > HBM for instance.  
> 
> The concern is the assertion that sysfs needs to care about all these
> parameters vs an ABI that says "repair errorX". If persistence and
> validity of error records is the concern lets build an ABI for that and
> not depend upon trust in userspace to properly coordinate memory
> integrity concerns.

It doesn't have to.  It just has to ensure that the memory device is in the correct
state.  So check, not coordinate. At a larger scale, coordination is already doable
(subject to races that we must avoid by holding locks): tear down the regions
so there are no mappings on the device you want to repair, and don't bring them
up again until after you are done.

The main use case is probably do it before you bring the mappings up, but
same result.

> 
> > 
> > The main alternative is where the device takes an HPA / SPA / DPA. We have one
> > driver that does that queued up behind this series that uses HPA. PPR uses
> > DPA.  In that case userspace first tries to see if it can repair by HPA then
> > DPA and if not moves on to see if it it can use the fuller description.
> > We will see devices supporting HPA / DPA (which to use depends on when you
> > are doing the operation and what has been configured) but mostly I'd expect
> > either HPA/DPA or fine grained on a given repair instance.
> > 
> > HPA only works if the address decoders are always configured (so not on CXL)
> > What is actually happening in that case is typically that a firmware is
> > involved that can look up address decoders etc, and map the control HPA
> > to Bank / row etc to issue the actual low level commands.  This keeps
> > the memory mapping completely secret rather than exposing it in error
> > records.
> >   
> > > 
> > > I expect the flow rasdaemon would want is that the current PFA (leaky
> > > bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> > > has performed exceeds some threshold and it wants to attempt to repair
> > > memory.  
> > 
> > Sparing may happen prior to point where we'd have done a soft offline
> > if non disruptive (whether it is can be read from another bit of the
> > ABI).  Memory repair might be much less disruptive than soft-offline!
> > I rather hope memory manufacturers build that, but I'm aware of at least
> > one case where they didn't and the memory must be offline.  
> 
> That's a good point, spare before offline makes sense.

If it is transparent and resources are not constrained, yes.
Very much not if we have to tear down the memory first.

> 
> [..]
> > However, there are other usecases where this isn't needed which is why
> > that isn't a precursor for this series.
> > 
> > Initial enablement targets two situations:
> > 1) Repair can be done in non disruptive way - no need to soft offline at all.  
> 
> Modulo needing to quiesce access over the sparing event?

Absolutely.  This is only doable in devices that don't need to quiesce.

> 
> > 2) Repair can be done at boot before memory is onlined or on admin
> >    action to take the whole region offline, then repair specific chunks of
> >    memory before bringing it back online.  
> 
> Which is userspace racing the kernel to online memory?

If you are doing this scheme you don't automatically online memory, so
both are under userspace control and can be easily sequenced.
If you aren't auto-onlining, then buy devices with hard PPR and do it by offlining
manually, repairing and rebooting. Or buy devices that don't need to quiesce
and cross your fingers that the dodgy RAM doesn't throw an error before you get
that far.  There is little choice if you decide to online it right at the start as
normal memory.
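
Sequenced in userspace, the boot-time flow above is roughly: leave auto-onlining
off, replay the saved repair records, then online the memory. A sketch, with the
record store and sysfs paths as stand-ins (hypothetical names, not a real ABI):

```python
import json
import tempfile
from pathlib import Path

def boot_time_repair(record_file, repair_node, memory_block):
    """Replay saved soft-repair records before the memory is used.

    record_file:  JSON list of repair records saved by the daemon on
                  a previous boot.
    repair_node:  stand-in for the per-device repair interface.
    memory_block: stand-in for /sys/devices/system/memory/memoryN/state.
    """
    records = json.loads(Path(record_file).read_text())
    for rec in records:             # memory is still offline here
        # A real ABI would be per-field writes; one write keeps the
        # sketch short.
        Path(repair_node).write_text(json.dumps(rec))
    Path(memory_block).write_text("online")   # only now allow use

# Demo against scratch files standing in for the store and sysfs.
tmp = tempfile.mkdtemp()
recs = Path(tmp, "bad_rows.json")
recs.write_text(json.dumps([{"rank": 0, "bank": 2, "row": 100}]))
repair = Path(tmp, "repair"); repair.touch()
state = Path(tmp, "state"); state.write_text("offline")
boot_time_repair(recs, repair, state)
print(state.read_text())  # online
```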

> 
> > > So, yes, +1 to simpler for now where software effectively just needs to
> > > deal with a handful of "region repair" buttons and the semantics of
> > > those are coarse and sub-optimal. Wait for a future where a tool author
> > > says, "we have had good success getting bulk offlined pages back into
> > > service, but now we need this specific finer grained kernel interface to
> > > avoid wasting spare banks prematurely".  
> > 
> > Depends on where you think that interface is.  I can absolutely see that
> > as a control to RAS Daemon.  Option 2 above, region is offline, repair
> > all dodgy looking fine grained buckets.
> > 
> > Note though that a suboptimal repair may mean permanent use of very rare
> > resources.  So there needs to be a control a the finest granularity as well.
> > Which order those get added to userspace tools doesn't matter to me.
> > 
> > If you mean that interface in kernel it brings some non trivial requirements.
> > The kernel would need all of:
> > 1) Tracking interface for all error records so the reverse map from region
> >    to specific bank / row etc is available for a subset of entries.  The
> >    kernel would need to know which of those are important (soft offline
> >    might help in that use case, otherwise that means decision algorithms
> >    are in kernel or we have fine grained queue for region repair in parallel
> >    with soft-offline).
> > 2) A way to inject the reverse map information from a userspace store
> >   (to deal with reboot etc).  
> 
> Not a way to inject the reverse map information, a way to inject the
> error records and assert that memory topology changes have not
> invalidated those records.

There is no way to tell that the topology hasn't changed.
For the reasons above, I don't think we care. Instead of trying to stop
userspace repairing the wrong memory, make sure it is safe for it to do so.
(The kernel is rarely in the business of preventing the slightly stupid.)

> 
> > That sounds a lot harder to deal with than relying on the usespace program
> > that already does the tracking across boots.  
> 
> I am stuck behind the barrier of userspace must not assume it knows
> better than the kernel about the SPA impact of a DIMM sparing
> event. The kernel needs evidence either live records from within the
> same kernel boot or validated records from a previous boot.

I think this is the wrong approach.  The operation must be 'safe'.
With that in place we absolutely can let userspace assume it knows better than
the kernel. 

> 
> ...devices could also help us out here with a way to replay DIMM error
> events. That would allow for refreshing error records even if the
> memory topology change because the new record would generate a refreshed
> SPA:HPA:DPA:DIMM tuple.

Maybe, but I'm not seeing the necessity.

> 
> > > Anything more complex than a set of /sys/devices/system/memory/
> > > devices has a /sys/bus/edac/devices/devX/repair button, feels like a
> > > generation ahead of where the initial sophistication needs to lie.
> > > 
> > > That said, I do not closely follow ras tooling to say whether someone
> > > has already identified the critical need for a fine grained repair ABI?  
> > 
> > It's not that we necessarily want to repair at fine grain, it's that
> > the control interface to hardware is fine grained and the reverse mapping
> > often unknown except for specific error records.
> > 
> > I'm fully on board with simple interfaces for common cases like repair
> > the bad memory in this region.  I'm just strongly against moving the
> > complexity of doing that into the kernel.  
> 
> Yes, we are just caught up on where that "...but no simpler" line is
> drawn.
> 
Sure.  For now, I've proposed we split the two cases.
1) HPA / DPA repair (PPR)
2) Memory topology based repair (Sparing)

If we can make progress on (1) perhaps we can come to a conclusion on what
is required.

Note that, so far, I see no reason why either should do any checking against
errors observed by the kernel, given the security guarantees above.
Userspace can repair the wrong bit of memory. That's pointless and burns
limited resources, but it is not a security problem.

Jonathan



* Re: [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (18 preceding siblings ...)
  2025-01-06 12:10 ` [PATCH v18 19/19] cxl/memfeature: Add CXL memory device memory sparing " shiju.jose
@ 2025-01-13 14:46 ` Mauro Carvalho Chehab
  2025-01-13 15:36   ` Jonathan Cameron
  2025-01-13 18:15   ` Shiju Jose
  2025-01-30 19:18 ` Daniel Ferguson
  20 siblings, 2 replies; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-13 14:46 UTC (permalink / raw)
  To: shiju.jose
  Cc: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel, bp,
	tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

Em Mon, 6 Jan 2025 12:09:56 +0000
<shiju.jose@huawei.com> escreveu:

> From: Shiju Jose <shiju.jose@huawei.com>
> 
> Previously known as "ras: scrub: introduce subsystem + CXL/ACPI-RAS2 drivers".
> 
> Augmenting EDAC for controlling RAS features
> ============================================
> The proposed expansion of EDAC for controlling RAS features and
> exposing features control attributes to userspace in sysfs.
> Some Examples:
>  - Scrub control
>  - Error Check Scrub (ECS) control
>  - ACPI RAS2 features
>  - Post Package Repair (PPR) control
>  - Memory Sparing Repair control etc.
> 
> High level design is illustrated in the following diagram.
>  
>          _______________________________________________
>         |   Userspace - Rasdaemon                       |
>         |  ____________                                 |
>         | | RAS CXL    |       _______________          | 
>         | | Err Handler|----->|               |         |
>         | |____________|      | RAS Dynamic   |         |
>         |  ____________       | Scrub, Memory |         |
>         | | RAS Memory |----->| Repair Control|         |
>         | | Err Handler|      |_______________|         |
>         | |____________|           |                    |
>         |__________________________|____________________|                              
>                                    |
>                                    |
>     _______________________________|______________________________
>    |   Kernel EDAC based SubSystem | for RAS Features Control     |
>    | ______________________________|____________________________  |
>    || EDAC Core          Sysfs EDAC| Bus                        | |
>    ||    __________________________|________ _    _____________ | |
>    ||   |/sys/bus/edac/devices/<dev>/scrubX/ |   | EDAC Device || |
>    ||   |/sys/bus/edac/devices/<dev>/ecsX/   |<->| EDAC MC     || |
>    ||   |/sys/bus/edac/devices/<dev>/repairX |   | EDAC Sysfs  || |
>    ||   |____________________________________|   |_____________|| |
>    ||                               | EDAC Bus                  | |
>    ||               Get             |       Get                 | |
>    ||    __________ Features       |   Features __________    | |

NIT: there is a misalignment here.

>    ||   |          |Descs  _________|______ Descs|          |   | |
>    ||   |EDAC Scrub|<-----| EDAC Device    |     | EDAC Mem |   | |
>    ||   |__________|      | Driver- RAS    |---->| Repair   |   | |
>    ||    __________       | Feature Control|     |__________|   | |
>    ||   |          |<-----|________________|                    | |
>    ||   |EDAC ECS  |   Register RAS | Features                  | |
>    ||   |__________|                |                           | |
>    ||         ______________________|_________                  | |
>    ||_________|_____________|________________|__________________| |
>    |   _______|____    _____|_________   ____|_________           |
>    |  |            |  | CXL Mem Driver| | Client Driver|          |
>    |  | ACPI RAS2  |  | Sparing, PPR, | | Mem Repair   |          |
>    |  | Driver     |  | Scrub, ECS    | | Features     |          |
>    |  |____________|  |_______________| |______________|          |
>    |        |              |              |                       |
>    |________|______________|______________|_______________________|
>             |              |              |                     
>      _______|______________|______________|_______________________
>     |     __|______________|_ ____________|____________ ____      |
>     |    |                                                  |     |
>     |    |            Platform HW and Firmware              |     |
>     |    |__________________________________________________|     |
>     |_____________________________________________________________|                             
> 
> 1. EDAC RAS Features components - Create feature specific descriptors.
>    for example, EDAC scrub, EDAC ECS, EDAC memory repair in the above
>    diagram. 
> 2. EDAC device driver for controlling RAS Features - Get feature's attr
>    descriptors from EDAC RAS feature component and registers device's
>    RAS features with EDAC bus and expose the feature's sysfs attributes
>    under the sysfs EDAC bus.
> 3. RAS dynamic scrub controller - Userspace sample module added for scrub
>    control in rasdaemon to issue scrubbing when excess number of memory
>    errors are reported in a short span of time.
> 
> The added EDAC feature specific components (e.g. EDAC scrub, EDAC ECS,
> EDAC memory repair etc) do callbacks to  the parent driver (e.g. CXL
> driver, ACPI RAS driver etc) for the controls rather than just letting the
> caller deal with it because of the following reasons.
> 1. Enforces a common API across multiple implementations can do that
>    via review, but that's not generally gone well in the long run for
>    subsystems that have done it (several have later moved to callback
>    and feature list based approaches).
> 2. Gives a path for 'intercepting' in the EDAC feature driver.
>    An example for this is that we could intercept PPR repair calls
>    and sanity check that the memory in question is offline before
>    passing back to the underlying code.  Sure we could rely on doing
>    that via some additional calls from the parent driver, but the
>    ABI will get messier.
> 3. (Speculative) we may get in kernel users of some features in the
>    long run.
> 
> More details of the common RAS features are described in the following
> sections.
> 
> Memory Scrubbing
> ================
> Increasing DRAM size and cost has made memory subsystem reliability
> an important concern. These modules are used where potentially
> corrupted data could cause expensive or fatal issues. Memory errors are
> one of the top hardware failures that cause server and workload crashes.
> 
> Memory scrub is a feature where an ECC engine reads data from
> each memory media location, corrects with an ECC if necessary and
> writes the corrected data back to the same memory media location.
> 
> The memory DIMMs could be scrubbed at a configurable rate to detect
> uncorrected memory errors and attempts to recover from detected memory
> errors providing the following benefits.
> - Proactively scrubbing memory DIMMs reduces the chance of a correctable
>   error becoming uncorrectable.
> - Once detected, uncorrected errors caught in unallocated memory pages are
>   isolated and prevented from being allocated to an application or the OS.
> - The probability of software/hardware products encountering memory
>   errors is reduced.
> Some details of background can be found in Reference [5].
> 
> There are 2 types of memory scrubbing,
> 1. Background (patrol) scrubbing of the RAM whilst the RAM is otherwise
>    idle.
> 2. On-demand scrubbing for a specific address range/region of memory.
> 
> There are several types of interfaces to HW memory scrubbers
> identified such as ACPI NVDIMM ARS(Address Range Scrub), CXL memory
> device patrol scrub, CXL DDR5 ECS, ACPI RAS2 memory scrubbing.
> 
> The scrub controls vary between different memory scrubbers. To allow
> for standard userspace tooling, there is a need to present these controls
> through a standard ABI.
> 
> Introduce a generic EDAC memory scrub control which allows the user to
> control underlying scrubbers in the system via a generic sysfs scrub
> control interface. The common sysfs scrub control interface abstracts the
> control of arbitrary scrubbing functionality into a common set of functions.
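As a side note, the abstraction described above is straightforward to exercise
from userspace. A minimal sketch in Python, purely illustrative: the scrubX
directory layout and the min_cycle_duration/max_cycle_duration/
current_cycle_duration attribute names (in seconds) follow this series'
changelog, but the exact paths are an assumption, not a tested interface:

```python
# Hypothetical helper mirroring the generic EDAC scrub sysfs ABI sketched
# in this series. Attribute names (min_cycle_duration, max_cycle_duration,
# current_cycle_duration, all in seconds) are taken from the cover letter
# changelog; the directory layout is assumed.

def clamp_scrub_cycle(requested_s: int, min_s: int, max_s: int) -> int:
    """Clamp a requested scrub cycle to the range the device advertises."""
    return max(min_s, min(requested_s, max_s))

def set_scrub_cycle(sysfs_dir: str, requested_s: int) -> int:
    """Write a clamped scrub cycle to a scrubX sysfs directory.

    sysfs_dir is e.g. /sys/bus/edac/devices/<dev>/scrub0 (assumed layout).
    """
    def read_int(name: str) -> int:
        with open(f"{sysfs_dir}/{name}") as f:
            return int(f.read().strip(), 0)

    cycle = clamp_scrub_cycle(requested_s,
                              read_int("min_cycle_duration"),
                              read_int("max_cycle_duration"))
    with open(f"{sysfs_dir}/current_cycle_duration", "w") as f:
        f.write(str(cycle))
    return cycle
```

The clamp keeps a script from failing on devices whose supported cycle range
differs; the kernel would reject an out-of-range write anyway.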
> 
> Use case of common scrub control feature
> ========================================
> 1. There are several types of interfaces to HW memory scrubbers, such
>    as ACPI NVDIMM ARS (Address Range Scrub), CXL memory device patrol
>    scrub, CXL DDR5 ECS, ACPI RAS2 memory scrubbing and software-based
>    memory scrubbers (discussed in the community, Reference [8]).
>    Some scrubbers support controlling (background) patrol scrubbing
>    (ACPI RAS2, CXL) and/or on-demand scrubbing (ACPI RAS2, ACPI ARS).
>    However, the scrub controls vary between memory scrubbers. Thus there
>    is a requirement for standard generic sysfs scrub controls exposed
>    to userspace for seamless control of the HW/SW scrubbers in
>    the system by admins/scripts/tools etc.
> 2. Scrub controls in user space allow the user to disable scrubbing,
>    or to change the scrub rate, when needed for other purposes such as
>    performance-aware operations which require the background operations
>    to be turned off or reduced.
> 3. Allows on-demand scrubbing of a specific address range, if
>    supported by the scrubber.
> 4. User space tools that scrub the memory DIMMs regularly at a
>    configurable scrub rate, using the sysfs scrub controls discussed here,
>    help
>    - to detect uncorrectable memory errors early, before the user accesses
>      the memory, which improves the chance of recovering from them; and
>    - to reduce the chance of a correctable error becoming uncorrectable.
> 5. Policy control for hotplugged memory. There is not necessarily a system
>    wide BIOS or similar in the loop to control the scrub settings on a CXL
>    device that wasn't there at boot. What that setting should be is a policy
>    decision, as we are trading off reliability vs performance - hence it
>    should be under the control of userspace. As such, 'an' interface is
>    needed. It seems more sensible to try to unify it with other similar
>    interfaces than to spin yet another one.
> 
> A draft version of userspace code for dynamic scrub control, based on
> the frequency of memory errors reported to userspace, has been added to
> rasdaemon and tested with the CXL device patrol scrubbing feature and
> the ACPI RAS2 scrubbing feature.
> 
> https://github.com/shijujose4/rasdaemon/tree/ras_feature_control
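The dynamic control loop can be sketched as a simple policy: shorten the
scrub cycle as the recent error rate grows, bounded by the device's
advertised limits. This is illustrative only, not rasdaemon's actual
algorithm; the threshold and the halving step are assumptions:

```python
def pick_scrub_cycle(errors_last_hour: int, min_s: int, max_s: int,
                     threshold: int = 10) -> int:
    """Illustrative dynamic scrub policy: halve the scrub cycle for each
    multiple of `threshold` memory errors observed in the last hour,
    bounded by the device's advertised min/max cycle (seconds)."""
    cycle = max_s
    remaining = errors_last_hour
    while remaining >= threshold and cycle > min_s:
        cycle //= 2            # scrub more aggressively
        remaining -= threshold
    return max(min_s, cycle)   # never go below the device minimum
```

With no errors the device idles at the longest (cheapest) cycle; a burst of
errors drives it toward the minimum cycle, after which the policy saturates.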
> 
> ToDo: For memory repair features such as PPR and memory sparing, rasdaemon
> collates records and decides to replace a row if there are lots of
> corrected errors, a single uncorrected error, or an error record received
> with the maintenance request flag set, as in some CXL event records.
> 
> Comparison of scrubbing features
> ================================
>  ................................................................
>  .              .   ACPI    . CXL patrol.  CXL ECS  .  ARS      .
>  .  Name        .   RAS2    . scrub     .           .           .
>  ................................................................
>  .              .           .           .           .           .
>  . On-demand    . Supported . No        . No        . Supported .
>  . Scrubbing    .           .           .           .           .
>  .              .           .           .           .           .  
>  ................................................................
>  .              .           .           .           .           .
>  . Background   . Supported . Supported . Supported . No        .
>  . scrubbing    .           .           .           .           .
>  .              .           .           .           .           .
>  ................................................................
>  .              .           .           .           .           .
>  . Mode of      . Scrub ctrl. per device. per memory.  Unknown  .
>  . scrubbing    . per NUMA  .           . media     .           .
>  .              . domain.   .           .           .           .
>  ................................................................
>  .              .           .           .           .           . 
>  . Query scrub  . Supported . Supported . Supported . Supported .       
>  . capabilities .           .           .           .           .
>  .              .           .           .           .           .
>  ................................................................
>  .              .           .           .           .           . 
>  . Setting      . Supported . No        . No        . Supported .       
>  . address range.           .           .           .           .
>  .              .           .           .           .           .
>  ................................................................
>  .              .           .           .           .           . 
>  . Setting      . Supported . Supported . No        . No        .       
>  . scrub rate   .           .           .           .           .
>  .              .           .           .           .           .
>  ................................................................
>  .              .           .           .           .           . 
>  . Unit for     . Not       . in hours  . No        . No        .       
>  . scrub rate   . Defined   .           .           .           .
>  .              .           .           .           .           .
>  ................................................................
>  .              . Supported .           .           .           .
>  . Scrub        . on-demand . No        . No        . Supported .
>  . status/      . scrubbing .           .           .           .
>  . Completion   . only      .           .           .           .
>  ................................................................
>  . UC error     .           .CXL general.CXL general. ACPI UCE  .
>  . reporting    . Exception .media/DRAM .media/DRAM . notify and.
>  .              .           .event/media.event/media. query     .
>  .              .           .scan?      .scan?      . ARS status.
>  ................................................................
>  .              .           .           .           .           .      
>  . Clear UC     .  No       . No        .  No       . Supported .
>  . error        .           .           .           .           .
>  .              .           .           .           .           .  
>  ................................................................
>  .              .           .           .           .           .
>  . Translate    . No        . No        . No        . Supported .
>  . *(1)SPA to   .           .           .           .           .
>  . *(2)DPA      .           .           .           .           .  
>  ................................................................
> 
> *(1) - SPA - System Physical Address. See section 9.19.7.8
>        Function Index 5 - Translate SPA of ACPI spec r6.5.  
> *(2) - DPA - Device Physical Address. See section 9.19.7.8
>        Function Index 5 - Translate SPA of ACPI spec r6.5.  

NIT: this table contains terms that are defined only in the text
below. The text describing, for instance, ARS, needs to come
before the table. IMO, it needs to contain ReST links to the
text defining what each line/row contains (see below about ReST).

> 
> CXL Memory Scrubbing features
> =============================
> CXL spec r3.1 section 8.2.9.9.11.1 describes the memory device patrol scrub
> control feature. The device patrol scrub proactively locates and makes
> corrections to errors in a regular cycle. The patrol scrub control allows
> the requester to configure the patrol scrubber's input configurations.
> 
> The patrol scrub control allows the requester to specify the number of
> hours in which the patrol scrub cycles must be completed, provided that
> the requested number is not less than the minimum number of hours for the
> patrol scrub cycle that the device is capable of. In addition, the patrol
> scrub controls allow the host to disable and enable the feature in case
> disabling of the feature is needed for other purposes such as
> performance-aware operations which require the background operations to be
> turned off.
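The host-side constraint described above (the requested cycle must not be
below the device minimum, and the feature may be disabled entirely for
performance-aware operation) reduces to a small validation step. A sketch,
where the function name and returned dict are hypothetical and do not model
the actual CXL Set Feature payload:

```python
def request_patrol_scrub(requested_hours: int, device_min_hours: int,
                         enable: bool) -> dict:
    """Validate a patrol scrub request against the device-advertised
    minimum cycle, per CXL r3.1 8.2.9.9.11.1 semantics as summarized in
    the cover letter. Return value is an illustrative stand-in for the
    real mailbox payload."""
    if enable and requested_hours < device_min_hours:
        raise ValueError("requested scrub cycle below device minimum")
    return {"enable": enable,
            "cycle_hours": requested_hours if enable else 0}
```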
> 
> The Error Check Scrub (ECS) is a feature defined in the JEDEC DDR5 SDRAM
> Specification (JESD79-5) which allows the DRAM to internally read data,
> correct single-bit errors, and write the corrected data bits back to the
> DRAM array, while providing transparency into error counts.
> 
> A DDR5 device contains a number of memory media FRUs (Field Replaceable
> Units). The DDR5 ECS feature, and thus the ECS control driver, supports
> configuring the ECS parameters per FRU.
> 
> ACPI RAS2 Hardware-based Memory Scrubbing
> =========================================
> ACPI spec r6.5 section 5.2.21 describes the ACPI RAS2 table, which
> provides interfaces for platform RAS features and supports independent
> RAS controls and capabilities for a given RAS feature for multiple
> instances of the same component in a given system.
> Memory RAS features apply to RAS capabilities, controls and operations
> that are specific to memory. RAS2 PCC sub-spaces for memory-specific RAS
> features have a Feature Type of 0x00 (Memory).
> 
> The platform can use the hardware-based memory scrubbing feature to expose
> controls and capabilities associated with hardware-based memory scrub
> engines. The RAS2 memory scrubbing feature supports the following, as per
> the spec:
>  - Independent memory scrubbing controls for each NUMA domain, identified
>    using its proximity domain.
>    Note: AmpereComputing, however, has a single entry repeated, as they
>          have centralized controls.
>  - Provision for background (patrol) scrubbing of the entire memory system,
>    as well as on-demand scrubbing for a specific region of memory.
> 
> ACPI Address Range Scrubbing (ARS)
> ==================================
> ARS allows the platform to communicate memory errors to system software.
> This capability allows system software to prevent accesses to addresses
> with uncorrectable errors in memory. ARS functions manage all NVDIMMs
> present in the system. Only one scrub can be in progress system wide
> at any given time.
> The following functions are supported as per the specification:
> 1. Query ARS Capabilities for a given address range; indicates whether the
>    platform supports the ACPI NVDIMM Root Device Unconsumed Error
>    Notification.
> 2. Start ARS triggers an Address Range Scrub for the given memory range.
>    Address scrubbing can be done for volatile memory, persistent memory,
>    or both.
> 3. Query ARS Status allows software to get the status of ARS,
>    including the progress of ARS and the ARS error records.
> 4. Clear Uncorrectable Error.
> 5. Translate SPA.
> 6. ARS Error Inject, etc.
> Note: Support for ARS is not added in this series, to reduce the lines of
> code under review; it could be added after the initial code is merged.
> We'd like feedback on whether this is of interest to the ARS community.
> 
> Post Package Repair(PPR)
> ========================
> A PPR (Post Package Repair) maintenance operation requests the memory device
> to perform a repair operation on its media, if supported. A memory device
> may support two types of PPR: Hard PPR (hPPR), for a permanent row repair,
> and Soft PPR (sPPR), for a temporary row repair. sPPR is much faster than
> hPPR, but the repair is lost with a power cycle. During the execution of a
> PPR maintenance operation, a memory device may or may not retain data and
> may or may not be able to process memory requests correctly. An sPPR
> maintenance operation may be executed at runtime, if data is retained and
> memory requests are correctly processed. An hPPR maintenance operation may
> be executed only at boot because data would not be retained.
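The sPPR/hPPR runtime rules above reduce to a small predicate. A sketch
(names are hypothetical; in the real flow, whether the device retains data
and keeps handling requests comes from feature discovery):

```python
def ppr_allowed_at_runtime(ppr_type: str, retains_data: bool,
                           handles_requests: bool) -> bool:
    """Encode the rule from the text: an sPPR operation may run at runtime
    only if the device retains data AND correctly processes memory requests
    during the operation; hPPR is boot-time only."""
    if ppr_type == "hPPR":
        return False               # data would not be retained
    return retains_data and handles_requests
```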
> 
> Use cases of common PPR control feature
> =======================================
> 1. Soft PPR (sPPR) and Hard PPR (hPPR) share similar control interfaces,
> thus there is a requirement for standard generic sysfs PPR controls exposed
> to userspace for seamless control of the PPR features in the system by
> admins/scripts/tools etc.
> 2. When a CXL device identifies a failure on a memory component, the device
> may inform the host about the need for a PPR maintenance operation by using
> an event record, where the maintenance needed flag is set. The event record
> specifies the DPA that should be repaired. The kernel reports the
> corresponding CXL general media or DRAM trace event to userspace. A
> userspace tool, e.g. rasdaemon, initiates a PPR maintenance operation in
> response to the device request using the sysfs PPR control.
> 3. User space tools, e.g. rasdaemon, may request PPR for a memory region
> when an uncorrected memory error or excess corrected memory errors are
> reported for that memory.
> 4. Multiple instances of PPR are likely present per memory device.
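The repair decision rule this cover letter sketches for rasdaemon could look
like the following; the corrected-error threshold is an assumption for
illustration, not a value defined by the series or the CXL spec:

```python
def needs_row_repair(corrected: int, uncorrected: int,
                     maint_flag: bool, ce_threshold: int = 16) -> bool:
    """Illustrative repair policy per the cover letter: request a repair on
    any uncorrected error, on a CXL event record with the maintenance
    needed flag set, or when corrected errors collated for the same row
    reach a threshold (threshold value is an assumption)."""
    return uncorrected > 0 or maint_flag or corrected >= ce_threshold
```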
> 
> Memory Sparing
> ==============
> Memory sparing is defined as a repair function that replaces a portion of
> memory with a portion of functional memory at that same DPA. User space
> tool, e.g. rasdaemon, may request the sparing operation for a given
> address for which the uncorrectable error is reported. In CXL,
> (CXL spec 3.1 section 8.2.9.7.1.4) subclasses for sparing operation vary
> in terms of the scope of the sparing being performed. The cacheline sparing
> subclass refers to a sparing action that can replace a full cacheline.
> Row sparing is provided as an alternative to PPR sparing functions and its
> scope is that of a single DDR row. Bank sparing allows an entire bank to
> be replaced. Rank sparing is defined as an operation in which an entire
> DDR rank is replaced.
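A hedged sketch of how a tool might choose among the sparing subclasses:
prefer the smallest supported scope that covers the failure. The selection
policy here is an assumption; CXL r3.1 section 8.2.9.7.1.4 defines the
subclasses, not how to choose between them:

```python
# Sparing scopes ordered smallest to largest, per CXL r3.1 8.2.9.7.1.4.
SPARING_SCOPES = ["cacheline", "row", "bank", "rank"]

def pick_sparing_subclass(failure_scope: str, supported: list):
    """Return the smallest supported sparing subclass whose scope covers
    the failure, or None if nothing large enough is supported
    (illustrative selection policy, not from the spec)."""
    start = SPARING_SCOPES.index(failure_scope)
    for scope in SPARING_SCOPES[start:]:
        if scope in supported:
            return scope
    return None
```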
> 
> This series adds:
> 1. The EDAC device driver, extended for controlling RAS features; the
>    EDAC scrub driver, EDAC ECS driver and EDAC memory repair driver
>    support memory scrub control, ECS control and memory repair
>    (PPR, sparing) control respectively.
> 2. Several common patches from Dave's cxl/fwctl series.
> 3. Support for CXL feature mailbox commands, which are used by the CXL
>    device scrubbing and memory repair features.
> 4. CXL features driver supporting patrol scrub control (device and
>    region based).
> 5. CXL features driver supporting ECS control feature.
> 6. ACPI RAS2 driver, which adds an OS interface for RAS2 communication
>    through the PCC mailbox, extracts the ACPI RAS2 feature table (RAS2)
>    and creates a platform device for the RAS memory features, which binds
>    to the memory ACPI RAS2 driver.
> 7. Memory ACPI RAS2 driver, which gets the PCC subspace for communicating
>    with an ACPI-compliant platform that supports ACPI RAS2. It adds
>    callback functions and registers with the EDAC device to allow the
>    user to control the HW patrol scrubbers exposed to the kernel via the
>    ACPI RAS2 table.
> 8. Support for the CXL maintenance mailbox command, which is used by the
>    CXL device memory repair feature.
> 9. CXL features driver supporting PPR control feature.
> 10. CXL features driver supporting memory sparing control feature.
>     Note: There are other PPR, memory sparing drivers to come.

The text above should be inside Documentation, and not in patch 0.

A big description like that makes it hard to review this series. It is
also easier to review the text after having it parsed by the kernel doc
build, especially for summary tables like the "Comparison of scrubbing
features", which deserves ReST links processed by Sphinx to the
corresponding definitions of the terms that are compared there.

> Open questions based on feedback from the community:
> 1. Leo: Standardize the unit for scrub rate; for example, ACPI RAS2 does
>    not define a unit for the scrub rate. RAS2 clarification needed.

I noticed the same when reviewing a patch series for rasdaemon. Ideally,
ACPI requires an erratum defining what units are expected for the scrub
rate.

While ACPI doesn't define it, it is better to not add support for it - or
to be conservative, using a low granularity for it (like using minutes
instead of hours).

> 2. Jonathan:
>    - Any need for discoverability of the capability to scan different
>      regions, such as global PA space, to userspace? Left as a future
>      extension.
>    - For EDAC memory repair, is a control attribute for granularity
>      (cacheline/row/bank/rank) needed?
> 
> 3. Jiaqi:
>    - STOP_PATROL_SCRUBBER from RAS2 must be blocked and, must not be exposed to
>      OS/userspace. Stopping patrol scrubber is unacceptable for platform where
>      OEM has enabled patrol scrubber, because the patrol scrubber is a key part
>      of logging and is repurposed for other RAS actions.
>    If the OEM does not want to expose this control, they should lock it down so the
>    interface is not exposed to the OS. These features are optional after all.
>    - "Requested Address Range"/"Actual Address Range" (region to scrub) is a
>       similarly bad thing to expose in RAS2.
>    If the OEM does not want to expose this, they should lock it down so the
>    interface is not exposed to the OS. These features are optional after all.
>    As per the LPC discussion, support for stop, and attributes for the
>    address range, are to be exposed to userspace.
> 4. Borislav:
>    - How will the scrub control exposed to userspace be used?
>      A POC has been added in rasdaemon with dynamic scrub control for CXL
>      memory media errors and memory errors reported to userspace.
>      https://github.com/shijujose4/rasdaemon/tree/scrub_control_6_june_2024
>    - Is the scrub interface sufficient for the use cases?
>    - Who is going to use the scrub controls: tools/admin/scripts?
>      1) Rasdaemon for dynamic control
>      2) Udev script for more static 'defaults' on hotplug etc.
> 5. PPR
>    - For PPR, rasdaemon collates records and decides to replace a row if
>      there are lots of corrected errors, a single uncorrected error, or an
>      error record received with the maintenance request flag set, as in
>      the CXL DRAM error record.
>    - Is sPPR more or less startup-only (so faking hPPR), or actually useful
>      in a running system (if not the safe version that keeps everything
>      running whilst replacement is ongoing)?
>    - Is future-proofing for multiple PPR units useful, given we've mashed
>      together hPPR and sPPR for CXL?
> 
> Implementation
> ==============
> 1. Linux kernel
> Version 18 of the kernel implementation of RAS features control is available at
> https://github.com/shijujose4/linux.git
> Branch: edac-enhancement-ras-features_v18
> 
> Note: Took updated patches for the CXL feature infrastructure and feature
>    commands from Dave's cxl/features branch.
>    https://git.kernel.org/pub/scm/linux/kernel/git/djiang/linux.git/log/?h=cxl/features
> 
>    Apologies to Dave for not waiting long enough for permission to send out
>    his patches in this series, due to the rush.
> 
> 2. QEMU emulation
> QEMU for CXL RAS features implementation is available in, 
> https://gitlab.com/shiju.jose/qemu.git
> Branch: cxl-ras-features-2024-10-24
> 
> 3. Userspace rasdaemon
> The draft version of userspace sample code for dynamic scrub control,
> based on the frequency of memory errors reported to userspace, has been
> added to rasdaemon and enabled, and tested with the CXL device patrol
> scrubbing feature and the ACPI RAS2 scrubbing feature. This required
> updating for the latest sysfs scrub interface.
> https://github.com/shijujose4/rasdaemon/tree/ras_feature_control
> 
> ToDo: For PPR, rasdaemon collates records and decides to replace a row if
> there are lots of corrected errors, a single uncorrected error, or an error
> record received with the maintenance request flag set, as in the CXL DRAM
> error record.
>   
> References:
> 1. ACPI spec r6.5 section 5.2.21 ACPI RAS2.
> 2. ACPI spec r6.5 section 9.19.7.2 ARS.
> 3. CXL spec  r3.1 8.2.9.9.11.1 Device patrol scrub control feature
> 4. CXL spec  r3.1 8.2.9.9.11.2 DDR5 ECS feature
> 5. CXL spec  r3.1 8.2.9.7.1.1 PPR Maintenance Operations
> 6. CXL spec  r3.1 8.2.9.7.2.1 sPPR Feature Discovery and Configuration
> 7. CXL spec  r3.1 8.2.9.7.2.2 hPPR Feature Discovery and Configuration
> 8. Background information about kernel support for memory scan, memory
>    error detection and ACPI RASF.
>    https://lore.kernel.org/all/20221103155029.2451105-1-jiaqiyan@google.com/
> 9. Discussions on RASF:
>    https://lore.kernel.org/lkml/20230915172818.761-1-shiju.jose@huawei.com/#r 
> 
> Changes
> =======
> v17 -> v18:
> 1. Rebased to kernel version 6.13-rc5
> 2. Reordered patches for feedback from Jonathan on v17.
> 3.
> 3.1 Took updated patches for CXL feature infrastructure and feature commands
>    from Dave's cxl/features branch.
>    https://git.kernel.org/pub/scm/linux/kernel/git/djiang/linux.git/log/?h=cxl/features
>    Updated, debug and tested CXL RAS features.
>    
>    Apologies to Dave for not waiting long enough for permission to send out
>    his patches in this series, due to the rush.
>    
> 3.2. RAS features in the cxl/core/memfeature.c updated for interface
>      changes in the CXL feature commands.
> 4. Modified ACPI RAS2 code for the recent interface changes in the
>    PCC mbox code.
> 
> v16 -> v17:
> 1. 
> 1.1 Took several patches for CXL feature commands from Dave's 
>    fwctl/cxl series and add fixes pointed by Jonathan in those patches.
> 1.2. cxl/core/memfeature.c updated for interface changes in the
>    Get Supported Feature, Get Feature and Set Feature functions.
> 1.3. Used the UUID's for RAS features in CXL features code from
>      include/cxl/features.h    
> 2. Changes based on feedbacks from Boris
>  - Added attributes in EDAC memory repair to return the range for DPA
>    and other control attributes, and added callback functions for the
>    DPA range in CXL PPR and memory sparing code, which is the only one
>    supported in the CXL.
>  - Removed 'query' attribute for memory repair feature.
> 
> v15 -> v16:
> 1. Changes and Fixes for feedbacks from Boris
>  - Modified documentations and interface file for EDAC memory repair
>    to add more details and use cases.
>  - Merged documentations to corresponding patches instead of common patch
>    for full documentation for better readability.
>  - Removed 'persist_mode_avail' attribute for memory repair feature.
> 2. Changes for feedback from Dave Jiang
>  - Dave suggested helper function for ECS initialization in cxl/core/memfeature.c,
>    which added for all CXl RAS features, scrub, ECS, PPR and memory sparing features.
>  - Fixed endian conversion pointed by Dave in CXL memory sparing. Also I fixed
>      similar in CXL scrub, ECS and PPR features.
> 3. Changes for feedback from Ni Fan.
>  - Fixed a memory leak in edac_dev_register() for memory repair feature
>    and few suggestions by Ni Fan.
> 
> v14 -> v15:
> 1. Changes and Fixes for feedbacks from Boris
>   - Added documentations for edac features, scrub and memory_repair etc
>     and placed in a separate patch.
>   - Deleted extra 2 attributes for EDAC ECS log_entry_type_per_* and
>     mode_counts_*.
>   - Resolved issues reported in Documentation/ABI/testing/sysfs-edac-ecs.
>   - Deleted unused pr_ftmt() from few files.
>   - Fixed some formatting issues in EDAC ECS code and similar in other files,
>     etc.
> 2. Change for feedback from Dave Jiang
>   - In CXL code for patrol scrub control, Dave suggested replace
>     void *drv_data with a union of parameters in cxl_ps_get_attrs() and
>     similar functions.
>     This is fixed by replacing void *drv_data with the corresponding context
>     structure (struct cxl_patrol_scrub_context) in CXL local functions, as
>     struct cxl_patrol_scrub_context can't be visible in the
>     generic EDAC control interface. Similar changes are made for the CXL
>     ECS, CXL PPR and CXL memory sparing local functions.
> 
> v13 -> v14:
> 1. Changes and Fixes for feedback from Boris
>   - Check grammar of patch description.
>   - Changed scrub control attributes for memory scrub range to "addr" and "size".
>   - Fixed unreached code in edac_dev_register(). 
>   - Removed enable_on_demand attribute from EDAC scrub control and modified
>     RAS2 driver for the same.
>   - Updated ABI documentation for EDAC scrub control.
>     etc.
> 
> 2. Changes for feedback from Greg/Rafael/Jonathan for ACPI RAS2
>   - Replaced platform device creation and binding with
>     auxiliary device creation and binding with ACPI RAS2
>     memory auxiliary driver.
> 
> 3. Changes and Fixes for feedback from Jonathan
>   - Fixed unreached code in edac_dev_register(). 
>   - Optimize callback functions in CXL ECS using macros.
>   - Add readback attributes for the EDAC memory repair feature
>     and add support in the CXL driver for PPR and memory sparing.
>   - Add refactoring in the CXL driver for PPR and memory sparing
>     for query/repair maintenance commands.
>   - Add cxl_dpa_to_region_locked() function.  
>   - Some more cleanups in the ACPI RAS2 and RAS2 memory drivers.
>     etc.
> 
> 4. Changes and Fixes for feedback from Ni Fan
>    - Fixed compilation error - cxl_mem_ras_features_init refined, when CXL components
>      build as module.
> 
> 5. Optimize callback functions in CXL memory sparing using macros.
>    etc.
>    
> v12 -> v13:
> 1. Changes and Fixes for feedback from Boris
>   - Function edac_dev_feat_init() merge with edac_dev_register()
>   - Add macros in EDAC feature specific code for repeated code.
>   - Correct spelling mistakes.
>   - Removed feature specific code from the patch "EDAC: Add support
>     for EDAC device features control"
> 2. Changes for feedbacks from Dave Jiang
>    - Move fields num_features and entries to struct cxl_mailbox,
>      in "cxl: Add Get Supported Features command for kernel usage"
>    - Use series from 
>      https://lore.kernel.org/linux-cxl/20240905223711.1990186-1-dave.jiang@intel.com/   
> 3. Changes and Fixes for feedback from Ni Fan
>    - In documentation scrub* to scrubX, ecs_fru* to ecs_fruX
>    - Corrected some grammar mistakes in the patch headers.
>    - Fixed an error print for min_scrub_cycle_hrs in the CXL patrol scrub
>      code.
>    - Improved an error print in the CXL ECS code.
>    - bool -> tristate for config CXL_RAS_FEAT
> 4. Add support for CXL memory sparing feature.
> 5. Add common EDAC memory repair driver for controlling memory repair
>    features, PPR, memory sparing etc.
> 
> v11 -> v12:
> 1. Changes and Fixes for feedback from Boris mainly for
>     patch "EDAC: Add support for EDAC device features control"
>     and other generic comments.
> 
> 2. Took CXL patches from Dave Jiang for "Add Get Supported Features
>    command for kernel usage" and other related patches. Merged helper
>    functions from this series to the above patch. Modifications of
>    CXL code in this series due to refactoring of CXL mailbox in Dave's
>    patches.
> 
> 3. Modified EDAC scrub control code to support multiple scrub instances
>    per device.
> 
> v10 -> v11:
> 1. Feedback from Borislav:
>    - Add generic EDAC code for control device features to
>      /drivers/edac/edac_device.c.
>    - Add common structure in edac for device feature's data.
>    
> 2. Some more optimizations in generic EDAC code for control
>    device features.
> 
> 3. Changes for feedback from Fan for ACPI RAS2 memory driver.
> 
> 4. Add support for control memory PPR (Post Package Repair) features
>    in EDAC.
>    
> 5. Add support for maintenance command in the CXL mailbox code,
>    which is needed for support PPR features in CXL driver.  
> 
> 6. Add support for control memory PPR (Post Package Repair) features
>    and do perform PPR maintenance operation in CXL driver.
> 
> 7. Rename drivers/cxl/core/memscrub.c to drivers/cxl/core/memfeature.c
> 
> v9 -> v10:
> 1. Feedback from Mauro Carvalho Chehab:
>    - Changes suggested in EDAC RAS feature driver.
>      use uppercase for enums, if else to switch-case, documentation for
>      static scrub and ecs init functions etc.
>    - Changes suggested in EDAC scrub.
>      unit of scrub cycle hour to seconds.
>      attribute node cycle_in_hours_available to min_cycle_duration and 
>      max_cycle_duration.
>      attribute node cycle_in_hours to current_cycle_duration.
>      Use base 0 for kstrtou64() and kstrtol() functions.
>      etc.
>    - Changes suggested in EDAC ECS.
>      uppercase for enums
>      add ABI documentation. etc
>         
> 2. Feedback from Fan:
>    - Changes suggested in EDAC RAS feature driver.
>      use uppercase for enums, change if...else to switch-case. 
>      some optimization in edac_ras_dev_register() function
>      add missing goto free_ctx
>    - Changes suggested in the code for feature commands.  
>    - CXL driver scrub and ECS code
>      use uppercase for enums, fix typo, use enum type for mode
>      fix long lines etc.
>        
> v8 -> v9:
> 1. Feedback from Borislav:
>    - Add scrub control driver to the EDAC on feedback from Borislav.
>    - Changed DEVICE_ATTR_..() static.
>    - Changed the write permissions for scrub control sysfs files as
>      root-only.
> 2. Feedback from Fan:
>    - Optimized cxl_get_feature() function by using min() and removed
>      feat_out_min_size.
>    - Removed unreached return from cxl_set_feature() function.
>    - Changed the term  "rate" to "cycle_in_hours" in all the
>      scrub control code.
>    - Allow cxl_mem_probe() continue if cxl_mem_patrol_scrub_init() fail,
>      with just a debug warning.
>       
> 3. Feedback from Jonathan:
>    - Removed patch __free() based cleanup function for acpi_put_table.
>      and added fix in the acpi RAS2 driver.
> 
> 4. Feedback from Dan Williams:
>    - Allow cxl_mem_probe() continue if cxl_mem_patrol_scrub_init() fail,
>      with just a debug warning.
>    - Add support for CXL region based scrub control.
> 
> 5. Feedback from Daniel Ferguson on RAS2 drivers:
>     In the ACPI RAS2 driver,
>   - Incorporated the changes given for clearing error reported.
>   - Incorporated the changes given for check the Set RAS Capability
>     status and return an appropriate error.
>     In the RAS2 memory driver,
>   - Added more checks for start/stop bg and on-demand scrubbing
>     so that addr range in cache do not get cleared and restrict
>     permitted operations during scrubbing.
> 
> History for v1 to v8 is available here.
> https://lore.kernel.org/lkml/20240726160556.2079-1-shiju.jose@huawei.com/
> 
> 
> 
> Dave Jiang (6):
>   cxl: Refactor user ioctl command path from mds to mailbox
>   cxl: Add skeletal features driver
>   cxl: Enumerate feature commands
>   cxl: Add Get Supported Features command for kernel usage
>   cxl: Add features driver attribute to emit number of features
>     supported
>   cxl: Setup exclusive CXL features that are reserved for the kernel
> 
> Shiju Jose (13):
>   EDAC: Add support for EDAC device features control
>   EDAC: Add scrub control feature
>   EDAC: Add ECS control feature
>   EDAC: Add memory repair control feature
>   ACPI:RAS2: Add ACPI RAS2 driver
>   ras: mem: Add memory ACPI RAS2 driver
>   cxl/mbox: Add GET_FEATURE mailbox command
>   cxl/mbox: Add SET_FEATURE mailbox command
>   cxl/memfeature: Add CXL memory device patrol scrub control feature
>   cxl/memfeature: Add CXL memory device ECS control feature
>   cxl/mbox: Add support for PERFORM_MAINTENANCE mailbox command
>   cxl/memfeature: Add CXL memory device soft PPR control feature
>   cxl/memfeature: Add CXL memory device memory sparing control feature
> 
>  Documentation/ABI/testing/sysfs-edac-ecs      |   63 +
>  .../ABI/testing/sysfs-edac-memory-repair      |  244 +++
>  Documentation/ABI/testing/sysfs-edac-scrub    |   74 +
>  Documentation/edac/features.rst               |  102 ++
>  Documentation/edac/index.rst                  |   12 +
>  Documentation/edac/memory_repair.rst          |  249 +++
>  Documentation/edac/scrub.rst                  |  393 ++++
>  drivers/acpi/Kconfig                          |   11 +
>  drivers/acpi/Makefile                         |    1 +
>  drivers/acpi/ras2.c                           |  407 ++++
>  drivers/cxl/Kconfig                           |   25 +
>  drivers/cxl/Makefile                          |    3 +
>  drivers/cxl/core/Makefile                     |    2 +
>  drivers/cxl/core/core.h                       |    7 +-
>  drivers/cxl/core/features.c                   |  287 +++
>  drivers/cxl/core/mbox.c                       |  167 +-
>  drivers/cxl/core/memdev.c                     |   22 +-
>  drivers/cxl/core/memfeature.c                 | 1631 +++++++++++++++++
>  drivers/cxl/core/port.c                       |    3 +
>  drivers/cxl/core/region.c                     |    6 +
>  drivers/cxl/cxl.h                             |    3 +
>  drivers/cxl/cxlmem.h                          |   67 +-
>  drivers/cxl/features.c                        |  215 +++
>  drivers/cxl/mem.c                             |    5 +
>  drivers/cxl/pci.c                             |   19 +
>  drivers/edac/Makefile                         |    1 +
>  drivers/edac/ecs.c                            |  207 +++
>  drivers/edac/edac_device.c                    |  183 ++
>  drivers/edac/mem_repair.c                     |  492 +++++
>  drivers/edac/scrub.c                          |  209 +++
>  drivers/ras/Kconfig                           |   10 +
>  drivers/ras/Makefile                          |    1 +
>  drivers/ras/acpi_ras2.c                       |  385 ++++
>  include/acpi/ras2_acpi.h                      |   45 +
>  include/cxl/features.h                        |  171 ++
>  include/cxl/mailbox.h                         |   45 +-
>  include/linux/edac.h                          |  238 +++
>  tools/testing/cxl/Kbuild                      |    1 +
>  38 files changed, 5909 insertions(+), 97 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-edac-ecs
>  create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
>  create mode 100644 Documentation/ABI/testing/sysfs-edac-scrub
>  create mode 100644 Documentation/edac/features.rst
>  create mode 100644 Documentation/edac/index.rst
>  create mode 100644 Documentation/edac/memory_repair.rst
>  create mode 100644 Documentation/edac/scrub.rst
>  create mode 100755 drivers/acpi/ras2.c
>  create mode 100644 drivers/cxl/core/features.c
>  create mode 100644 drivers/cxl/core/memfeature.c
>  create mode 100644 drivers/cxl/features.c
>  create mode 100755 drivers/edac/ecs.c
>  create mode 100755 drivers/edac/mem_repair.c
>  create mode 100755 drivers/edac/scrub.c
>  create mode 100644 drivers/ras/acpi_ras2.c
>  create mode 100644 include/acpi/ras2_acpi.h
>  create mode 100644 include/cxl/features.h
> 



Thanks,
Mauro

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 01/19] EDAC: Add support for EDAC device features control
  2025-01-06 12:09 ` [PATCH v18 01/19] EDAC: Add support for EDAC device features control shiju.jose
  2025-01-06 13:37   ` Borislav Petkov
@ 2025-01-13 15:06   ` Mauro Carvalho Chehab
  2025-01-14  9:55     ` Jonathan Cameron
  2025-01-14 10:08     ` Shiju Jose
  2025-01-30 19:18   ` Daniel Ferguson
  2 siblings, 2 replies; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-13 15:06 UTC (permalink / raw)
  To: shiju.jose
  Cc: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel, bp,
	tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

On Mon, 6 Jan 2025 12:09:57 +0000
<shiju.jose@huawei.com> wrote:

> From: Shiju Jose <shiju.jose@huawei.com>
> 
> Add generic EDAC device feature controls supporting the registration
> of RAS features available in the system. The driver exposes control
> attributes for these features to userspace in
> /sys/bus/edac/devices/<dev-name>/<ras-feature>/
> 
> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> ---
>  Documentation/edac/features.rst |  94 ++++++++++++++++++++++++++++++
>  Documentation/edac/index.rst    |  10 ++++
>  drivers/edac/edac_device.c      | 100 ++++++++++++++++++++++++++++++++
>  include/linux/edac.h            |  28 +++++++++
>  4 files changed, 232 insertions(+)
>  create mode 100644 Documentation/edac/features.rst
>  create mode 100644 Documentation/edac/index.rst
> 
> diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
> new file mode 100644
> index 000000000000..f32f259ce04d
> --- /dev/null
> +++ b/Documentation/edac/features.rst
> @@ -0,0 +1,94 @@
> +.. SPDX-License-Identifier: GPL-2.0

SPDX should match what's written there, e. g.

	.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later

Please notice that GNU FDL family contains both open source and non-open
source licenses. The open-source one is this:

	https://spdx.org/licenses/GFDL-1.2-no-invariants-or-later.html

E.g. this license permits changing the entire document in the
future, as there are no invariant parts in it.

> +
> +============================================
> +Augmenting EDAC for controlling RAS features
> +============================================
> +
> +Copyright (c) 2024 HiSilicon Limited.

2024-2025?

> +
> +:Author:   Shiju Jose <shiju.jose@huawei.com>
> +:License:  The GNU Free Documentation License, Version 1.2
> +          (dual licensed under the GPL v2)

You need to define if invariant parts are allowed or not, e. g.:

	:License: The GNU Free Documentation License, Version 1.2 without Invariant Sections, Front-Cover Texts nor Back-Cover Texts.
		  (dual licensed under the GPL v2)


> +:Original Reviewers:
> +
> +- Written for: 6.14
> +
> +Introduction
> +------------
> +The expansion of EDAC for controlling RAS features and exposing features
> +control attributes to userspace via sysfs. Some Examples:
> +
> +* Scrub control
> +
> +* Error Check Scrub (ECS) control
> +
> +* ACPI RAS2 features
> +
> +* Post Package Repair (PPR) control
> +
> +* Memory Sparing Repair control etc.
> +
> +High level design is illustrated in the following diagram::
> +
> +         _______________________________________________
> +        |   Userspace - Rasdaemon                       |
> +        |  _____________                                |
> +        | | RAS CXL mem |      _______________          |
> +        | |error handler|---->|               |         |
> +        | |_____________|     | RAS dynamic   |         |
> +        |  _____________      | scrub, memory |         |
> +        | | RAS memory  |---->| repair control|         |
> +        | |error handler|     |_______________|         |
> +        | |_____________|          |                    |
> +        |__________________________|____________________|
> +                                   |
> +                                   |
> +    _______________________________|______________________________
> +   |     Kernel EDAC extension for | controlling RAS Features     |
> +   | ______________________________|____________________________  |
> +   || EDAC Core          Sysfs EDAC| Bus                        | |
> +   ||    __________________________|_________     _____________ | |
> +   ||   |/sys/bus/edac/devices/<dev>/scrubX/ |   | EDAC device || |
> +   ||   |/sys/bus/edac/devices/<dev>/ecsX/   |<->| EDAC MC     || |
> +   ||   |/sys/bus/edac/devices/<dev>/repairX |   | EDAC sysfs  || |
> +   ||   |____________________________________|   |_____________|| |
> +   ||                           EDAC|Bus                        | |
> +   ||                               |                           | |
> +   ||    __________ Get feature     |      Get feature          | |
> +   ||   |          |desc   _________|______ desc  __________    | |
> +   ||   |EDAC scrub|<-----| EDAC device    |     |          |   | |
> +   ||   |__________|      | driver- RAS    |---->| EDAC mem |   | |
> +   ||    __________       | feature control|     | repair   |   | |
> +   ||   |          |<-----|________________|     |__________|   | |
> +   ||   |EDAC ECS  |    Register RAS|features                   | |
> +   ||   |__________|                |                           | |
> +   ||         ______________________|_____________              | |
> +   ||_________|_______________|__________________|______________| |
> +   |   _______|____    _______|_______       ____|__________      |
> +   |  |            |  | CXL mem driver|     | Client driver |     |
> +   |  | ACPI RAS2  |  | scrub, ECS,   |     | memory repair |     |
> +   |  | driver     |  | sparing, PPR  |     | features      |     |
> +   |  |____________|  |_______________|     |_______________|     |
> +   |        |                 |                    |              |
> +   |________|_________________|____________________|______________|
> +            |                 |                    |
> +    ________|_________________|____________________|______________
> +   |     ___|_________________|____________________|_______       |
> +   |    |                                                  |      |
> +   |    |            Platform HW and Firmware              |      |
> +   |    |__________________________________________________|      |
> +   |______________________________________________________________|
> +
> +
> +1. EDAC Features components - Create feature specific descriptors.
> +For example, EDAC scrub, EDAC ECS, EDAC memory repair in the above
> +diagram.
> +
> +2. EDAC device driver for controlling RAS Features - Get feature's attribute
> +descriptors from EDAC RAS feature component and registers device's RAS
> +features with EDAC bus and exposes the features control attributes via
> +the sysfs EDAC bus. For example, /sys/bus/edac/devices/<dev-name>/<feature>X/
> +
> +3. RAS dynamic feature controller - Userspace sample modules in rasdaemon for
> +dynamic scrub/repair control to issue scrubbing/repair when excess number
> +of corrected memory errors are reported in a short span of time.
> diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
> new file mode 100644
> index 000000000000..b6c265a4cffb
> --- /dev/null
> +++ b/Documentation/edac/index.rst
> @@ -0,0 +1,10 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==============
> +EDAC Subsystem
> +==============
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   features
> diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
> index 621dc2a5d034..9fce46dd7405 100644
> --- a/drivers/edac/edac_device.c
> +++ b/drivers/edac/edac_device.c
> @@ -570,3 +570,103 @@ void edac_device_handle_ue_count(struct edac_device_ctl_info *edac_dev,
>  		      block ? block->name : "N/A", count, msg);
>  }
>  EXPORT_SYMBOL_GPL(edac_device_handle_ue_count);
> +
> +static void edac_dev_release(struct device *dev)
> +{
> +	struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev);
> +
> +	kfree(ctx->dev.groups);
> +	kfree(ctx);
> +}
> +
> +const struct device_type edac_dev_type = {
> +	.name = "edac_dev",
> +	.release = edac_dev_release,
> +};
> +
> +static void edac_dev_unreg(void *data)
> +{
> +	device_unregister(data);
> +}
> +
> +/**
> + * edac_dev_register - register device for RAS features with EDAC
> + * @parent: parent device.
> + * @name: parent device's name.
> + * @private: parent driver's data to store in the context if any.
> + * @num_features: number of RAS features to register.
> + * @ras_features: list of RAS features to register.
> + *
> + * Return:
> + *  * %0       - Success.
> + *  * %-EINVAL - Invalid parameters passed.
> + *  * %-ENOMEM - Dynamic memory allocation failed.
> + *
> + */
> +int edac_dev_register(struct device *parent, char *name,
> +		      void *private, int num_features,
> +		      const struct edac_dev_feature *ras_features)
> +{
> +	const struct attribute_group **ras_attr_groups;
> +	struct edac_dev_feat_ctx *ctx;
> +	int attr_gcnt = 0;
> +	int ret, feat;
> +
> +	if (!parent || !name || !num_features || !ras_features)
> +		return -EINVAL;
> +
> +	/* Double parse to make space for attributes */
> +	for (feat = 0; feat < num_features; feat++) {
> +		switch (ras_features[feat].ft_type) {
> +		/* Add feature specific code */
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx)
> +		return -ENOMEM;
> +
> +	ras_attr_groups = kcalloc(attr_gcnt + 1, sizeof(*ras_attr_groups), GFP_KERNEL);
> +	if (!ras_attr_groups) {
> +		ret = -ENOMEM;
> +		goto ctx_free;
> +	}
> +
> +	attr_gcnt = 0;
> +	for (feat = 0; feat < num_features; feat++, ras_features++) {
> +		switch (ras_features->ft_type) {
> +		/* Add feature specific code */
> +		default:
> +			ret = -EINVAL;
> +			goto groups_free;
> +		}
> +	}
> +
> +	ctx->dev.parent = parent;
> +	ctx->dev.bus = edac_get_sysfs_subsys();
> +	ctx->dev.type = &edac_dev_type;
> +	ctx->dev.groups = ras_attr_groups;
> +	ctx->private = private;
> +	dev_set_drvdata(&ctx->dev, ctx);
> +
> +	ret = dev_set_name(&ctx->dev, name);
> +	if (ret)
> +		goto groups_free;
> +
> +	ret = device_register(&ctx->dev);
> +	if (ret) {
> +		put_device(&ctx->dev);

> +		return ret;

Since registration failed, this needs to be changed to a goto groups_free,
as edac_dev_release() won't be called.

> +	}
> +
> +	return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev);
> +
> +groups_free:
> +	kfree(ras_attr_groups);
> +ctx_free:
> +	kfree(ctx);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(edac_dev_register);
> diff --git a/include/linux/edac.h b/include/linux/edac.h
> index b4ee8961e623..521b17113d4d 100644
> --- a/include/linux/edac.h
> +++ b/include/linux/edac.h
> @@ -661,4 +661,32 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
>  
>  	return mci->dimms[index];
>  }
> +
> +#define EDAC_FEAT_NAME_LEN	128

This macro was not used on this patch.

> +
> +/* RAS feature type */
> +enum edac_dev_feat {
> +	RAS_FEAT_MAX
> +};
> +
> +/* EDAC device feature information structure */
> +struct edac_dev_data {
> +	u8 instance;
> +	void *private;
> +};
> +
> +struct edac_dev_feat_ctx {
> +	struct device dev;
> +	void *private;
> +};
> +
> +struct edac_dev_feature {
> +	enum edac_dev_feat ft_type;
> +	u8 instance;
> +	void *ctx;
> +};
> +
> +int edac_dev_register(struct device *parent, char *dev_name,
> +		      void *parent_pvt_data, int num_features,
> +		      const struct edac_dev_feature *ras_features);
>  #endif /* _LINUX_EDAC_H_ */

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers
  2025-01-13 14:46 ` [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers Mauro Carvalho Chehab
@ 2025-01-13 15:36   ` Jonathan Cameron
  2025-01-14 14:06     ` Mauro Carvalho Chehab
  2025-01-13 18:15   ` Shiju Jose
  1 sibling, 1 reply; 87+ messages in thread
From: Jonathan Cameron @ 2025-01-13 15:36 UTC (permalink / raw)
  To: Mauro Carvalho Chehab, linux-acpi, linux-mm, vishal.l.verma
  Cc: shiju.jose, linux-edac, linux-cxl, linux-kernel, bp, tony.luck,
	rafael, lenb, mchehab, dan.j.williams, dave, dave.jiang,
	alison.schofield, ira.weiny, david, Vilas.Sridharan, leo.duran,
	Yazen.Ghannam, rientjes, jiaqiyan, Jon.Grimm, dave.hansen,
	naoya.horiguchi, james.morse, jthoughton, somasundaram.a,
	erdemaktas, pgonda, duenwen, gthelen, wschwartz, dferguson, wbs,
	nifan.cxl, tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm


> >    
> > 5. CXL features driver supporting ECS control feature.
> > 6. ACPI RAS2 driver adds OS interface for RAS2 communication through
> >    PCC mailbox and extracts ACPI RAS2 feature table (RAS2) and
> >    create platform device for the RAS memory features, which binds
> >    to the memory ACPI RAS2 driver.
> > 7. Memory ACPI RAS2 driver gets the PCC subspace for communicating
> >    with the ACPI compliant platform supports ACPI RAS2. Add callback
> >    functions and registers with EDAC device to support user to
> >    control the HW patrol scrubbers exposed to the kernel via the
> >    ACPI RAS2 table.
> > 8. Support for CXL maintenance mailbox command, which is used by
> >    CXL device memory repair feature.   
> > 9. CXL features driver supporting PPR control feature.
> > 10. CXL features driver supporting memory sparing control feature.
> >     Note: There are other PPR, memory sparing drivers to come.  
> 
> The text above should be inside Documentation, and not on patch 0.
> 
> A big description like that makes it hard to review this series. It is
> also easier to review the text after having it parsed by the kernel doc
> build, especially for summary tables like the "Comparison of scrubbing
> features", which deserves ReST links processed by Sphinx to the
> corresponding definitions of the terms that are compared there.

Whilst I fully agree that having a huge cover letter is a burden
for any reviewer coming to the series, it is here at the specific request
of reviewers.  We can look at keeping more of it in documentation, though
it's a bit white-paper-like in comparison with what I'd normally expect
to see in kernel documentation.

> 
> > Open Questions based on feedbacks from the community:
> > 1. Leo: Standardize unit for scrub rate, for example ACPI RAS2 does not define
> >    unit for the scrub rate. RAS2 clarification needed.   
> 
> I noticed the same when reviewing a patch series for rasdaemon. Ideally,
> ACPI requires an errata defining what units are expected for scrub rate.

There is a code first ACPI ECN that indeed adds units.  That is accepted
for next ACPI specification release.

Seems the tianocore bugzilla is unhelpfully down for a migration
but it should be id 1013 at bugzilla.tianocore.com

That adds a detailed description of what the scrub rate settings mean but
we may well still have older platforms where the scaling is arbitrary.
The units defined are sufficient to map to whatever presentation we like.

> 
> While ACPI doesn't define it, better to not add support for it - or be
> conservative using a low granularity for it (like using minutes instead 
> of hours).

I don't mind changing this, though for the systems we are aware of, the
default scrub rate is typically once or twice in 24 hours.
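For reference, with seconds as the cycle unit that default maps
trivially onto the proposed sysfs ABI; a rough sketch (the sysfs path
is hypothetical and the <dev-name> is platform dependent):

```shell
# One full scrub pass per 24 hours, expressed as a cycle duration
# in seconds, which is what current_cycle_duration takes.
cycle=$((24 * 60 * 60))
echo "$cycle"   # 86400
# Hypothetical path -- depends on how the platform registers the device:
# echo "$cycle" > /sys/bus/edac/devices/<dev-name>/scrub0/current_cycle_duration
```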

Jonathan


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 02/19] EDAC: Add scrub control feature
  2025-01-06 12:09 ` [PATCH v18 02/19] EDAC: Add scrub control feature shiju.jose
  2025-01-06 15:57   ` Borislav Petkov
@ 2025-01-13 15:50   ` Mauro Carvalho Chehab
  2025-01-30 19:18   ` Daniel Ferguson
  2 siblings, 0 replies; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-13 15:50 UTC (permalink / raw)
  To: shiju.jose
  Cc: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel, bp,
	tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

On Mon, 6 Jan 2025 12:09:58 +0000
<shiju.jose@huawei.com> wrote:

> From: Shiju Jose <shiju.jose@huawei.com>
> 
> Add a generic EDAC scrub control to manage memory scrubbers in the system.
> Devices with a scrub feature register with the EDAC device driver, which
> retrieves the scrub descriptor from the EDAC scrub driver and exposes the
> sysfs scrub control attributes for a scrub instance to userspace at
> /sys/bus/edac/devices/<dev-name>/scrubX/.
> 
> The common sysfs scrub control interface abstracts the control of
> arbitrary scrubbing functionality into a common set of functions. The
> sysfs scrub attribute nodes are only present if the client driver has
> implemented the corresponding attribute callback function and passed the
> operations(ops) to the EDAC device driver during registration.
> 
> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> ---
>  Documentation/ABI/testing/sysfs-edac-scrub |  74 +++++++
>  Documentation/edac/features.rst            |   5 +
>  Documentation/edac/index.rst               |   1 +
>  Documentation/edac/scrub.rst               | 244 +++++++++++++++++++++
>  drivers/edac/Makefile                      |   1 +
>  drivers/edac/edac_device.c                 |  41 +++-
>  drivers/edac/scrub.c                       | 209 ++++++++++++++++++
>  include/linux/edac.h                       |  34 +++
>  8 files changed, 605 insertions(+), 4 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-edac-scrub
>  create mode 100644 Documentation/edac/scrub.rst
>  create mode 100755 drivers/edac/scrub.c
> 
> diff --git a/Documentation/ABI/testing/sysfs-edac-scrub b/Documentation/ABI/testing/sysfs-edac-scrub
> new file mode 100644
> index 000000000000..af14a68ee5a9
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-edac-scrub
> @@ -0,0 +1,74 @@
> +What:		/sys/bus/edac/devices/<dev-name>/scrubX
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		The sysfs EDAC bus devices /<dev-name>/scrubX subdirectory
> +		belongs to an instance of memory scrub control feature,
> +		where <dev-name> directory corresponds to a device/memory
> +		region registered with the EDAC device driver for the
> +		scrub control feature.
> +		The sysfs scrub attr nodes are only present if the parent
> +		driver has implemented the corresponding attr callback
> +		function and provided the necessary operations to the EDAC
> +		device driver during registration.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/scrubX/addr
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) The base address of the memory region to be scrubbed
> +		for on-demand scrubbing. Setting address would start
> +		scrubbing. The size must be set before that.
> +		The readback addr value would be non-zero if the requested
> +		on-demand scrubbing is in progress, zero otherwise.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/scrubX/size
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) The size of the memory region to be scrubbed
> +		(on-demand scrubbing).
> +
> +What:		/sys/bus/edac/devices/<dev-name>/scrubX/enable_background
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) Start/Stop background(patrol) scrubbing if supported.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/scrubX/enable_on_demand
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) Start/Stop on-demand scrubbing the memory region
> +		if supported.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/scrubX/min_cycle_duration
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RO) Supported minimum scrub cycle duration in seconds
> +		by the memory scrubber.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/scrubX/max_cycle_duration
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RO) Supported maximum scrub cycle duration in seconds
> +		by the memory scrubber.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/scrubX/current_cycle_duration
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) The current scrub cycle duration in seconds and must be
> +		within the supported range by the memory scrubber.
> +		Scrub has an overhead when running and that may want to be
> +		reduced by taking longer to do it.
> diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
> index f32f259ce04d..ba3ab993ee4f 100644
> --- a/Documentation/edac/features.rst
> +++ b/Documentation/edac/features.rst
> @@ -92,3 +92,8 @@ the sysfs EDAC bus. For example, /sys/bus/edac/devices/<dev-name>/<feature>X/
>  3. RAS dynamic feature controller - Userspace sample modules in rasdaemon for
>  dynamic scrub/repair control to issue scrubbing/repair when excess number
>  of corrected memory errors are reported in a short span of time.
> +
> +RAS features
> +------------
> +1. Memory Scrub
> +Memory scrub features are documented in `Documentation/edac/scrub.rst`.
> diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
> index b6c265a4cffb..dfb0c9fb9ab1 100644
> --- a/Documentation/edac/index.rst
> +++ b/Documentation/edac/index.rst
> @@ -8,3 +8,4 @@ EDAC Subsystem
>     :maxdepth: 1
>  
>     features
> +   scrub
> diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
> new file mode 100644
> index 000000000000..5a5108b744a4
> --- /dev/null
> +++ b/Documentation/edac/scrub.rst
> @@ -0,0 +1,244 @@
> +.. SPDX-License-Identifier: GPL-2.0

Same note as on patch 1, e. g.:

	.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later


> +
> +===================
> +EDAC Scrub Control
> +===================
> +
> +Copyright (c) 2024 HiSilicon Limited.
> +
> +:Author:   Shiju Jose <shiju.jose@huawei.com>
> +:License:  The GNU Free Documentation License, Version 1.2
> +          (dual licensed under the GPL v2)

See my notes on patch 1.

I won't repeat them for the other patches in this series touching
documentation.

> +:Original Reviewers:
> +
> +- Written for: 6.14
> +
> +Introduction
> +------------
> +Increasing DRAM size and cost have made memory subsystem reliability an
> +important concern. These modules are used where potentially corrupted data
> +could cause expensive or fatal issues. Memory errors are among the top
> +hardware failures that cause server and workload crashes.
> +
> +Memory scrubbing is a feature where an ECC (Error-Correcting Code) engine
> +reads data from each memory media location, corrects with an ECC if
> +necessary and writes the corrected data back to the same memory media
> +location.
> +
> +The memory DIMMs can be scrubbed at a configurable rate to detect
> +uncorrected memory errors and attempt recovery from detected errors,
> +providing the following benefits.
> +
> +* Proactively scrubbing memory DIMMs reduces the chance of a correctable error becoming uncorrectable.
> +
> +* When detected, uncorrected errors caught in unallocated memory pages are isolated and prevented from being allocated to an application or the OS.
> +
> +* This reduces the likelihood of software or hardware products encountering memory errors.
> +
> +There are 2 types of memory scrubbing:
> +
> +1. Background (patrol) scrubbing of the RAM while the RAM is otherwise
> +idle.
> +
> +2. On-demand scrubbing for a specific address range or region of memory.
> +
> +Several types of interfaces to hardware memory scrubbers have been
> +identified, such as CXL memory device patrol scrub, CXL DDR5 ECS, ACPI
> +RAS2 memory scrubbing, and ACPI NVDIMM ARS (Address Range Scrub).
> +
> +The control mechanisms vary across different memory scrubbers. To enable
> +standardized userspace tooling, there is a need to present these controls
> +through a standardized ABI.
> +
> +Introduce a generic memory EDAC scrub control that allows users to manage
> +underlying scrubbers in the system through a standardized sysfs scrub
> +control interface. This common sysfs scrub control interface abstracts the
> +management of various scrubbing functionalities into a unified set of
> +functions.
> +
> +Use cases of common scrub control feature
> +-----------------------------------------
> +1. Several types of interfaces for hardware (HW) memory scrubbers have
> +been identified, including the CXL memory device patrol scrub, CXL DDR5
> +ECS, ACPI RAS2 memory scrubbing features, ACPI NVDIMM ARS (Address Range
> +Scrub), and software-based memory scrubbers. Some of these scrubbers
> +support control over patrol (background) scrubbing (e.g., ACPI RAS2, CXL)
> +and/or on-demand scrubbing (e.g., ACPI RAS2, ACPI ARS). However, the scrub
> +control interfaces vary between memory scrubbers, highlighting the need for
> +a standardized, generic sysfs scrub control interface that is accessible to
> +userspace for administration and use by scripts/tools.
> +
> +2. User-space scrub controls allow users to disable scrubbing if necessary,
> +for example, to disable background patrol scrubbing or adjust the scrub
> +rate for performance-aware operations where background activities need to
> +be minimized or disabled.
> +
> +3. User-space tools enable on-demand scrubbing for specific address ranges,
> +provided that the scrubber supports this functionality.
> +
> +4. User-space tools can also control memory DIMM scrubbing at a configurable
> +scrub rate via sysfs scrub controls. This approach offers several benefits:
> +
> +* Detects uncorrectable memory errors early, before user access to affected memory, helping facilitate recovery.
> +
> +* Reduces the likelihood of correctable errors developing into uncorrectable errors.
> +
> +5. Policy control for hotplugged memory is necessary because there may not
> +be a system-wide BIOS or similar control to manage scrub settings for a CXL
> +device added after boot. Determining these settings is a policy decision,
> +balancing reliability against performance, so userspace should control it.
> +Therefore, a unified interface is recommended for handling this function in
> +a way that aligns with other similar interfaces, rather than creating a
> +separate one.
> +
> +Scrubbing features
> +------------------
> +Comparison of various scrubbing features::
> +
> + ................................................................
> + .              .   ACPI    . CXL patrol.  CXL ECS  .  ARS      .
> + .  Name        .   RAS2    . scrub     .           .           .
> + ................................................................
> + .              .           .           .           .           .
> + . On-demand    . Supported . No        . No        . Supported .
> + . Scrubbing    .           .           .           .           .
> + .              .           .           .           .           .
> + ................................................................
> + .              .           .           .           .           .
> + . Background   . Supported . Supported . Supported . No        .
> + . scrubbing    .           .           .           .           .
> + .              .           .           .           .           .
> + ................................................................
> + .              .           .           .           .           .
> + . Mode of      . Scrub ctrl. per device. per memory.  Unknown  .
> + . scrubbing    . per NUMA  .           . media     .           .
> + .              . domain.   .           .           .           .
> + ................................................................
> + .              .           .           .           .           .
> + . Query scrub  . Supported . Supported . Supported . Supported .
> + . capabilities .           .           .           .           .
> + .              .           .           .           .           .
> + ................................................................
> + .              .           .           .           .           .
> + . Setting      . Supported . No        . No        . Supported .
> + . address range.           .           .           .           .
> + .              .           .           .           .           .
> + ................................................................
> + .              .           .           .           .           .
> + . Setting      . Supported . Supported . No        . No        .
> + . scrub rate   .           .           .           .           .
> + .              .           .           .           .           .
> + ................................................................
> + .              .           .           .           .           .
> + . Unit for     . Not       . in hours  . No        . No        .
> + . scrub rate   . Defined   .           .           .           .
> + .              .           .           .           .           .
> + ................................................................
> + .              . Supported .           .           .           .
> + . Scrub        . on-demand . No        . No        . Supported .
> + . status/      . scrubbing .           .           .           .
> + . Completion   . only      .           .           .           .
> + ................................................................
> + . UC error     .           .CXL general.CXL general. ACPI UCE  .
> + . reporting    . Exception .media/DRAM .media/DRAM . notify and.
> + .              .           .event/media.event/media. query     .
> + .              .           .scan?      .scan?      . ARS status.
> + ................................................................
> + .              .           .           .           .           .
> + . Support for  . Supported . Supported . Supported . No        .
> + . EDAC control .           .           .           .           .
> + .              .           .           .           .           .
> + ................................................................

Please format this as a table, in ReST, e.g.:

Scrubbing features
------------------
Comparison of various scrubbing features:

+--------------+-----------+-----------+-----------+-----------+
| Name         | ACPI      | CXL patrol| CXL ECS   | ARS       |
|              | RAS2      | scrub     |           |           |
+==============+===========+===========+===========+===========+
|              |           |           |           |           |
| On-demand    | Supported | No        | No        | Supported |
| scrubbing    |           |           |           |           |
|              |           |           |           |           |
+--------------+-----------+-----------+-----------+-----------+
|              |           |           |           |           |
| Background   | Supported | Supported | Supported | No        |
| scrubbing    |           |           |           |           |
|              |           |           |           |           |
+--------------+-----------+-----------+-----------+-----------+
|              |           |           |           |           |
| Mode of      | Scrub ctrl| per device| per memory|  Unknown  |
| scrubbing    | per NUMA  |           | media     |           |
|              | domain    |           |           |           |
+--------------+-----------+-----------+-----------+-----------+
|              |           |           |           |           |
| Query scrub  | Supported | Supported | Supported | Supported |
| capabilities |           |           |           |           |
|              |           |           |           |           |
+--------------+-----------+-----------+-----------+-----------+
|              |           |           |           |           |
| Setting      | Supported | No        | No        | Supported |
| address range|           |           |           |           |
|              |           |           |           |           |
+--------------+-----------+-----------+-----------+-----------+
|              |           |           |           |           |
| Setting      | Supported | Supported | No        | No        |
| scrub rate   |           |           |           |           |
|              |           |           |           |           |
+--------------+-----------+-----------+-----------+-----------+
|              |           |           |           |           |
| Unit for     | Not       | in hours  | No        | No        |
| scrub rate   | Defined   |           |           |           |
|              |           |           |           |           |
+--------------+-----------+-----------+-----------+-----------+
|              | Supported |           |           |           |
| Scrub        | on-demand | No        | No        | Supported |
| status/      | scrubbing |           |           |           |
| Completion   | only      |           |           |           |
+--------------+-----------+-----------+-----------+-----------+
| UC error     |           |CXL general|CXL general| ACPI UCE  |
| reporting    | Exception |media/DRAM |media/DRAM | notify and|
|              |           |event/media|event/media| query     |
|              |           |scan?      |scan?      | ARS status|
+--------------+-----------+-----------+-----------+-----------+
|              |           |           |           |           |
| Support for  | Supported | Supported | Supported | No        |
| EDAC control |           |           |           |           |
|              |           |           |           |           |
+--------------+-----------+-----------+-----------+-----------+


> +CXL Memory Scrubbing features
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +CXL spec r3.1 section 8.2.9.9.11.1 describes the memory device patrol scrub
> +control feature. The device patrol scrub proactively locates and
> +corrects errors on a regular cycle. The patrol scrub control allows
> +the requester to configure the patrol scrubber's input parameters.
> +
> +The patrol scrub control allows the requester to specify the number of
> +hours within which the patrol scrub cycle must be completed, provided
> +that the requested number is not less than the minimum number of hours
> +for the patrol scrub cycle that the device is capable of. In addition,
> +the patrol scrub controls allow the host to disable and enable the
> +feature, for cases where it must be turned off, such as
> +performance-aware operations that require background operations to be
> +disabled.
> +
> +Error Check Scrub (ECS)
> +~~~~~~~~~~~~~~~~~~~~~~~
> +CXL spec r3.1 section 8.2.9.9.11.2 describes Error Check Scrub (ECS),
> +a feature defined in the JEDEC DDR5 SDRAM Specification (JESD79-5).
> +ECS allows the DRAM to internally read data, correct single-bit errors,
> +and write the corrected data bits back to the DRAM array, while
> +providing transparency into error counts.

Please add a reference for CXL spec, like:

	CXL spec r3.1 [1]_ section 8.2.9.9.11.2 describes...

.. [1] https://computeexpresslink.org/wp-content/uploads/2024/02/CXL-3.1-Specification.pdf

Same for other specs mentioned in the docs. For more details, see:
	https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes

> +
> +A DDR5 device contains a number of memory media FRUs. The DDR5 ECS
> +feature, and thus the ECS control driver, supports configuring the
> +ECS parameters per FRU.
> +
> +ACPI RAS2 Hardware-based Memory Scrubbing
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +ACPI spec 6.5 section 5.2.21 describes the ACPI RAS2 table, which
> +provides interfaces for platform RAS features and supports independent
> +RAS controls and capabilities for a given RAS feature across multiple
> +instances of the same component in a given system.
> +Memory RAS features apply to RAS capabilities, controls and operations
> +that are specific to memory. RAS2 PCC sub-spaces for memory-specific RAS
> +features have a Feature Type of 0x00 (Memory).
> +
> +The platform can use the hardware-based memory scrubbing feature to expose
> +controls and capabilities associated with hardware-based memory scrub
> +engines. The RAS2 memory scrubbing feature supports the following, as
> +per the specification:
> +
> +* Independent memory scrubbing controls for each NUMA domain, identified using its proximity domain.
> +
> +* Provision for background (patrol) scrubbing of the entire memory system, as well as on-demand scrubbing for a specific region of memory.
> +
> +ACPI Address Range Scrubbing (ARS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +ACPI spec 6.5 section 9.19.7.2 describes Address Range Scrubbing (ARS).
> +ARS allows the platform to communicate memory errors to system software.
> +This capability allows system software to prevent accesses to addresses
> +with uncorrectable errors in memory. ARS functions manage all NVDIMMs
> +present in the system. Only one scrub can be in progress system wide
> +at any given time.
> +The following functions are supported, as per the specification:
> +
> +1. Query ARS Capabilities for a given address range; also indicates
> +whether the platform supports the ACPI NVDIMM Root Device Unconsumed
> +Error Notification.
> +
> +2. Start ARS triggers an Address Range Scrub for the given memory range.
> +Address scrubbing can be done for volatile memory, persistent memory, or both.
> +
> +3. Query ARS Status allows software to get the status of ARS,
> +including the progress of ARS and the ARS error record.
> +
> +4. Clear Uncorrectable Error.
> +
> +5. Translate SPA.
> +
> +6. ARS Error Inject, etc.
> +
> +The kernel already provides a control interface for ARS; ARS is
> +currently not supported in EDAC.
> +
> +The File System
> +---------------
> +
> +The control attributes of a registered scrubber instance can be
> +accessed under
> +
> +/sys/bus/edac/devices/<dev-name>/scrubX/
> +
> +sysfs
> +-----
> +
> +Sysfs files are documented in
> +
> +`Documentation/ABI/testing/sysfs-edac-scrub`.

Please remove the blank line, e. g.:

	Sysfs files are documented in
	`Documentation/ABI/testing/sysfs-edac-scrub`.

> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index f9cf19d8d13d..a162726cc6b9 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_EDAC)			:= edac_core.o
>  
>  edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
>  edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
> +edac_core-y	+= scrub.o
>  
>  edac_core-$(CONFIG_EDAC_DEBUG)		+= debugfs.o
>  
> diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
> index 9fce46dd7405..60b20eae01e8 100644
> --- a/drivers/edac/edac_device.c
> +++ b/drivers/edac/edac_device.c
> @@ -575,6 +575,7 @@ static void edac_dev_release(struct device *dev)
>  {
>  	struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev);
>  
> +	kfree(ctx->scrub);
>  	kfree(ctx->dev.groups);
>  	kfree(ctx);
>  }
> @@ -608,8 +609,10 @@ int edac_dev_register(struct device *parent, char *name,
>  		      const struct edac_dev_feature *ras_features)
>  {
>  	const struct attribute_group **ras_attr_groups;
> +	struct edac_dev_data *dev_data;
>  	struct edac_dev_feat_ctx *ctx;
>  	int attr_gcnt = 0;
> +	int scrub_cnt = 0;
>  	int ret, feat;
>  
>  	if (!parent || !name || !num_features || !ras_features)
> @@ -618,7 +621,10 @@ int edac_dev_register(struct device *parent, char *name,
>  	/* Double parse to make space for attributes */
>  	for (feat = 0; feat < num_features; feat++) {
>  		switch (ras_features[feat].ft_type) {
> -		/* Add feature specific code */
> +		case RAS_FEAT_SCRUB:
> +			attr_gcnt++;
> +			scrub_cnt++;
> +			break;
>  		default:
>  			return -EINVAL;
>  		}
> @@ -634,13 +640,38 @@ int edac_dev_register(struct device *parent, char *name,
>  		goto ctx_free;
>  	}
>  
> +	if (scrub_cnt) {
> +		ctx->scrub = kcalloc(scrub_cnt, sizeof(*ctx->scrub), GFP_KERNEL);
> +		if (!ctx->scrub) {
> +			ret = -ENOMEM;
> +			goto groups_free;
> +		}
> +	}
> +
>  	attr_gcnt = 0;
> +	scrub_cnt = 0;
>  	for (feat = 0; feat < num_features; feat++, ras_features++) {
>  		switch (ras_features->ft_type) {
> -		/* Add feature specific code */
> +		case RAS_FEAT_SCRUB:
> +			if (!ras_features->scrub_ops ||
> +			    scrub_cnt != ras_features->instance)
> +				goto data_mem_free;
> +
> +			dev_data = &ctx->scrub[scrub_cnt];
> +			dev_data->instance = scrub_cnt;
> +			dev_data->scrub_ops = ras_features->scrub_ops;
> +			dev_data->private = ras_features->ctx;
> +			ret = edac_scrub_get_desc(parent, &ras_attr_groups[attr_gcnt],
> +						  ras_features->instance);
> +			if (ret)
> +				goto data_mem_free;
> +
> +			scrub_cnt++;
> +			attr_gcnt++;
> +			break;
>  		default:
>  			ret = -EINVAL;
> -			goto groups_free;
> +			goto data_mem_free;
>  		}
>  	}
>  
> @@ -653,7 +684,7 @@ int edac_dev_register(struct device *parent, char *name,
>  
>  	ret = dev_set_name(&ctx->dev, name);
>  	if (ret)
> -		goto groups_free;
> +		goto data_mem_free;
>  
>  	ret = device_register(&ctx->dev);
>  	if (ret) {
> @@ -663,6 +694,8 @@ int edac_dev_register(struct device *parent, char *name,
>  
>  	return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev);
>  
> +data_mem_free:
> +	kfree(ctx->scrub);
>  groups_free:
>  	kfree(ras_attr_groups);
>  ctx_free:
> diff --git a/drivers/edac/scrub.c b/drivers/edac/scrub.c
> new file mode 100755
> index 000000000000..3978201c4bfc
> --- /dev/null
> +++ b/drivers/edac/scrub.c
> @@ -0,0 +1,209 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * The generic EDAC scrub driver controls the memory scrubbers in the
> + * system. The common sysfs scrub interface abstracts the control of
> + * various arbitrary scrubbing functionalities into a unified set of
> + * functions.
> + *
> + * Copyright (c) 2024 HiSilicon Limited.
> + */
> +
> +#include <linux/edac.h>
> +
> +enum edac_scrub_attributes {
> +	SCRUB_ADDRESS,
> +	SCRUB_SIZE,
> +	SCRUB_ENABLE_BACKGROUND,
> +	SCRUB_MIN_CYCLE_DURATION,
> +	SCRUB_MAX_CYCLE_DURATION,
> +	SCRUB_CUR_CYCLE_DURATION,
> +	SCRUB_MAX_ATTRS
> +};
> +
> +struct edac_scrub_dev_attr {
> +	struct device_attribute dev_attr;
> +	u8 instance;
> +};
> +
> +struct edac_scrub_context {
> +	char name[EDAC_FEAT_NAME_LEN];

Ok, here you're using EDAC_FEAT_NAME_LEN. Please move its definition from
patch 1 to this patch.

> +	struct edac_scrub_dev_attr scrub_dev_attr[SCRUB_MAX_ATTRS];
> +	struct attribute *scrub_attrs[SCRUB_MAX_ATTRS + 1];
> +	struct attribute_group group;
> +};
> +
> +#define TO_SCRUB_DEV_ATTR(_dev_attr)      \
> +		container_of(_dev_attr, struct edac_scrub_dev_attr, dev_attr)
> +
> +#define EDAC_SCRUB_ATTR_SHOW(attrib, cb, type, format)				\
> +static ssize_t attrib##_show(struct device *ras_feat_dev,			\
> +			     struct device_attribute *attr, char *buf)		\
> +{										\
> +	u8 inst = TO_SCRUB_DEV_ATTR(attr)->instance;				\
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
> +	const struct edac_scrub_ops *ops = ctx->scrub[inst].scrub_ops;		\
> +	type data;								\
> +	int ret;								\
> +										\
> +	ret = ops->cb(ras_feat_dev->parent, ctx->scrub[inst].private, &data);	\
> +	if (ret)								\
> +		return ret;							\
> +										\
> +	return sysfs_emit(buf, format, data);					\
> +}
> +
> +EDAC_SCRUB_ATTR_SHOW(addr, read_addr, u64, "0x%llx\n")
> +EDAC_SCRUB_ATTR_SHOW(size, read_size, u64, "0x%llx\n")
> +EDAC_SCRUB_ATTR_SHOW(enable_background, get_enabled_bg, bool, "%u\n")
> +EDAC_SCRUB_ATTR_SHOW(min_cycle_duration, get_min_cycle, u32, "%u\n")
> +EDAC_SCRUB_ATTR_SHOW(max_cycle_duration, get_max_cycle, u32, "%u\n")
> +EDAC_SCRUB_ATTR_SHOW(current_cycle_duration, get_cycle_duration, u32, "%u\n")
> +
> +#define EDAC_SCRUB_ATTR_STORE(attrib, cb, type, conv_func)			\
> +static ssize_t attrib##_store(struct device *ras_feat_dev,			\
> +			      struct device_attribute *attr,			\
> +			      const char *buf, size_t len)			\
> +{										\
> +	u8 inst = TO_SCRUB_DEV_ATTR(attr)->instance;				\
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
> +	const struct edac_scrub_ops *ops = ctx->scrub[inst].scrub_ops;		\
> +	type data;								\
> +	int ret;								\
> +										\
> +	ret = conv_func(buf, 0, &data);						\
> +	if (ret < 0)								\
> +		return ret;							\
> +										\
> +	ret = ops->cb(ras_feat_dev->parent, ctx->scrub[inst].private, data);	\
> +	if (ret)								\
> +		return ret;							\
> +										\
> +	return len;								\
> +}
> +
> +EDAC_SCRUB_ATTR_STORE(addr, write_addr, u64, kstrtou64)
> +EDAC_SCRUB_ATTR_STORE(size, write_size, u64, kstrtou64)
> +EDAC_SCRUB_ATTR_STORE(enable_background, set_enabled_bg, unsigned long, kstrtoul)
> +EDAC_SCRUB_ATTR_STORE(current_cycle_duration, set_cycle_duration, unsigned long, kstrtoul)
> +
> +static umode_t scrub_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
> +{
> +	struct device *ras_feat_dev = kobj_to_dev(kobj);
> +	struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr);
> +	u8 inst = TO_SCRUB_DEV_ATTR(dev_attr)->instance;
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
> +	const struct edac_scrub_ops *ops = ctx->scrub[inst].scrub_ops;
> +
> +	switch (attr_id) {
> +	case SCRUB_ADDRESS:
> +		if (ops->read_addr) {
> +			if (ops->write_addr)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case SCRUB_SIZE:
> +		if (ops->read_size) {
> +			if (ops->write_size)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case SCRUB_ENABLE_BACKGROUND:
> +		if (ops->get_enabled_bg) {
> +			if (ops->set_enabled_bg)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case SCRUB_MIN_CYCLE_DURATION:
> +		if (ops->get_min_cycle)
> +			return a->mode;
> +		break;
> +	case SCRUB_MAX_CYCLE_DURATION:
> +		if (ops->get_max_cycle)
> +			return a->mode;
> +		break;
> +	case SCRUB_CUR_CYCLE_DURATION:
> +		if (ops->get_cycle_duration) {
> +			if (ops->set_cycle_duration)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return 0;
> +}
> +
> +#define EDAC_SCRUB_ATTR_RO(_name, _instance)       \
> +	((struct edac_scrub_dev_attr) { .dev_attr = __ATTR_RO(_name), \
> +					.instance = _instance })
> +
> +#define EDAC_SCRUB_ATTR_WO(_name, _instance)       \
> +	((struct edac_scrub_dev_attr) { .dev_attr = __ATTR_WO(_name), \
> +					.instance = _instance })
> +
> +#define EDAC_SCRUB_ATTR_RW(_name, _instance)       \
> +	((struct edac_scrub_dev_attr) { .dev_attr = __ATTR_RW(_name), \
> +					.instance = _instance })
> +
> +static int scrub_create_desc(struct device *scrub_dev,
> +			     const struct attribute_group **attr_groups, u8 instance)
> +{
> +	struct edac_scrub_context *scrub_ctx;
> +	struct attribute_group *group;
> +	int i;
> +	struct edac_scrub_dev_attr dev_attr[] = {
> +		[SCRUB_ADDRESS] = EDAC_SCRUB_ATTR_RW(addr, instance),
> +		[SCRUB_SIZE] = EDAC_SCRUB_ATTR_RW(size, instance),
> +		[SCRUB_ENABLE_BACKGROUND] = EDAC_SCRUB_ATTR_RW(enable_background, instance),
> +		[SCRUB_MIN_CYCLE_DURATION] = EDAC_SCRUB_ATTR_RO(min_cycle_duration, instance),
> +		[SCRUB_MAX_CYCLE_DURATION] = EDAC_SCRUB_ATTR_RO(max_cycle_duration, instance),
> +		[SCRUB_CUR_CYCLE_DURATION] = EDAC_SCRUB_ATTR_RW(current_cycle_duration, instance)
> +	};
> +
> +	scrub_ctx = devm_kzalloc(scrub_dev, sizeof(*scrub_ctx), GFP_KERNEL);
> +	if (!scrub_ctx)
> +		return -ENOMEM;
> +
> +	group = &scrub_ctx->group;
> +	for (i = 0; i < SCRUB_MAX_ATTRS; i++) {
> +		memcpy(&scrub_ctx->scrub_dev_attr[i], &dev_attr[i], sizeof(dev_attr[i]));
> +		scrub_ctx->scrub_attrs[i] = &scrub_ctx->scrub_dev_attr[i].dev_attr.attr;
> +	}
> +	sprintf(scrub_ctx->name, "%s%d", "scrub", instance);
> +	group->name = scrub_ctx->name;
> +	group->attrs = scrub_ctx->scrub_attrs;
> +	group->is_visible  = scrub_attr_visible;
> +
> +	attr_groups[0] = group;
> +
> +	return 0;
> +}
> +
> +/**
> + * edac_scrub_get_desc - get EDAC scrub descriptors
> + * @scrub_dev: client device, with scrub support
> + * @attr_groups: pointer to attribute group container
> + * @instance: device's scrub instance number.
> + *
> + * Return:
> + *  * %0	- Success.
> + *  * %-EINVAL	- Invalid parameters passed.
> + *  * %-ENOMEM	- Dynamic memory allocation failed.
> + */
> +int edac_scrub_get_desc(struct device *scrub_dev,
> +			const struct attribute_group **attr_groups, u8 instance)
> +{
> +	if (!scrub_dev || !attr_groups)
> +		return -EINVAL;
> +
> +	return scrub_create_desc(scrub_dev, attr_groups, instance);
> +}
> diff --git a/include/linux/edac.h b/include/linux/edac.h
> index 521b17113d4d..ace8b10bb028 100644
> --- a/include/linux/edac.h
> +++ b/include/linux/edac.h
> @@ -666,11 +666,43 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
>  
>  /* RAS feature type */
>  enum edac_dev_feat {
> +	RAS_FEAT_SCRUB,
>  	RAS_FEAT_MAX
>  };
>  
> +/**
> + * struct edac_scrub_ops - scrub device operations (all elements optional)
> + * @read_addr: read base address of scrubbing range.
> + * @read_size: read offset of scrubbing range.
> + * @write_addr: set base address of the scrubbing range.
> + * @write_size: set offset of the scrubbing range.
> + * @get_enabled_bg: check if currently performing background scrub.
> + * @set_enabled_bg: start or stop a bg-scrub.
> + * @get_min_cycle: get minimum supported scrub cycle duration in seconds.
> + * @get_max_cycle: get maximum supported scrub cycle duration in seconds.
> + * @get_cycle_duration: get current scrub cycle duration in seconds.
> + * @set_cycle_duration: set current scrub cycle duration in seconds.
> + */
> +struct edac_scrub_ops {
> +	int (*read_addr)(struct device *dev, void *drv_data, u64 *base);
> +	int (*read_size)(struct device *dev, void *drv_data, u64 *size);
> +	int (*write_addr)(struct device *dev, void *drv_data, u64 base);
> +	int (*write_size)(struct device *dev, void *drv_data, u64 size);
> +	int (*get_enabled_bg)(struct device *dev, void *drv_data, bool *enable);
> +	int (*set_enabled_bg)(struct device *dev, void *drv_data, bool enable);
> +	int (*get_min_cycle)(struct device *dev, void *drv_data,  u32 *min);
> +	int (*get_max_cycle)(struct device *dev, void *drv_data,  u32 *max);
> +	int (*get_cycle_duration)(struct device *dev, void *drv_data, u32 *cycle);
> +	int (*set_cycle_duration)(struct device *dev, void *drv_data, u32 cycle);
> +};
> +
> +int edac_scrub_get_desc(struct device *scrub_dev,
> +			const struct attribute_group **attr_groups,
> +			u8 instance);
> +
>  /* EDAC device feature information structure */
>  struct edac_dev_data {
> +	const struct edac_scrub_ops *scrub_ops;
>  	u8 instance;
>  	void *private;
>  };
> @@ -678,11 +710,13 @@ struct edac_dev_data {
>  struct edac_dev_feat_ctx {
>  	struct device dev;
>  	void *private;
> +	struct edac_dev_data *scrub;
>  };
>  
>  struct edac_dev_feature {
>  	enum edac_dev_feat ft_type;
>  	u8 instance;
> +	const struct edac_scrub_ops *scrub_ops;
>  	void *ctx;
>  };
>  



Thanks,
Mauro

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 03/19] EDAC: Add ECS control feature
  2025-01-06 12:09 ` [PATCH v18 03/19] EDAC: Add ECS " shiju.jose
@ 2025-01-13 16:09   ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-13 16:09 UTC (permalink / raw)
  To: shiju.jose
  Cc: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel, bp,
	tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

Em Mon, 6 Jan 2025 12:09:59 +0000
<shiju.jose@huawei.com> escreveu:

> From: Shiju Jose <shiju.jose@huawei.com>
> 
> Add EDAC ECS (Error Check Scrub) control to manage a memory device's
> ECS feature.
> 
> The Error Check Scrub (ECS) is a feature defined in JEDEC DDR5 SDRAM
> Specification (JESD79-5) and allows the DRAM to internally read, correct
> single-bit errors, and write back corrected data bits to the DRAM array
> while providing transparency to error counts.
> 
> +A DDR5 device contains a number of memory media FRUs. The DDR5 ECS
> +feature, and thus the ECS control driver, supports configuring the
> +ECS parameters per FRU.
> 
> +Memory devices that support the ECS feature register with the EDAC
> +device driver, which retrieves the ECS descriptor from the EDAC ECS
> +driver.
> /sys/bus/edac/devices/<dev-name>/ecs_fruX/.
> 
> The common sysfs ECS control interface abstracts the control of an
> arbitrary ECS functionality to a common set of functions.
> 
> Support for the ECS feature is added separately because the control
> attributes of the DDR5 ECS feature differ from those of the scrub
> feature.
> 
> The sysfs ECS attribute nodes are only present if the client driver
> has implemented the corresponding attribute callback function and
> passed the necessary operations to the EDAC RAS feature driver during
> registration.
> 
> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>

Patch LGTM, although I intend to take a more careful look at this series,
checking it against the specs before sending my R-B.

> ---
>  Documentation/ABI/testing/sysfs-edac-ecs |  63 +++++++
>  Documentation/edac/scrub.rst             |   2 +
>  drivers/edac/Makefile                    |   2 +-
>  drivers/edac/ecs.c                       | 207 +++++++++++++++++++++++
>  drivers/edac/edac_device.c               |  17 ++
>  include/linux/edac.h                     |  41 ++++-
>  6 files changed, 329 insertions(+), 3 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-edac-ecs
>  create mode 100755 drivers/edac/ecs.c
> 
> diff --git a/Documentation/ABI/testing/sysfs-edac-ecs b/Documentation/ABI/testing/sysfs-edac-ecs
> new file mode 100644
> index 000000000000..1160bec0603f
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-edac-ecs
> @@ -0,0 +1,63 @@
> +What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		The sysfs EDAC bus devices /<dev-name>/ecs_fruX subdirectory
> +		pertains to the memory media ECS (Error Check Scrub) control
> +		feature, where <dev-name> directory corresponds to a device
> +		registered with the EDAC device driver for the ECS feature.
> +		Each /ecs_fruX directory corresponds to a memory media
> +		FRU (Field Replaceable Unit) under the memory device.
> +		The sysfs ECS attr nodes are only present if the parent
> +		driver has implemented the corresponding attr callback
> +		function and provided the necessary operations to the EDAC
> +		device driver during registration.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/log_entry_type
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) The log entry type of how the DDR5 ECS log is reported.
> +		0 - per DRAM.
> +		1 - per memory media FRU.
> +		All other values are reserved.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/mode
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) The mode in which the DDR5 ECS counts errors.
> +		The error count is tracked in one of two modes, selected
> +		by the DDR5 ECS Control feature: Codeword mode and Row
> +		Count mode. In Codeword mode, the error count increments
> +		each time a codeword with check bit errors is detected.
> +		In Row Count mode, the error counter increments each time
> +		a row with check bit errors is detected.
> +		0 - ECS counts rows in the memory media that have ECC errors.
> +		1 - ECS counts codewords with errors, specifically, it counts
> +		the number of ECC-detected errors in the memory media.
> +		All other values are reserved.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/reset
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(WO) Reset the ECS ECC counter.
> +		1 - reset ECC counter to the default value.
> +		All other values are reserved.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/threshold
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) DDR5 ECS threshold count per gigabit of memory cells.
> +		The ECS error count is subject to the ECS Threshold count
> +		per Gbit, which masks error counts less than the Threshold.
> +		Supported values are 256, 1024 and 4096.
> +		All other values are reserved.
> diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
> index 5a5108b744a4..5640f9aeee38 100644
> --- a/Documentation/edac/scrub.rst
> +++ b/Documentation/edac/scrub.rst
> @@ -242,3 +242,5 @@ sysfs
>  Sysfs files are documented in
>  
>  `Documentation/ABI/testing/sysfs-edac-scrub`.
> +
> +`Documentation/ABI/testing/sysfs-edac-ecs`.
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index a162726cc6b9..3a49304860f0 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC)			:= edac_core.o
>  
>  edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
>  edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
> -edac_core-y	+= scrub.o
> +edac_core-y	+= scrub.o ecs.o
>  
>  edac_core-$(CONFIG_EDAC_DEBUG)		+= debugfs.o
>  
> diff --git a/drivers/edac/ecs.c b/drivers/edac/ecs.c
> new file mode 100755
> index 000000000000..dae8e5ae881b
> --- /dev/null
> +++ b/drivers/edac/ecs.c
> @@ -0,0 +1,207 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * The generic ECS driver is designed to support control of on-die error
> + * check scrub (e.g., DDR5 ECS). The common sysfs ECS interface abstracts
> + * the control of various ECS functionalities into a unified set of functions.
> + *
> + * Copyright (c) 2024 HiSilicon Limited.
> + */
> +
> +#include <linux/edac.h>
> +
> +#define EDAC_ECS_FRU_NAME "ecs_fru"
> +
> +enum edac_ecs_attributes {
> +	ECS_LOG_ENTRY_TYPE,
> +	ECS_MODE,
> +	ECS_RESET,
> +	ECS_THRESHOLD,
> +	ECS_MAX_ATTRS
> +};
> +
> +struct edac_ecs_dev_attr {
> +	struct device_attribute dev_attr;
> +	int fru_id;
> +};
> +
> +struct edac_ecs_fru_context {
> +	char name[EDAC_FEAT_NAME_LEN];
> +	struct edac_ecs_dev_attr dev_attr[ECS_MAX_ATTRS];
> +	struct attribute *ecs_attrs[ECS_MAX_ATTRS + 1];
> +	struct attribute_group group;
> +};
> +
> +struct edac_ecs_context {
> +	u16 num_media_frus;
> +	struct edac_ecs_fru_context *fru_ctxs;
> +};
> +
> +#define TO_ECS_DEV_ATTR(_dev_attr)	\
> +	container_of(_dev_attr, struct edac_ecs_dev_attr, dev_attr)
> +
> +#define EDAC_ECS_ATTR_SHOW(attrib, cb, type, format)				\
> +static ssize_t attrib##_show(struct device *ras_feat_dev,			\
> +			     struct device_attribute *attr, char *buf)		\
> +{										\
> +	struct edac_ecs_dev_attr *dev_attr = TO_ECS_DEV_ATTR(attr);		\
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
> +	const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops;			\
> +	type data;								\
> +	int ret;								\
> +										\
> +	ret = ops->cb(ras_feat_dev->parent, ctx->ecs.private,			\
> +		      dev_attr->fru_id, &data);					\
> +	if (ret)								\
> +		return ret;							\
> +										\
> +	return sysfs_emit(buf, format, data);					\
> +}
> +
> +EDAC_ECS_ATTR_SHOW(log_entry_type, get_log_entry_type, u32, "%u\n")
> +EDAC_ECS_ATTR_SHOW(mode, get_mode, u32, "%u\n")
> +EDAC_ECS_ATTR_SHOW(threshold, get_threshold, u32, "%u\n")
> +
> +#define EDAC_ECS_ATTR_STORE(attrib, cb, type, conv_func)			\
> +static ssize_t attrib##_store(struct device *ras_feat_dev,			\
> +			      struct device_attribute *attr,			\
> +			      const char *buf, size_t len)			\
> +{										\
> +	struct edac_ecs_dev_attr *dev_attr = TO_ECS_DEV_ATTR(attr);		\
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
> +	const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops;			\
> +	type data;								\
> +	int ret;								\
> +										\
> +	ret = conv_func(buf, 0, &data);						\
> +	if (ret < 0)								\
> +		return ret;							\
> +										\
> +	ret = ops->cb(ras_feat_dev->parent, ctx->ecs.private,			\
> +		      dev_attr->fru_id, data);					\
> +	if (ret)								\
> +		return ret;							\
> +										\
> +	return len;								\
> +}
> +
> +EDAC_ECS_ATTR_STORE(log_entry_type, set_log_entry_type, unsigned long, kstrtoul)
> +EDAC_ECS_ATTR_STORE(mode, set_mode, unsigned long, kstrtoul)
> +EDAC_ECS_ATTR_STORE(reset, reset, unsigned long, kstrtoul)
> +EDAC_ECS_ATTR_STORE(threshold, set_threshold, unsigned long, kstrtoul)
> +
> +static umode_t ecs_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
> +{
> +	struct device *ras_feat_dev = kobj_to_dev(kobj);
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
> +	const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops;
> +
> +	switch (attr_id) {
> +	case ECS_LOG_ENTRY_TYPE:
> +		if (ops->get_log_entry_type)  {
> +			if (ops->set_log_entry_type)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case ECS_MODE:
> +		if (ops->get_mode) {
> +			if (ops->set_mode)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case ECS_RESET:
> +		if (ops->reset)
> +			return a->mode;
> +		break;
> +	case ECS_THRESHOLD:
> +		if (ops->get_threshold) {
> +			if (ops->set_threshold)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return 0;
> +}
> +
> +#define EDAC_ECS_ATTR_RO(_name, _fru_id)       \
> +	((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_RO(_name), \
> +				     .fru_id = _fru_id })
> +
> +#define EDAC_ECS_ATTR_WO(_name, _fru_id)       \
> +	((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_WO(_name), \
> +				     .fru_id = _fru_id })
> +
> +#define EDAC_ECS_ATTR_RW(_name, _fru_id)       \
> +	((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_RW(_name), \
> +				     .fru_id = _fru_id })
> +
> +static int ecs_create_desc(struct device *ecs_dev,
> +			   const struct attribute_group **attr_groups, u16 num_media_frus)
> +{
> +	struct edac_ecs_context *ecs_ctx;
> +	u32 fru;
> +
> +	ecs_ctx = devm_kzalloc(ecs_dev, sizeof(*ecs_ctx), GFP_KERNEL);
> +	if (!ecs_ctx)
> +		return -ENOMEM;
> +
> +	ecs_ctx->num_media_frus = num_media_frus;
> +	ecs_ctx->fru_ctxs = devm_kcalloc(ecs_dev, num_media_frus,
> +					 sizeof(*ecs_ctx->fru_ctxs),
> +					 GFP_KERNEL);
> +	if (!ecs_ctx->fru_ctxs)
> +		return -ENOMEM;
> +
> +	for (fru = 0; fru < num_media_frus; fru++) {
> +		struct edac_ecs_fru_context *fru_ctx = &ecs_ctx->fru_ctxs[fru];
> +		struct attribute_group *group = &fru_ctx->group;
> +		int i;
> +
> +		fru_ctx->dev_attr[ECS_LOG_ENTRY_TYPE] =
> +					EDAC_ECS_ATTR_RW(log_entry_type, fru);
> +		fru_ctx->dev_attr[ECS_MODE] = EDAC_ECS_ATTR_RW(mode, fru);
> +		fru_ctx->dev_attr[ECS_RESET] = EDAC_ECS_ATTR_WO(reset, fru);
> +		fru_ctx->dev_attr[ECS_THRESHOLD] =
> +					EDAC_ECS_ATTR_RW(threshold, fru);
> +
> +		for (i = 0; i < ECS_MAX_ATTRS; i++)
> +			fru_ctx->ecs_attrs[i] = &fru_ctx->dev_attr[i].dev_attr.attr;
> +
> +		snprintf(fru_ctx->name, sizeof(fru_ctx->name), "%s%d",
> +			 EDAC_ECS_FRU_NAME, fru);
> +		group->name = fru_ctx->name;
> +		group->attrs = fru_ctx->ecs_attrs;
> +		group->is_visible = ecs_attr_visible;
> +
> +		attr_groups[fru] = group;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * edac_ecs_get_desc - get EDAC ECS descriptors
> + * @ecs_dev: client device, supports ECS feature
> + * @attr_groups: pointer to attribute group container
> + * @num_media_frus: number of media FRUs in the device
> + *
> + * Return:
> + *  * %0	- Success.
> + *  * %-EINVAL	- Invalid parameters passed.
> + *  * %-ENOMEM	- Dynamic memory allocation failed.
> + */
> +int edac_ecs_get_desc(struct device *ecs_dev,
> +		      const struct attribute_group **attr_groups, u16 num_media_frus)
> +{
> +	if (!ecs_dev || !attr_groups || !num_media_frus)
> +		return -EINVAL;
> +
> +	return ecs_create_desc(ecs_dev, attr_groups, num_media_frus);
> +}
> diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
> index 60b20eae01e8..1c1142a2e4e4 100644
> --- a/drivers/edac/edac_device.c
> +++ b/drivers/edac/edac_device.c
> @@ -625,6 +625,9 @@ int edac_dev_register(struct device *parent, char *name,
>  			attr_gcnt++;
>  			scrub_cnt++;
>  			break;
> +		case RAS_FEAT_ECS:
> +			attr_gcnt += ras_features[feat].ecs_info.num_media_frus;
> +			break;
>  		default:
>  			return -EINVAL;
>  		}
> @@ -669,6 +672,20 @@ int edac_dev_register(struct device *parent, char *name,
>  			scrub_cnt++;
>  			attr_gcnt++;
>  			break;
> +		case RAS_FEAT_ECS:
> +			if (!ras_features->ecs_ops)
> +				goto data_mem_free;
> +
> +			dev_data = &ctx->ecs;
> +			dev_data->ecs_ops = ras_features->ecs_ops;
> +			dev_data->private = ras_features->ctx;
> +			ret = edac_ecs_get_desc(parent, &ras_attr_groups[attr_gcnt],
> +						ras_features->ecs_info.num_media_frus);
> +			if (ret)
> +				goto data_mem_free;
> +
> +			attr_gcnt += ras_features->ecs_info.num_media_frus;
> +			break;
>  		default:
>  			ret = -EINVAL;
>  			goto data_mem_free;
> diff --git a/include/linux/edac.h b/include/linux/edac.h
> index ace8b10bb028..979e91426701 100644
> --- a/include/linux/edac.h
> +++ b/include/linux/edac.h
> @@ -667,6 +667,7 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
>  /* RAS feature type */
>  enum edac_dev_feat {
>  	RAS_FEAT_SCRUB,
> +	RAS_FEAT_ECS,
>  	RAS_FEAT_MAX
>  };
>  
> @@ -700,9 +701,40 @@ int edac_scrub_get_desc(struct device *scrub_dev,
>  			const struct attribute_group **attr_groups,
>  			u8 instance);
>  
> +/**
> + * struct edac_ecs_ops - ECS device operations (all elements optional)
> + * @get_log_entry_type: read the log entry type value.
> + * @set_log_entry_type: set the log entry type value.
> + * @get_mode: read the mode value.
> + * @set_mode: set the mode value.
> + * @reset: reset the ECS counter.
> + * @get_threshold: read the threshold count per gigabits of memory cells.
> + * @set_threshold: set the threshold count per gigabits of memory cells.
> + */
> +struct edac_ecs_ops {
> +	int (*get_log_entry_type)(struct device *dev, void *drv_data, int fru_id, u32 *val);
> +	int (*set_log_entry_type)(struct device *dev, void *drv_data, int fru_id, u32 val);
> +	int (*get_mode)(struct device *dev, void *drv_data, int fru_id, u32 *val);
> +	int (*set_mode)(struct device *dev, void *drv_data, int fru_id, u32 val);
> +	int (*reset)(struct device *dev, void *drv_data, int fru_id, u32 val);
> +	int (*get_threshold)(struct device *dev, void *drv_data, int fru_id, u32 *threshold);
> +	int (*set_threshold)(struct device *dev, void *drv_data, int fru_id, u32 threshold);
> +};
> +
> +struct edac_ecs_ex_info {
> +	u16 num_media_frus;
> +};
> +
> +int edac_ecs_get_desc(struct device *ecs_dev,
> +		      const struct attribute_group **attr_groups,
> +		      u16 num_media_frus);
> +
>  /* EDAC device feature information structure */
>  struct edac_dev_data {
> -	const struct edac_scrub_ops *scrub_ops;
> +	union {
> +		const struct edac_scrub_ops *scrub_ops;
> +		const struct edac_ecs_ops *ecs_ops;
> +	};
>  	u8 instance;
>  	void *private;
>  };
> @@ -711,13 +743,18 @@ struct edac_dev_feat_ctx {
>  	struct device dev;
>  	void *private;
>  	struct edac_dev_data *scrub;
> +	struct edac_dev_data ecs;
>  };
>  
>  struct edac_dev_feature {
>  	enum edac_dev_feat ft_type;
>  	u8 instance;
> -	const struct edac_scrub_ops *scrub_ops;
> +	union {
> +		const struct edac_scrub_ops *scrub_ops;
> +		const struct edac_ecs_ops *ecs_ops;
> +	};
>  	void *ctx;
> +	struct edac_ecs_ex_info ecs_info;
>  };
>  
>  int edac_dev_register(struct device *parent, char *dev_name,



Thanks,
Mauro


* RE: [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers
  2025-01-13 14:46 ` [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers Mauro Carvalho Chehab
  2025-01-13 15:36   ` Jonathan Cameron
@ 2025-01-13 18:15   ` Shiju Jose
  1 sibling, 0 replies; 87+ messages in thread
From: Shiju Jose @ 2025-01-13 18:15 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, bp@alien8.de, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

>
>Em Mon, 6 Jan 2025 12:09:56 +0000
><shiju.jose@huawei.com> escreveu:
>
>> From: Shiju Jose <shiju.jose@huawei.com>
>>
>> Previously known as "ras: scrub: introduce subsystem + CXL/ACPI-RAS2 drivers".
>>
>> Augmenting EDAC for controlling RAS features
>> ============================================
>> This series proposes expanding EDAC to control RAS features and to
>> expose feature control attributes to userspace in sysfs.
>> Some Examples:
>>  - Scrub control
>>  - Error Check Scrub (ECS) control
>>  - ACPI RAS2 features
>>  - Post Package Repair (PPR) control
>>  - Memory Sparing Repair control etc.
>>
>> High level design is illustrated in the following diagram.
>>
>>          _______________________________________________
>>         |   Userspace - Rasdaemon                       |
>>         |  ____________                                 |
>>         | | RAS CXL    |       _______________          |
>>         | | Err Handler|----->|               |         |
>>         | |____________|      | RAS Dynamic   |         |
>>         |  ____________       | Scrub, Memory |         |
>>         | | RAS Memory |----->| Repair Control|         |
>>         | | Err Handler|      |_______________|         |
>>         | |____________|           |                    |
>>         |__________________________|____________________|
>>                                    |
>>                                    |
>>     _______________________________|______________________________
>>    |   Kernel EDAC based SubSystem | for RAS Features Control     |
>>    | ______________________________|____________________________  |
>>    || EDAC Core          Sysfs EDAC| Bus                        | |
>>    ||    __________________________|________ _    _____________ | |
>>    ||   |/sys/bus/edac/devices/<dev>/scrubX/ |   | EDAC Device || |
>>    ||   |/sys/bus/edac/devices/<dev>/ecsX/   |<->| EDAC MC     || |
>>    ||   |/sys/bus/edac/devices/<dev>/repairX |   | EDAC Sysfs  || |
>>    ||   |____________________________________|   |_____________|| |
>>    ||                               | EDAC Bus                  | |
>>    ||               Get             |       Get                 | |
>>    ||    __________ Features       |   Features __________    | |
>
>NIT: there is a misalignment here.
Hi Mauro,

Thanks for the comments.
Will fix.

>
>>    ||   |          |Descs  _________|______ Descs|          |   | |
>>    ||   |EDAC Scrub|<-----| EDAC Device    |     | EDAC Mem |   | |
>>    ||   |__________|      | Driver- RAS    |---->| Repair   |   | |
>>    ||    __________       | Feature Control|     |__________|   | |
>>    ||   |          |<-----|________________|                    | |
>>    ||   |EDAC ECS  |   Register RAS | Features                  | |
>>    ||   |__________|                |                           | |
>>    ||         ______________________|_________                  | |
>>    ||_________|_____________|________________|__________________| |
>>    |   _______|____    _____|_________   ____|_________           |
>>    |  |            |  | CXL Mem Driver| | Client Driver|          |
>>    |  | ACPI RAS2  |  | Sparing, PPR, | | Mem Repair   |          |
>>    |  | Driver     |  | Scrub, ECS    | | Features     |          |
>>    |  |____________|  |_______________| |______________|          |
>>    |        |              |              |                       |
>>    |________|______________|______________|_______________________|
>>             |              |              |
>>      _______|______________|______________|_______________________
>>     |     __|______________|_ ____________|____________ ____      |
>>     |    |                                                  |     |
>>     |    |            Platform HW and Firmware              |     |
>>     |    |__________________________________________________|     |
>>     |_____________________________________________________________|
>>
>> 1. EDAC RAS Features components - Create feature specific descriptors.
>>    for example, EDAC scrub, EDAC ECS, EDAC memory repair in the above
>>    diagram.
>> 2. EDAC device driver for controlling RAS Features - Gets a feature's
>>    attr descriptors from the EDAC RAS feature component, registers the
>>    device's RAS features with the EDAC bus, and exposes the feature's
>>    sysfs attributes under the sysfs EDAC bus.
>> 3. RAS dynamic scrub controller - Userspace sample module added for
>>    scrub control in rasdaemon, to issue scrubbing when an excess number
>>    of memory errors is reported in a short span of time.
>>
>> The added EDAC feature specific components (e.g. EDAC scrub, EDAC ECS,
>> EDAC memory repair etc.) make callbacks to the parent driver (e.g. CXL
>> driver, ACPI RAS driver etc.) for the controls, rather than just letting
>> the caller deal with it, for the following reasons.
>> 1. It enforces a common API across multiple implementations. One could
>>    do that via review instead, but that has generally not gone well in
>>    the long run for subsystems that have tried it (several have later
>>    moved to callback and feature list based approaches).
>> 2. Gives a path for 'intercepting' in the EDAC feature driver.
>>    An example for this is that we could intercept PPR repair calls
>>    and sanity check that the memory in question is offline before
>>    passing back to the underlying code.  Sure we could rely on doing
>>    that via some additional calls from the parent driver, but the
>>    ABI will get messier.
>> 3. (Speculative) we may get in kernel users of some features in the
>>    long run.
>>
>> More details of the common RAS features are described in the following
>> sections.
>>
>> Memory Scrubbing
>> ================
>> Increasing DRAM size and cost has made memory subsystem reliability an
>> important concern. These modules are used where potentially corrupted
>> data could cause expensive or fatal issues. Memory errors are one of
>> the top hardware failures that cause server and workload crashes.
>>
>> Memory scrub is a feature where an ECC engine reads data from each
>> memory media location, corrects with an ECC if necessary and writes
>> the corrected data back to the same memory media location.
>>
>> The memory DIMMs could be scrubbed at a configurable rate to detect
>> uncorrected memory errors and attempt to recover from detected memory
>> errors, providing the following benefits.
>> - Proactively scrubbing memory DIMMs reduces the chance of a correctable
>>   error becoming uncorrectable.
>> - Once detected, uncorrected errors caught in unallocated memory pages are
>>   isolated and prevented from being allocated to an application or the OS.
>> - The probability of software/hardware products encountering memory
>>   errors is reduced.
>> Some details of background can be found in Reference [5].
>>
>> There are 2 types of memory scrubbing:
>> 1. Background (patrol) scrubbing of the RAM whilst the RAM is otherwise
>>    idle.
>> 2. On-demand scrubbing for a specific address range/region of memory.
>>
>> There are several types of interfaces to HW memory scrubbers
>> identified, such as ACPI NVDIMM ARS (Address Range Scrub), CXL memory
>> device patrol scrub, CXL DDR5 ECS, and ACPI RAS2 memory scrubbing.
>>
>> The scrub control varies between different memory scrubbers. To allow
>> for standard userspace tooling there is a need to present these
>> controls with a standard ABI.
>>
>> Introduce generic memory EDAC scrub control which allows user to
>> control underlying scrubbers in the system via generic sysfs scrub
>> control interface. The common sysfs scrub control interface abstracts
>> the control of an arbitrary scrubbing functionality to a common set of
>> functions.
>>
>> Use case of common scrub control feature
>> ========================================
>> 1. There are several types of interfaces to HW memory scrubbers identified
>>    such as ACPI NVDIMM ARS(Address Range Scrub), CXL memory device patrol
>>    scrub, CXL DDR5 ECS, ACPI RAS2 memory scrubbing features and software
>>    based memory scrubber(discussed in the community Reference [5]).
>>    Also some scrubbers support controlling (background) patrol scrubbing
>>    (ACPI RAS2, CXL) and/or on-demand scrubbing(ACPI RAS2, ACPI ARS).
>>    However, the scrub controls vary between memory scrubbers. Thus there
>>    is a requirement for standard generic sysfs scrub controls exposed
>>    to userspace for the seamless control of the HW/SW scrubbers in
>>    the system by admins/scripts/tools etc.
>> 2. Scrub controls in user space allow the user to disable the scrubbing
>>    in case disabling of the background patrol scrubbing or changing the
>>    scrub rate are needed for other purposes such as performance-aware
>>    operations which requires the background operations to be turned off
>>    or reduced.
>> 3. Allows to perform on-demand scrubbing for specific address range if
>>    supported by the scrubber.
>> 4. User space tools can scrub the memory DIMMs regularly at a
>>    configurable scrub rate using the sysfs scrub controls discussed here,
>>    - to detect uncorrectable memory errors early, before the user
>>      accesses memory, which helps to recover from the detected errors.
>>    - to reduce the chance of a correctable error becoming uncorrectable.
>> 5. Policy control for hotplugged memory. There is not necessarily a system
>>    wide BIOS or similar in the loop to control the scrub settings on a CXL
>>    device that wasn't there at boot. What that setting should be is a policy
>>    decision as we are trading off reliability vs performance - hence it
>>    should be in control of userspace. As such, 'an' interface is needed.
>>    It seems more sensible to try and unify it with other similar
>>    interfaces than spin yet another one.
>>
>> A draft version of userspace code was added in rasdaemon for dynamic
>> scrub control, based on the frequency of memory errors reported to
>> userspace, and tested with the CXL device based patrol scrubbing and
>> ACPI RAS2 based scrubbing features.
>>
>> https://github.com/shijujose4/rasdaemon/tree/ras_feature_control
>>
>> ToDo: For memory repair features such as PPR and memory sparing,
>> rasdaemon collates records and decides to replace a row if there are
>> lots of corrected errors, or a single uncorrected error, or an error
>> record received with the maintenance request flag set, as in some CXL
>> event records.
>>
>> Comparison of scrubbing features
>> ================================
>>  ................................................................
>>  .              .   ACPI    . CXL patrol.  CXL ECS  .  ARS      .
>>  .  Name        .   RAS2    . scrub     .           .           .
>>  ................................................................
>>  .              .           .           .           .           .
>>  . On-demand    . Supported . No        . No        . Supported .
>>  . Scrubbing    .           .           .           .           .
>>  .              .           .           .           .           .
>>  ................................................................
>>  .              .           .           .           .           .
>>  . Background   . Supported . Supported . Supported . No        .
>>  . scrubbing    .           .           .           .           .
>>  .              .           .           .           .           .
>>  ................................................................
>>  .              .           .           .           .           .
>>  . Mode of      . Scrub ctrl. per device. per memory.  Unknown  .
>>  . scrubbing    . per NUMA  .           . media     .           .
>>  .              . domain.   .           .           .           .
>>  ................................................................
>>  .              .           .           .           .           .
>>  . Query scrub  . Supported . Supported . Supported . Supported .
>>  . capabilities .           .           .           .           .
>>  .              .           .           .           .           .
>>  ................................................................
>>  .              .           .           .           .           .
>>  . Setting      . Supported . No        . No        . Supported .
>>  . address range.           .           .           .           .
>>  .              .           .           .           .           .
>>  ................................................................
>>  .              .           .           .           .           .
>>  . Setting      . Supported . Supported . No        . No        .
>>  . scrub rate   .           .           .           .           .
>>  .              .           .           .           .           .
>>  ................................................................
>>  .              .           .           .           .           .
>>  . Unit for     . Not       . in hours  . No        . No        .
>>  . scrub rate   . Defined   .           .           .           .
>>  .              .           .           .           .           .
>>  ................................................................
>>  .              . Supported .           .           .           .
>>  . Scrub        . on-demand . No        . No        . Supported .
>>  . status/      . scrubbing .           .           .           .
>>  . Completion   . only      .           .           .           .
>>  ................................................................
>>  . UC error     .           .CXL general.CXL general. ACPI UCE  .
>>  . reporting    . Exception .media/DRAM .media/DRAM . notify and.
>>  .              .           .event/media.event/media. query     .
>>  .              .           .scan?      .scan?      . ARS status.
>>  ................................................................
>>  .              .           .           .           .           .
>>  . Clear UC     .  No       . No        .  No       . Supported .
>>  . error        .           .           .           .           .
>>  .              .           .           .           .           .
>>  ................................................................
>>  .              .           .           .           .           .
>>  . Translate    . No        . No        . No        . Supported .
>>  . *(1)SPA to   .           .           .           .           .
>>  . *(2)DPA      .           .           .           .           .
>>  ................................................................
>>
>> *(1) - SPA - System Physical Address. See section 9.19.7.8
>>        Function Index 5 - Translate SPA of ACPI spec r6.5.
>> *(2) - DPA - Device Physical Address. See section 9.19.7.8
>>        Function Index 5 - Translate SPA of ACPI spec r6.5.
>
>NIT: this table contains terms that are defined only at the text below. The text
>describing, for instance, ARS, needs to come before the table. 
Sure. Can be done.

>IMO, it needs to
>contain ReST links to the texts defining what each line/row contains (see below
>about ReST).
Not sure about ReST links.
>
>>
>> CXL Memory Scrubbing features
>> =============================
>> CXL spec r3.1 section 8.2.9.9.11.1 describes the memory device patrol
>> scrub control feature. The device patrol scrub proactively locates and
>> corrects errors in a regular cycle. The patrol scrub control allows the
>> requester to configure the patrol scrubber's input configurations.
>>
>> The patrol scrub control allows the requester to specify the number of
>> hours in which the patrol scrub cycles must be completed, provided
>> that the requested number is not less than the minimum number of hours
>> for the patrol scrub cycle that the device is capable of. In addition,
>> the patrol scrub controls allow the host to disable and enable the
>> feature in case disabling of the feature is needed for other purposes
>> such as performance-aware operations which require the background
>> operations to be turned off.
>>
>> The Error Check Scrub (ECS) is a feature defined in JEDEC DDR5 SDRAM
>> Specification (JESD79-5) and allows the DRAM to internally read,
>> correct single-bit errors, and write back corrected data bits to the
>> DRAM array while providing transparency to error counts.
>>
>> A DDR5 device contains a number of memory media FRUs. The DDR5 ECS
>> feature, and thus the ECS control driver, supports configuring the ECS
>> parameters per FRU.
>>
>> ACPI RAS2 Hardware-based Memory Scrubbing
>> =========================================
>> ACPI spec r6.5 section 5.2.21 describes the ACPI RAS2 table, which
>> provides interfaces for platform RAS features and supports independent
>> RAS controls and capabilities for a given RAS feature for multiple
>> instances of the same component in a given system.
>> Memory RAS features apply to RAS capabilities, controls and operations
>> that are specific to memory. RAS2 PCC sub-spaces for memory-specific
>> RAS features have a Feature Type of 0x00 (Memory).
>>
>> The platform can use the hardware-based memory scrubbing feature to
>> expose controls and capabilities associated with hardware-based memory
>> scrub engines. The RAS2 memory scrubbing feature supports the following,
>> as per the spec:
>>  - Independent memory scrubbing controls for each NUMA domain, identified
>>    using its proximity domain.
>>    Note: However, AmpereComputing has a single entry repeated, as they
>>          have centralized controls.
>>  - Provision for background (patrol) scrubbing of the entire memory system,
>>    as well as on-demand scrubbing for a specific region of memory.
>>
>> ACPI Address Range Scrubbing (ARS)
>> ==================================
>> ARS allows the platform to communicate memory errors to system software.
>> This capability allows system software to prevent accesses to
>> addresses with uncorrectable errors in memory. ARS functions manage
>> all NVDIMMs present in the system. Only one scrub can be in progress
>> system wide at any given time.
>> The following functions are supported as per the specification.
>> 1. Query ARS Capabilities for a given address range, indicates platform
>>    supports the ACPI NVDIMM Root Device Unconsumed Error Notification.
>> 2. Start ARS triggers an Address Range Scrub for the given memory range.
>>    Address scrubbing can be done for volatile memory, persistent memory,
>>    or both.
>> 3. Query ARS Status command allows software to get the status of ARS,
>>    including the progress of ARS and ARS error record.
>> 4. Clear Uncorrectable Error.
>> 5. Translate SPA
>> 6. ARS Error Inject etc.
>> Note: Support for ARS is not added in this series, to reduce the amount
>> of code for review; it could be added after the initial code is merged.
>> We'd like feedback on whether this is of interest to the ARS community.
>>
>> Post Package Repair(PPR)
>> ========================
>> PPR (Post Package Repair) maintenance operation requests the memory
>> device to perform a repair operation on its media if supported. A
>> memory device may support two types of PPR: Hard PPR (hPPR), for a
>> permanent row repair, and Soft PPR (sPPR), for a temporary row repair.
>> sPPR is much faster than hPPR, but the repair is lost with a power
>> cycle. During the execution of a PPR maintenance operation, a memory
>> device may or may not retain data and may or may not be able to
>> process memory requests correctly. An sPPR maintenance operation may be
>> executed at runtime, if data is retained and memory requests are
>> correctly processed. An hPPR maintenance operation may be executed only
>> at boot, because data would not be retained.
>>
>> Use cases of common PPR control feature
>> =======================================
>> 1. Soft PPR (sPPR) and Hard PPR (hPPR) share similar control
>> interfaces; thus there is a requirement for standard generic sysfs
>> PPR controls exposed to userspace for the seamless control of the PPR
>> features in the system by the admin/scripts/tools etc.
>> 2. When a CXL device identifies a failure on a memory component, the
>> device may inform the host about the need for a PPR maintenance
>> operation by using an event record, where the maintenance needed flag
>> is set. The event record specifies the DPA that should be repaired.
>> Kernel reports the corresponding CXL general media or DRAM trace event
>> to userspace. A userspace tool, e.g. rasdaemon, initiates a PPR
>> maintenance operation in response to a device request using the sysfs
>> PPR control.
>> 3. User space tools, e.g. rasdaemon, request PPR on a memory
>> region when an uncorrected memory error or excess corrected memory
>> errors are reported on that memory.
>> 4. Likely multiple instances of PPR present per memory device.
>>
>> Memory Sparing
>> ==============
>> Memory sparing is defined as a repair function that replaces a portion
>> of memory with a portion of functional memory at that same DPA. User
>> space tool, e.g. rasdaemon, may request the sparing operation for a
>> given address for which the uncorrectable error is reported. In CXL,
>> (CXL spec 3.1 section 8.2.9.7.1.4) subclasses for sparing operation
>> vary in terms of the scope of the sparing being performed. The
>> cacheline sparing subclass refers to a sparing action that can replace a full
>> cacheline.
>> Row sparing is provided as an alternative to PPR sparing functions and
>> its scope is that of a single DDR row. Bank sparing allows an entire
>> bank to be replaced. Rank sparing is defined as an operation in which
>> an entire DDR rank is replaced.
>>
>> Series adds,
>> 1. EDAC device driver extended for controlling RAS features; EDAC scrub
>>    driver, EDAC ECS driver and EDAC memory repair driver support memory
>>    scrub control, ECS control and memory repair (PPR, sparing) control
>>    respectively.
>> 2. Several common patches from Dave's cxl/fwctl series.
>> 3. Support for CXL feature mailbox commands, which is used by CXL device
>>    scrubbing and memory repair features.
>> 4. CXL features driver supporting patrol scrub control (device and
>>    region based).
>>
>> 5. CXL features driver supporting ECS control feature.
>> 6. ACPI RAS2 driver adds OS interface for RAS2 communication through
>>    PCC mailbox, extracts the ACPI RAS2 feature table (RAS2) and
>>    creates a platform device for the RAS memory features, which binds
>>    to the memory ACPI RAS2 driver.
>> 7. Memory ACPI RAS2 driver gets the PCC subspace for communicating
>>    with an ACPI-compliant platform that supports ACPI RAS2. It adds
>>    callback functions and registers with the EDAC device driver to
>>    allow users to control the HW patrol scrubbers exposed to the
>>    kernel via the ACPI RAS2 table.
>> 8. Support for CXL maintenance mailbox command, which is used by
>>    CXL device memory repair feature.
>> 9. CXL features driver supporting PPR control feature.
>> 10. CXL features driver supporting memory sparing control feature.
>>     Note: There are other PPR, memory sparing drivers to come.
>
>The text above should be inside Documentation, and not on patch 0.
The descriptions for the EDAC device features control and for each feature
were added to the Documentation under the corresponding patches.
 
>
>A big description like that makes it hard to review this series. It is also easier to
>review the text after having it parsed by the kernel doc build, especially for summary
>tables like the "Comparison of scrubbing features", which deserves ReST links
>processed by Sphinx to the corresponding definitions of the terms that are
>compared there.
Same as above.
>
>> Open Questions based on feedbacks from the community:
>> 1. Leo: Standardize the unit for scrub rate; for example, ACPI RAS2 does not
>>    define a unit for the scrub rate. RAS2 clarification needed.
>
>I noticed the same when reviewing a patch series for rasdaemon. Ideally, ACPI
>requires an erratum defining what units are expected for scrub rate.
>
>While ACPI doesn't define it, better to not add support for it - or be conservative
>using a low granularity for it (like using minutes instead of hours).
>
Jonathan already replied.
>> 2. Jonathan:
 [...]
>
>Thanks,
>Mauro

Thanks,
Shiju

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 01/19] EDAC: Add support for EDAC device features control
  2025-01-13 15:06   ` Mauro Carvalho Chehab
@ 2025-01-14  9:55     ` Jonathan Cameron
  2025-01-14 10:08     ` Shiju Jose
  1 sibling, 0 replies; 87+ messages in thread
From: Jonathan Cameron @ 2025-01-14  9:55 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: shiju.jose, linux-edac, linux-cxl, linux-acpi, linux-mm,
	linux-kernel, bp, tony.luck, rafael, lenb, mchehab,
	dan.j.williams, dave, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, david, Vilas.Sridharan, leo.duran,
	Yazen.Ghannam, rientjes, jiaqiyan, Jon.Grimm, dave.hansen,
	naoya.horiguchi, james.morse, jthoughton, somasundaram.a,
	erdemaktas, pgonda, duenwen, gthelen, wschwartz, dferguson, wbs,
	nifan.cxl, tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm


> > +int edac_dev_register(struct device *parent, char *name,
> > +		      void *private, int num_features,
> > +		      const struct edac_dev_feature *ras_features)
> > +{
> > +	const struct attribute_group **ras_attr_groups;
> > +	struct edac_dev_feat_ctx *ctx;
> > +	int attr_gcnt = 0;
> > +	int ret, feat;
> > +
> > +	if (!parent || !name || !num_features || !ras_features)
> > +		return -EINVAL;
> > +
> > +	/* Double parse to make space for attributes */
> > +	for (feat = 0; feat < num_features; feat++) {
> > +		switch (ras_features[feat].ft_type) {
> > +		/* Add feature specific code */
> > +		default:
> > +			return -EINVAL;
> > +		}
> > +	}
> > +
> > +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> > +	if (!ctx)
> > +		return -ENOMEM;
> > +
> > +	ras_attr_groups = kcalloc(attr_gcnt + 1, sizeof(*ras_attr_groups), GFP_KERNEL);
> > +	if (!ras_attr_groups) {
> > +		ret = -ENOMEM;
> > +		goto ctx_free;
> > +	}
> > +
> > +	attr_gcnt = 0;
> > +	for (feat = 0; feat < num_features; feat++, ras_features++) {
> > +		switch (ras_features->ft_type) {
> > +		/* Add feature specific code */
> > +		default:
> > +			ret = -EINVAL;
> > +			goto groups_free;
> > +		}
> > +	}
> > +
> > +	ctx->dev.parent = parent;
> > +	ctx->dev.bus = edac_get_sysfs_subsys();
> > +	ctx->dev.type = &edac_dev_type;
> > +	ctx->dev.groups = ras_attr_groups;
> > +	ctx->private = private;
> > +	dev_set_drvdata(&ctx->dev, ctx);
> > +
> > +	ret = dev_set_name(&ctx->dev, name);
> > +	if (ret)
> > +		goto groups_free;
> > +
> > +	ret = device_register(&ctx->dev);
> > +	if (ret) {
> > +		put_device(&ctx->dev);  
> 
> > +		return ret;  
> 
> As register failed, you need to change it to a goto groups_free,
> as edac_dev_release() won't be called.

Boris called this one out as well, so it seems it is not that well understood.
I've also tripped over this in the past and it's one of the most common
things I catch in reviews of code calling this stuff.

As discussed offline, it will be called. The device_register() docs
make it clear that, whether or not that call succeeds, reference counting
is enabled and put_device() is the correct way to free resources.

The actual behaviour depends on the fact that device_register() is just a helper
defined as

device_initialize();
return device_add();

So for reasons lost to history (I guess there are cases where other cleanup
needs to happen before the release) it does not handle side effects
of device_initialize() on an error in device_add().  

device_initialize() has called
-> kobject_init(&dev->kobj, &device_ktype);
 -> kobject_init_internal(kobj), which calls kref_init() and sets the ktype
    (which has the release callback)

kref_init() sets the reference counter to 1.

Hence when we do a put_device() in the error path, the reference counter drops
to 0 and the release from the ktype is called. Here that is edac_dev_release().

If you want to verify, replace device_register() with device_initialize() and
then call put_device().

If we were going back in history, I'd suggest device_register() should be side
effect free and call put_device() on error, and any driver that needs to handle
other stuff before the release should just not use it. I guess that ship has
long since sailed and maybe there are other reasons I've not thought of.

I took a quick look and it seems to go back to at least the 2.5 era.

Jonathan




^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v18 01/19] EDAC: Add support for EDAC device features control
  2025-01-13 15:06   ` Mauro Carvalho Chehab
  2025-01-14  9:55     ` Jonathan Cameron
@ 2025-01-14 10:08     ` Shiju Jose
  2025-01-14 11:33       ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 87+ messages in thread
From: Shiju Jose @ 2025-01-14 10:08 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, bp@alien8.de, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

>-----Original Message-----
>From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
>Sent: 13 January 2025 15:06
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org; linux-
>acpi@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org;
>bp@alien8.de; tony.luck@intel.com; rafael@kernel.org; lenb@kernel.org;
>mchehab@kernel.org; dan.j.williams@intel.com; dave@stgolabs.net; Jonathan
>Cameron <jonathan.cameron@huawei.com>; dave.jiang@intel.com;
>alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
>david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com;
>Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com;
>Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
>naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com;
>somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com;
>duenwen@google.com; gthelen@google.com;
>wschwartz@amperecomputing.com; dferguson@amperecomputing.com;
>wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei
><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto
>Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com;
>wanghuiqiang <wanghuiqiang@huawei.com>; Linuxarm
><linuxarm@huawei.com>
>Subject: Re: [PATCH v18 01/19] EDAC: Add support for EDAC device features
>control
>
>Em Mon, 6 Jan 2025 12:09:57 +0000
><shiju.jose@huawei.com> escreveu:
>
>> From: Shiju Jose <shiju.jose@huawei.com>
>>
>> Add generic EDAC device feature controls supporting the registration
>> of RAS features available in the system. The driver exposes control
>> attributes for these features to userspace in
>> /sys/bus/edac/devices/<dev-name>/<ras-feature>/
>>
>> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
>> ---
>>  Documentation/edac/features.rst |  94 ++++++++++++++++++++++++++++++
>>  Documentation/edac/index.rst    |  10 ++++
>>  drivers/edac/edac_device.c      | 100 ++++++++++++++++++++++++++++++++
>>  include/linux/edac.h            |  28 +++++++++
>>  4 files changed, 232 insertions(+)
>>  create mode 100644 Documentation/edac/features.rst
>>  create mode 100644 Documentation/edac/index.rst
>>
>> diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
>> new file mode 100644
>> index 000000000000..f32f259ce04d
>> --- /dev/null
>> +++ b/Documentation/edac/features.rst
>> @@ -0,0 +1,94 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>
>SPDX should match what's written there, e. g.
>
>	.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later
>
>Please notice that GNU FDL family contains both open source and non-open
>source licenses. The open-source one is this:
>
>	https://spdx.org/licenses/GFDL-1.2-no-invariants-or-later.html
>
>E.g. it is a license that permits changing the entire document in the future, as
>there are no invariant parts in it.
This seems not widely used; I have seen it used in only a few documents.

>
>> +
>> +============================================
>> +Augmenting EDAC for controlling RAS features
>> +============================================
>> +
>> +Copyright (c) 2024 HiSilicon Limited.
>
>2024-2025?
Will do.  
>
>> +
>> +:Author:   Shiju Jose <shiju.jose@huawei.com>
>> +:License:  The GNU Free Documentation License, Version 1.2
>> +          (dual licensed under the GPL v2)
>
>You need to define if invariant parts are allowed or not, e. g.:
>
>	:License: The GNU Free Documentation License, Version 1.2 without
>Invariant Sections, Front-Cover Texts nor Back-Cover Texts.
>		  (dual licensed under the GPL v2)
Same as above.
>
>
>> +:Original Reviewers:
>> +
>> +- Written for: 6.14
>> +
>> +Introduction
>> +------------
>> +The expansion of EDAC for controlling RAS features and exposing
>> +features control attributes to userspace via sysfs. Some Examples:
>> +
>> +* Scrub control
>> +
>> +* Error Check Scrub (ECS) control
>> +
>> +* ACPI RAS2 features
>> +
>> +* Post Package Repair (PPR) control
>> +
>> +* Memory Sparing Repair control etc.
>> +
>> +High level design is illustrated in the following diagram::
>> +
>> +         _______________________________________________
>> +        |   Userspace - Rasdaemon                       |
>> +        |  _____________                                |
>> +        | | RAS CXL mem |      _______________          |
>> +        | |error handler|---->|               |         |
>> +        | |_____________|     | RAS dynamic   |         |
>> +        |  _____________      | scrub, memory |         |
>> +        | | RAS memory  |---->| repair control|         |
>> +        | |error handler|     |_______________|         |
>> +        | |_____________|          |                    |
>> +        |__________________________|____________________|
>> +                                   |
>> +                                   |
>> +    _______________________________|______________________________
>> +   |     Kernel EDAC extension for | controlling RAS Features     |
>> +   | ______________________________|____________________________|
>> +   || EDAC Core          Sysfs EDAC| Bus                        | |
>> +   ||    __________________________|_________     _____________ | |
>> +   ||   |/sys/bus/edac/devices/<dev>/scrubX/ |   | EDAC device || |
>> +   ||   |/sys/bus/edac/devices/<dev>/ecsX/   |<->| EDAC MC     || |
>> +   ||   |/sys/bus/edac/devices/<dev>/repairX |   | EDAC sysfs  || |
>> +   ||   |____________________________________|   |_____________|| |
>> +   ||                           EDAC|Bus                        | |
>> +   ||                               |                           | |
>> +   ||    __________ Get feature     |      Get feature          | |
>> +   ||   |          |desc   _________|______ desc  __________    | |
>> +   ||   |EDAC scrub|<-----| EDAC device    |     |          |   | |
>> +   ||   |__________|      | driver- RAS    |---->| EDAC mem |   | |
>> +   ||    __________       | feature control|     | repair   |   | |
>> +   ||   |          |<-----|________________|     |__________|   | |
>> +   ||   |EDAC ECS  |    Register RAS|features                   | |
>> +   ||   |__________|                |                           | |
>> +   ||         ______________________|_____________              | |
>> +   ||_________|_______________|__________________|______________||
>> +   |   _______|____    _______|_______       ____|__________      |
>> +   |  |            |  | CXL mem driver|     | Client driver |     |
>> +   |  | ACPI RAS2  |  | scrub, ECS,   |     | memory repair |     |
>> +   |  | driver     |  | sparing, PPR  |     | features      |     |
>> +   |  |____________|  |_______________|     |_______________|     |
>> +   |        |                 |                    |              |
>> +   |________|_________________|____________________|______________|
>> +            |                 |                    |
>> +    ________|_________________|____________________|______________
>> +   |     ___|_________________|____________________|_______       |
>> +   |    |                                                  |      |
>> +   |    |            Platform HW and Firmware              |      |
>> +   |    |__________________________________________________|      |
>> +   |______________________________________________________________|
>> +
>> +
>> +1. EDAC Features components - Create feature specific descriptors.
>> +For example, EDAC scrub, EDAC ECS, EDAC memory repair in the above
>> +diagram.
>> +
>> +2. EDAC device driver for controlling RAS Features - Get feature's
>> +attribute descriptors from EDAC RAS feature component and registers
>> +device's RAS features with EDAC bus and exposes the features control
>> +attributes via the sysfs EDAC bus. For example,
>> +/sys/bus/edac/devices/<dev-name>/<feature>X/
>> +
>> +3. RAS dynamic feature controller - Userspace sample modules in
>> +rasdaemon for dynamic scrub/repair control to issue scrubbing/repair
>> +when an excess number of corrected memory errors is reported in a short
>> +span of time.
>> diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
>> new file mode 100644
>> index 000000000000..b6c265a4cffb
>> --- /dev/null
>> +++ b/Documentation/edac/index.rst
>> @@ -0,0 +1,10 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +==============
>> +EDAC Subsystem
>> +==============
>> +
>> +.. toctree::
>> +   :maxdepth: 1
>> +
>> +   features
>> diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
>> index 621dc2a5d034..9fce46dd7405 100644
>> --- a/drivers/edac/edac_device.c
>> +++ b/drivers/edac/edac_device.c
>> @@ -570,3 +570,103 @@ void edac_device_handle_ue_count(struct
>edac_device_ctl_info *edac_dev,
>>  		      block ? block->name : "N/A", count, msg);  }
>> EXPORT_SYMBOL_GPL(edac_device_handle_ue_count);
>> +
>> +static void edac_dev_release(struct device *dev)
>> +{
>> +	struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev);
>> +
>> +	kfree(ctx->dev.groups);
>> +	kfree(ctx);
>> +}
>> +
>> +const struct device_type edac_dev_type = {
>> +	.name = "edac_dev",
>> +	.release = edac_dev_release,
>> +};
>> +
>> +static void edac_dev_unreg(void *data)
>> +{
>> +	device_unregister(data);
>> +}
>> +
>> +/**
>> + * edac_dev_register - register device for RAS features with EDAC
>> + * @parent: parent device.
>> + * @name: parent device's name.
>> + * @private: parent driver's data to store in the context if any.
>> + * @num_features: number of RAS features to register.
>> + * @ras_features: list of RAS features to register.
>> + *
>> + * Return:
>> + *  * %0       - Success.
>> + *  * %-EINVAL - Invalid parameters passed.
>> + *  * %-ENOMEM - Dynamic memory allocation failed.
>> + *
>> + */
>> +int edac_dev_register(struct device *parent, char *name,
>> +		      void *private, int num_features,
>> +		      const struct edac_dev_feature *ras_features)
>> +{
>> +	const struct attribute_group **ras_attr_groups;
>> +	struct edac_dev_feat_ctx *ctx;
>> +	int attr_gcnt = 0;
>> +	int ret, feat;
>> +
>> +	if (!parent || !name || !num_features || !ras_features)
>> +		return -EINVAL;
>> +
>> +	/* Double parse to make space for attributes */
>> +	for (feat = 0; feat < num_features; feat++) {
>> +		switch (ras_features[feat].ft_type) {
>> +		/* Add feature specific code */
>> +		default:
>> +			return -EINVAL;
>> +		}
>> +	}
>> +
>> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>> +	if (!ctx)
>> +		return -ENOMEM;
>> +
>> +	ras_attr_groups = kcalloc(attr_gcnt + 1, sizeof(*ras_attr_groups), GFP_KERNEL);
>> +	if (!ras_attr_groups) {
>> +		ret = -ENOMEM;
>> +		goto ctx_free;
>> +	}
>> +
>> +	attr_gcnt = 0;
>> +	for (feat = 0; feat < num_features; feat++, ras_features++) {
>> +		switch (ras_features->ft_type) {
>> +		/* Add feature specific code */
>> +		default:
>> +			ret = -EINVAL;
>> +			goto groups_free;
>> +		}
>> +	}
>> +
>> +	ctx->dev.parent = parent;
>> +	ctx->dev.bus = edac_get_sysfs_subsys();
>> +	ctx->dev.type = &edac_dev_type;
>> +	ctx->dev.groups = ras_attr_groups;
>> +	ctx->private = private;
>> +	dev_set_drvdata(&ctx->dev, ctx);
>> +
>> +	ret = dev_set_name(&ctx->dev, name);
>> +	if (ret)
>> +		goto groups_free;
>> +
>> +	ret = device_register(&ctx->dev);
>> +	if (ret) {
>> +		put_device(&ctx->dev);
>
>> +		return ret;
>
>As register failed, you need to change it to a goto groups_free, as
>edac_dev_release() won't be called.
As per experimentation, edac_dev_release() will be called.

>
>> +	}
>> +
>> +	return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev);
>> +
>> +groups_free:
>> +	kfree(ras_attr_groups);
>> +ctx_free:
>> +	kfree(ctx);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(edac_dev_register);
>> diff --git a/include/linux/edac.h b/include/linux/edac.h
>> index b4ee8961e623..521b17113d4d 100644
>> --- a/include/linux/edac.h
>> +++ b/include/linux/edac.h
>> @@ -661,4 +661,32 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
>>
>>  	return mci->dimms[index];
>>  }
>> +
>> +#define EDAC_FEAT_NAME_LEN	128
>
>This macro was not used on this patch.
Sure.
>
>> +
>> +/* RAS feature type */
>> +enum edac_dev_feat {
>> +	RAS_FEAT_MAX
>> +};
>> +
>> +/* EDAC device feature information structure */
>> +struct edac_dev_data {
>> +	u8 instance;
>> +	void *private;
>> +};
>> +
>> +struct edac_dev_feat_ctx {
>> +	struct device dev;
>> +	void *private;
>> +};
>> +
>> +struct edac_dev_feature {
>> +	enum edac_dev_feat ft_type;
>> +	u8 instance;
>> +	void *ctx;
>> +};
>> +
>> +int edac_dev_register(struct device *parent, char *dev_name,
>> +		      void *parent_pvt_data, int num_features,
>> +		      const struct edac_dev_feature *ras_features);
>>  #endif /* _LINUX_EDAC_H_ */
>
>Thanks,
>Mauro

Thanks,
Shiju

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 01/19] EDAC: Add support for EDAC device features control
  2025-01-14 10:08     ` Shiju Jose
@ 2025-01-14 11:33       ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-14 11:33 UTC (permalink / raw)
  To: Shiju Jose
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, bp@alien8.de, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

Em Tue, 14 Jan 2025 10:08:42 +0000
Shiju Jose <shiju.jose@huawei.com> escreveu:

> >> diff --git a/Documentation/edac/features.rst
> >> b/Documentation/edac/features.rst new file mode 100644 index
> >> 000000000000..f32f259ce04d
> >> --- /dev/null
> >> +++ b/Documentation/edac/features.rst
> >> @@ -0,0 +1,94 @@
> >> +.. SPDX-License-Identifier: GPL-2.0  
> >
> >SPDX should match what's written there, e. g.
> >
> >	.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later
> >
> >Please notice that GNU FDL family contains both open source and non-open
> >source licenses. The open-source one is this:
> >
> >	https://spdx.org/licenses/GFDL-1.2-no-invariants-or-later.html
> >
> >E.g. it is a license that permits changing the entire document in the future, as
> >there are no invariant parts in it.  
> This seems not widely used; I have seen it used in only a few documents.

This was added after some discussions I had with LF people in charge
of SPDX: GFDL explicitly allows having some parts that can't be touched
by future patches. Those are "invariant" parts of the document.

They were designed in a way that the original author's notes
can't be touched by any further patch from someone else.

You can see more about that at:

	https://www.gnu.org/licenses/fdl-howto-opt.en.html

See:

	"The classical example of an invariant nontechnical section in a free manual
	 is the GNU Manifesto, which is included in the GNU Emacs Manual. The GNU
	 Manifesto says nothing about how to edit with Emacs, but it explains the
	 reason why I wrote GNU Emacs"

And:
	https://www.gnu.org/gnu/manifesto.html

Due to its nature of being invariant, most people consider it as a
non-open-source license. See, for instance:

	https://www.debian.org/vote/2006/vote_001

Due to such concerns, after several discussions I had with interested parties,
this was added to SPDX spec and to the Linux Kernel:

	- GFDL-1.2-no-invariants-only - for GFDL v 1.2 only
	- GFDL-1.2-no-invariants-or-later - for GFDL v 1.2 or later

(plus variants for other GFDL versions)

You may use either one of them, but you should *not* use GFDL-1.2
as this is deprecated:

	https://spdx.org/licenses/GFDL-1.2.html

And needs to be replaced by either:

	https://spdx.org/licenses/GFDL-1.2-no-invariants-or-later.html
or:
	https://spdx.org/licenses/GFDL-1.2-no-invariants-only.html

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-06 12:10 ` [PATCH v18 04/19] EDAC: Add memory repair " shiju.jose
  2025-01-09  9:19   ` Borislav Petkov
@ 2025-01-14 11:47   ` Mauro Carvalho Chehab
  2025-01-14 12:31     ` Shiju Jose
  2025-01-14 13:47   ` Mauro Carvalho Chehab
  2 siblings, 1 reply; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-14 11:47 UTC (permalink / raw)
  To: shiju.jose
  Cc: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel, bp,
	tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

Em Mon, 6 Jan 2025 12:10:00 +0000
<shiju.jose@huawei.com> escreveu:

> From: Shiju Jose <shiju.jose@huawei.com>
> 
> Add a generic EDAC memory repair control driver to manage memory repairs
> in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing
> features.
> 
> For example, a CXL device with DRAM components that support PPR features
> may implement PPR maintenance operations. DRAM components may support two
> types of PPR, hard PPR, for a permanent row repair, and soft PPR,  for a
> temporary row repair. Soft PPR is much faster than hard PPR, but the repair
> is lost with a power cycle.
> Similarly a CXL memory device may support soft and hard memory sparing at
> cacheline, row, bank and rank granularities. Memory sparing is defined as
> a repair function that replaces a portion of memory with a portion of
> functional memory at that same granularity.
> When a CXL device detects an error in memory, it may inform the host of
> the need for a repair maintenance operation by using an event record where
> the "maintenance needed" flag is set. The event record contains the device
> physical address (DPA) and other attributes of the memory to repair (such as
> channel, sub-channel, bank group, bank, rank, row, column etc). The kernel
> will report the corresponding CXL general media or DRAM trace event to
> userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
> operation in response to the device request via the sysfs repair control.
> 
> A device with memory repair features registers with the EDAC device driver,
> which retrieves memory repair descriptor from EDAC memory repair driver
> and exposes the sysfs repair control attributes to userspace in
> /sys/bus/edac/devices/<dev-name>/mem_repairX/.
> 
> The common memory repair control interface abstracts the control of
> arbitrary memory repair functionality into a standardized set of functions.
> The sysfs memory repair attribute nodes are only available if the client
> driver has implemented the corresponding attribute callback function and
> provided operations to the EDAC device driver during registration.
> 
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> ---
>  .../ABI/testing/sysfs-edac-memory-repair      | 244 +++++++++
>  Documentation/edac/features.rst               |   3 +
>  Documentation/edac/index.rst                  |   1 +
>  Documentation/edac/memory_repair.rst          | 101 ++++
>  drivers/edac/Makefile                         |   2 +-
>  drivers/edac/edac_device.c                    |  33 ++
>  drivers/edac/mem_repair.c                     | 492 ++++++++++++++++++
>  include/linux/edac.h                          | 139 +++++
>  8 files changed, 1014 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
>  create mode 100644 Documentation/edac/memory_repair.rst
>  create mode 100755 drivers/edac/mem_repair.c
> 
> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair
> new file mode 100644
> index 000000000000..e9268f3780ed
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
> @@ -0,0 +1,244 @@
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
> +		pertains to the memory media repair features control, such as
> +		PPR (Post Package Repair), memory sparing etc, where <dev-name>
> +		directory corresponds to a device registered with the EDAC
> +		device driver for the memory repair features.
> +
> +		Post Package Repair is a maintenance operation that requests the
> +		memory device to perform a repair operation on its media. In detail,
> +		it is a memory self-healing feature that fixes a failing memory
> +		location by replacing it with a spare row in a DRAM device. For
> +		example, a CXL memory device with DRAM components that support PPR
> +		features may implement PPR maintenance operations. DRAM components
> +		may support two types of PPR functions: hard PPR, for a permanent
> +		row repair, and soft PPR, for a temporary row repair. Soft PPR is
> +		much faster than hard PPR, but the repair is lost with a power cycle.
> +
> +		Memory sparing is a repair function that replaces a portion
> +		of memory with a portion of functional memory at that same
> +		sparing granularity. Memory sparing has cacheline/row/bank/rank
> +		sparing granularities. For example, in memory-sparing mode,
> +		one memory rank serves as a spare for other ranks on the same
> +		channel in case they fail. The spare rank is held in reserve and
> +		not used as active memory until a failure is indicated, with
> +		reserved capacity subtracted from the total available memory
> +		in the system. The DIMM installation order for memory sparing
> +		varies based on the number of processors and memory modules
> +		installed in the server. After an error threshold is surpassed
> +		in a system protected by memory sparing, the content of a failing
> +		rank of DIMMs is copied to the spare rank. The failing rank is
> +		then taken offline and the spare rank placed online for use as
> +		active memory in place of the failed rank.
> +
> +		The sysfs attributes nodes for a repair feature are only
> +		present if the parent driver has implemented the corresponding
> +		attr callback function and provided the necessary operations
> +		to the EDAC device driver during registration.
> +
> +		In some states of system configuration (e.g. before address
> +		decoders have been configured), memory devices (e.g. CXL)
> +		may not have an active mapping in the main host address
> +		physical address map. As such, the memory to repair must be
> +		identified by a device specific physical addressing scheme
> +		using a device physical address (DPA). The DPA and other control
> +		attributes to use will be presented in related error records.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RO) Memory repair function type, e.g. post package repair,
> +		memory sparing etc.
> +		EDAC_SOFT_PPR - Soft post package repair
> +		EDAC_HARD_PPR - Hard post package repair
> +		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
> +		EDAC_ROW_MEM_SPARING - Row memory sparing
> +		EDAC_BANK_MEM_SPARING - Bank memory sparing
> +		EDAC_RANK_MEM_SPARING - Rank memory sparing
> +		All other values are reserved.

Too big strings. Why are they in upper case? IMO:

	soft-ppr, hard-ppr, ... would be enough.

Also, is it mandatory that all types are supported? If not, you need a
way to report to userspace which of them are supported. One option
would be that reading /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
would return something like:

	soft-ppr [hard-ppr] row-mem-sparing

Also, as this will be parsed in ReST format, you need to change the
description to use bullets, otherwise the html/pdf version of the
document will place everything on a single line. E.g. something like:

Description:
		(RO) Memory repair function type, e.g. post package repair,
		memory sparing etc. Can be:

		- EDAC_SOFT_PPR - Soft post package repair
		- EDAC_HARD_PPR - Hard post package repair
		- EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
		- EDAC_ROW_MEM_SPARING - Row memory sparing
		- EDAC_BANK_MEM_SPARING - Bank memory sparing
		- EDAC_RANK_MEM_SPARING - Rank memory sparing
		- All other values are reserved.

Same applies to other sysfs nodes. See for instance:

	Documentation/ABI/stable/sysfs-class-backlight

And see how it is formatted after Sphinx processing at the Kernel
Admin guide:

	https://www.kernel.org/doc/html/latest/admin-guide/abi-stable.html#symbols-under-sys-class

Please fix it on all places you have a list of values.
	
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) Read/Write the current persist repair mode set for a
> +		repair function. The persist repair modes supported by the
> +		device depend on the memory repair function: a temporary
> +		repair is lost with a power cycle, a permanent one is not.
> +		EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
> +		EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
> +		All other values are reserved.

Same here: edac/ is already in the path. No need to place EDAC_ at the name.

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa_support
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RO) True if the memory device requires the device physical
> +		address (DPA) of the memory to repair.
> +		False if the memory device requires the host specific physical
> +                address (HPA) of memory to repair.

Please remove the extra spaces before "address", as otherwise conversion to
ReST may do the wrong thing or may produce doc warnings.

> +		In some states of system configuration (e.g. before address
> +		decoders have been configured), memory devices (e.g. CXL)
> +		may not have an active mapping in the main host address
> +		physical address map. As such, the memory to repair must be
> +		identified by a device specific physical addressing scheme
> +		using a device physical address (DPA). The DPA to use will be
> +		presented in related error records.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RO) True if memory media is accessible and data is retained
> +		during the memory repair operation.
> +		The data may not be retained and memory requests may not be
> +		correctly processed during a repair operation. In such a case,
> +		the repair operation should not be executed at runtime.

Please add an extra line before "The data" to ensure that the output at
the admin-guide won't merge the two paragraphs. Same on other places along
this patch series: paragraphs need a blank line at the description.

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/hpa
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) Host Physical Address (HPA) of the memory to repair.
> +		See attribute 'dpa_support' for more details.
> +		The HPA to use will be provided in related error records.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) Device Physical Address (DPA) of the memory to repair.
> +		See attribute 'dpa_support' for more details.
> +		The specific DPA to use will be provided in related error
> +		records.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) Read/Write Nibble mask of the memory to repair.
> +		Nibble mask identifies one or more nibbles in error on the
> +		memory bus that produced the error event. Nibble Mask bit 0
> +		shall be set if nibble 0 on the memory bus produced the
> +		event, etc. For example, for CXL PPR and sparing, a nibble
> +		mask bit set to 1 indicates the request to perform the repair
> +		operation in the specific device. All nibble mask bits set
> +		to 1 indicate the request to perform the operation in all
> +		devices. For CXL memory to repair, the specific value of the
> +		nibble mask to use will be provided in related error records.
> +		For more details, see the nibble mask field in CXL spec ver 3.1,
> +		section 8.2.9.7.1.2 Table 8-103 soft PPR, section
> +		8.2.9.7.1.3 Table 8-104 hard PPR and section 8.2.9.7.1.4
> +		Table 8-105 memory sparing.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/rank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/row
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/column
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) The control attributes associated with the memory address
> +		that is to be repaired. The specific values of the attributes
> +		to use depend on the portion of memory to repair. They may be
> +		reported to the host in related error records and made
> +		available to userspace in trace events, such as for CXL
> +		memory devices.
> +
> +		channel - The channel of the memory to repair. Channel is
> +		defined as an interface that can be independently accessed
> +		for a transaction.
> +		rank - The rank of the memory to repair. Rank is defined as a
> +		set of memory devices on a channel that together execute a
> +		transaction.
> +		bank_group - The bank group of the memory to repair.
> +		bank - The bank number of the memory to repair.
> +		row - The row number of the memory to repair.
> +		column - The column number of the memory to repair.
> +		sub_channel - The sub-channel of the memory to repair.

Same problem here with regards to bad ReST input. I would do:

	channel
		The channel of the memory to repair. Channel is
		defined as an interface that can be independently accessed
		for a transaction.

	rank
		The rank of the memory to repair. Rank is defined as a
		set of memory devices on a channel that together execute a
		transaction.

as this would provide a better output at admin-guide while still being
nicer to read as text.

> +
> +		The requirement to set these attributes varies based on the
> +		repair function. The attributes in sysfs are not present
> +		unless required for a repair function.
> +		For example, for CXL spec ver 3.1, Section 8.2.9.7.1.2 Table 8-103
> +		soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard PPR operations,
> +		these attributes need not be set.
> +		For CXL spec ver 3.1, Section 8.2.9.7.1.4 Table 8-105
> +		memory sparing, these attributes must be set based on the
> +		memory sparing granularity as follows.
> +		Channel: Channel associated with the DPA that is to be spared
> +		and applies to all subclasses of sparing (cacheline, bank,
> +		row and rank sparing).
> +		Rank: Rank associated with the DPA that is to be spared and
> +		applies to all subclasses of sparing.
> +		Bank & Bank Group: Bank & bank group are associated with
> +		the DPA that is to be spared and applies to cacheline sparing,
> +		row sparing and bank sparing subclasses.
> +		Row: Row associated with the DPA that is to be spared and
> +		applies to cacheline sparing and row sparing subclasses.
> +		Column: column associated with the DPA that is to be spared
> +		and applies to cacheline sparing only.
> +		Sub-channel: sub-channel associated with the DPA that is to
> +		be spared and applies to cacheline sparing only.

Same here: this will all be on a single paragraph which would be really
weird.

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RO) The supported range of each control attribute (optional)
> +		associated with a memory address that is to be repaired.
> +		The supported ranges depend on the memory device and the
> +		portion of memory to repair.
> +		Userspace may receive the specific values of the attributes
> +		to use for a repair operation from the memory device via
> +		related error records and trace events, such as for CXL
> +		memory devices.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(WO) Issue the memory repair operation for the specified
> +		memory repair attributes. The operation may fail if resources
> +		are insufficient based on the requirements of the memory
> +		device and repair function.
> +		EDAC_DO_MEM_REPAIR - issue repair operation.
> +		All other values are reserved.
> diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
> index ba3ab993ee4f..bfd5533b81b7 100644
> --- a/Documentation/edac/features.rst
> +++ b/Documentation/edac/features.rst
> @@ -97,3 +97,6 @@ RAS features
>  ------------
>  1. Memory Scrub
>  Memory scrub features are documented in `Documentation/edac/scrub.rst`.
> +
> +2. Memory Repair
> +Memory repair features are documented in `Documentation/edac/memory_repair.rst`.
> diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
> index dfb0c9fb9ab1..d6778f4562dd 100644
> --- a/Documentation/edac/index.rst
> +++ b/Documentation/edac/index.rst
> @@ -8,4 +8,5 @@ EDAC Subsystem
>     :maxdepth: 1
>  
>     features
> +   memory_repair
>     scrub
> diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst
> new file mode 100644
> index 000000000000..2787a8a2d6ba
> --- /dev/null
> +++ b/Documentation/edac/memory_repair.rst
> @@ -0,0 +1,101 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========================
> +EDAC Memory Repair Control
> +==========================
> +
> +Copyright (c) 2024 HiSilicon Limited.
> +
> +:Author:   Shiju Jose <shiju.jose@huawei.com>
> +:License:  The GNU Free Documentation License, Version 1.2
> +          (dual licensed under the GPL v2)
> +:Original Reviewers:
> +
> +- Written for: 6.14

See my comments with regards to license on the previous patches.

> +
> +Introduction
> +------------
> +Memory devices may support repair operations to address issues in their
> +memory media. Post Package Repair (PPR) and memory sparing are examples
> +of such features.
> +
> +Post Package Repair(PPR)
> +~~~~~~~~~~~~~~~~~~~~~~~~
> +Post Package Repair is a maintenance operation which requests the memory
> +device to perform a repair operation on its media. It is a memory
> +self-healing feature that fixes a failing memory location by replacing it
> +with a spare row in a DRAM device. For example, a CXL memory device with
> +DRAM components that support PPR features may implement PPR maintenance
> +operations. DRAM components may support two types of PPR functions: hard
> +PPR, for a permanent row repair, and soft PPR, for a temporary row repair.
> +Soft PPR is much faster than hard PPR, but the repair is lost with a power
> +cycle. The data may not be retained and memory requests may not be
> +correctly processed during a repair operation. In such a case, the repair
> +operation should not be executed at runtime.
> +For example, for CXL memory devices, soft PPR and hard PPR repair
> +operations may be supported. See CXL spec rev 3.1 sections 8.2.9.7.1.1 PPR
> +Maintenance Operations, 8.2.9.7.1.2 sPPR Maintenance Operation and
> +8.2.9.7.1.3 hPPR Maintenance Operation for more details.

Paragraphs require blank lines in ReST. Also, please place a link to the
specs.

I strongly suggest looking at the output of all docs with make htmldocs
and make pdfdocs to be sure that the paragraphs and the final document
will be properly handled. You may use:

	SPHINXDIRS="<book name(s)>"

to speed-up documentation builds.

Please see Sphinx documentation for more details about what it is expected
there:

	https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html

> +
> +Memory Sparing
> +~~~~~~~~~~~~~~
> +Memory sparing is a repair function that replaces a portion of memory with
> +a portion of functional memory at that same sparing granularity. Memory
> +sparing has cacheline/row/bank/rank sparing granularities. For example, in
> +memory-sparing mode, one memory rank serves as a spare for other ranks on
> +the same channel in case they fail. The spare rank is held in reserve and
> +not used as active memory until a failure is indicated, with reserved
> +capacity subtracted from the total available memory in the system. The DIMM
> +installation order for memory sparing varies based on the number of processors
> +and memory modules installed in the server. After an error threshold is
> +surpassed in a system protected by memory sparing, the content of a failing
> +rank of DIMMs is copied to the spare rank. The failing rank is then taken
> +offline and the spare rank placed online for use as active memory in place
> +of the failed rank.
> +
> +For example, CXL memory devices may support various subclasses of sparing
> +operations that vary in terms of the scope of the sparing being performed.
> +Cacheline sparing subclass refers to a sparing action that can replace a
> +full cacheline. Row sparing is provided as an alternative to PPR sparing
> +functions and its scope is that of a single DDR row. Bank sparing allows
> +an entire bank to be replaced. Rank sparing is defined as an operation
> +in which an entire DDR rank is replaced. See CXL spec 3.1 section
> +8.2.9.7.1.4 Memory Sparing Maintenance Operations for more details.
> +
> +Use cases of generic memory repair features control
> +---------------------------------------------------
> +
> +1. The soft PPR, hard PPR and memory-sparing features share similar
> +control attributes. Therefore, there is a need for a standardized, generic
> +sysfs repair control that is exposed to userspace and used by
> +administrators, scripts and tools.
> +
> +2. When a CXL device detects an error in a memory component, it may inform
> +the host of the need for a repair maintenance operation by using an event
> +record where the "maintenance needed" flag is set. The event record
> +specifies the device physical address (DPA) and attributes of the memory that
> +requires repair. The kernel reports the corresponding CXL general media or
> +DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) initiate
> +a repair maintenance operation in response to the device request using the
> +sysfs repair control.
> +
> +3. Userspace tools, such as rasdaemon, may request a PPR/sparing on a memory
> +region when an uncorrected memory error or an excess of corrected memory
> +errors is reported on that memory.
> +
> +4. Multiple PPR/sparing instances may be present per memory device.
> +
> +The File System
> +---------------
> +
> +The control attributes of a registered memory repair instance can be
> +accessed in:
> +
> +/sys/bus/edac/devices/<dev-name>/mem_repairX/
> +
> +sysfs
> +-----
> +
> +Sysfs files are documented in
> +
> +`Documentation/ABI/testing/sysfs-edac-memory-repair`.
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index 3a49304860f0..1de9fe66ac6b 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC)			:= edac_core.o
>  
>  edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
>  edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
> -edac_core-y	+= scrub.o ecs.o
> +edac_core-y	+= scrub.o ecs.o mem_repair.o
>  
>  edac_core-$(CONFIG_EDAC_DEBUG)		+= debugfs.o
>  
> diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
> index 1c1142a2e4e4..a401d81dad8a 100644
> --- a/drivers/edac/edac_device.c
> +++ b/drivers/edac/edac_device.c
> @@ -575,6 +575,7 @@ static void edac_dev_release(struct device *dev)
>  {
>  	struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev);
>  
> +	kfree(ctx->mem_repair);
>  	kfree(ctx->scrub);
>  	kfree(ctx->dev.groups);
>  	kfree(ctx);
> @@ -611,6 +612,7 @@ int edac_dev_register(struct device *parent, char *name,
>  	const struct attribute_group **ras_attr_groups;
>  	struct edac_dev_data *dev_data;
>  	struct edac_dev_feat_ctx *ctx;
> +	int mem_repair_cnt = 0;
>  	int attr_gcnt = 0;
>  	int scrub_cnt = 0;
>  	int ret, feat;
> @@ -628,6 +630,10 @@ int edac_dev_register(struct device *parent, char *name,
>  		case RAS_FEAT_ECS:
>  			attr_gcnt += ras_features[feat].ecs_info.num_media_frus;
>  			break;
> +		case RAS_FEAT_MEM_REPAIR:
> +			attr_gcnt++;
> +			mem_repair_cnt++;
> +			break;
>  		default:
>  			return -EINVAL;
>  		}
> @@ -651,8 +657,17 @@ int edac_dev_register(struct device *parent, char *name,
>  		}
>  	}
>  
> +	if (mem_repair_cnt) {
> +		ctx->mem_repair = kcalloc(mem_repair_cnt, sizeof(*ctx->mem_repair), GFP_KERNEL);
> +		if (!ctx->mem_repair) {
> +			ret = -ENOMEM;
> +			goto data_mem_free;
> +		}
> +	}
> +
>  	attr_gcnt = 0;
>  	scrub_cnt = 0;
> +	mem_repair_cnt = 0;
>  	for (feat = 0; feat < num_features; feat++, ras_features++) {
>  		switch (ras_features->ft_type) {
>  		case RAS_FEAT_SCRUB:
> @@ -686,6 +701,23 @@ int edac_dev_register(struct device *parent, char *name,
>  
>  			attr_gcnt += ras_features->ecs_info.num_media_frus;
>  			break;
> +		case RAS_FEAT_MEM_REPAIR:
> +			if (!ras_features->mem_repair_ops ||
> +			    mem_repair_cnt != ras_features->instance)
> +				goto data_mem_free;
> +
> +			dev_data = &ctx->mem_repair[mem_repair_cnt];
> +			dev_data->instance = mem_repair_cnt;
> +			dev_data->mem_repair_ops = ras_features->mem_repair_ops;
> +			dev_data->private = ras_features->ctx;
> +			ret = edac_mem_repair_get_desc(parent, &ras_attr_groups[attr_gcnt],
> +						       ras_features->instance);
> +			if (ret)
> +				goto data_mem_free;
> +
> +			mem_repair_cnt++;
> +			attr_gcnt++;
> +			break;
>  		default:
>  			ret = -EINVAL;
>  			goto data_mem_free;
> @@ -712,6 +744,7 @@ int edac_dev_register(struct device *parent, char *name,
>  	return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev);
>  
>  data_mem_free:
> +	kfree(ctx->mem_repair);
>  	kfree(ctx->scrub);
>  groups_free:
>  	kfree(ras_attr_groups);
> diff --git a/drivers/edac/mem_repair.c b/drivers/edac/mem_repair.c
> new file mode 100755
> index 000000000000..e7439fd26c41
> --- /dev/null
> +++ b/drivers/edac/mem_repair.c
> @@ -0,0 +1,492 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * The generic EDAC memory repair driver is designed to control the memory
> + * devices with memory repair features, such as Post Package Repair (PPR),
> + * memory sparing etc. The common sysfs memory repair interface abstracts
> + * the control of various arbitrary memory repair functionalities into a
> + * unified set of functions.
> + *
> + * Copyright (c) 2024 HiSilicon Limited.
> + */
> +
> +#include <linux/edac.h>
> +
> +enum edac_mem_repair_attributes {
> +	MEM_REPAIR_FUNCTION,
> +	MEM_REPAIR_PERSIST_MODE,
> +	MEM_REPAIR_DPA_SUPPORT,
> +	MEM_REPAIR_SAFE_IN_USE,
> +	MEM_REPAIR_HPA,
> +	MEM_REPAIR_MIN_HPA,
> +	MEM_REPAIR_MAX_HPA,
> +	MEM_REPAIR_DPA,
> +	MEM_REPAIR_MIN_DPA,
> +	MEM_REPAIR_MAX_DPA,
> +	MEM_REPAIR_NIBBLE_MASK,
> +	MEM_REPAIR_MIN_NIBBLE_MASK,
> +	MEM_REPAIR_MAX_NIBBLE_MASK,
> +	MEM_REPAIR_BANK_GROUP,
> +	MEM_REPAIR_MIN_BANK_GROUP,
> +	MEM_REPAIR_MAX_BANK_GROUP,
> +	MEM_REPAIR_BANK,
> +	MEM_REPAIR_MIN_BANK,
> +	MEM_REPAIR_MAX_BANK,
> +	MEM_REPAIR_RANK,
> +	MEM_REPAIR_MIN_RANK,
> +	MEM_REPAIR_MAX_RANK,
> +	MEM_REPAIR_ROW,
> +	MEM_REPAIR_MIN_ROW,
> +	MEM_REPAIR_MAX_ROW,
> +	MEM_REPAIR_COLUMN,
> +	MEM_REPAIR_MIN_COLUMN,
> +	MEM_REPAIR_MAX_COLUMN,
> +	MEM_REPAIR_CHANNEL,
> +	MEM_REPAIR_MIN_CHANNEL,
> +	MEM_REPAIR_MAX_CHANNEL,
> +	MEM_REPAIR_SUB_CHANNEL,
> +	MEM_REPAIR_MIN_SUB_CHANNEL,
> +	MEM_REPAIR_MAX_SUB_CHANNEL,
> +	MEM_DO_REPAIR,
> +	MEM_REPAIR_MAX_ATTRS
> +};
> +
> +struct edac_mem_repair_dev_attr {
> +	struct device_attribute dev_attr;
> +	u8 instance;
> +};
> +
> +struct edac_mem_repair_context {
> +	char name[EDAC_FEAT_NAME_LEN];
> +	struct edac_mem_repair_dev_attr mem_repair_dev_attr[MEM_REPAIR_MAX_ATTRS];
> +	struct attribute *mem_repair_attrs[MEM_REPAIR_MAX_ATTRS + 1];
> +	struct attribute_group group;
> +};
> +
> +#define TO_MEM_REPAIR_DEV_ATTR(_dev_attr)      \
> +		container_of(_dev_attr, struct edac_mem_repair_dev_attr, dev_attr)
> +
> +#define EDAC_MEM_REPAIR_ATTR_SHOW(attrib, cb, type, format)			\
> +static ssize_t attrib##_show(struct device *ras_feat_dev,			\
> +			     struct device_attribute *attr, char *buf)		\
> +{										\
> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
> +	const struct edac_mem_repair_ops *ops =					\
> +				ctx->mem_repair[inst].mem_repair_ops;		\
> +	type data;								\
> +	int ret;								\
> +										\
> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
> +		      &data);							\
> +	if (ret)								\
> +		return ret;							\
> +										\
> +	return sysfs_emit(buf, format, data);					\
> +}
> +
> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_function, get_repair_function, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(persist_mode, get_persist_mode, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa_support, get_dpa_support, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_safe_when_in_use, get_repair_safe_when_in_use, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(hpa, get_hpa, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_hpa, get_min_hpa, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_hpa, get_max_hpa, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(nibble_mask, get_nibble_mask, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_nibble_mask, get_min_nibble_mask, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_nibble_mask, get_max_nibble_mask, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(bank_group, get_bank_group, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank_group, get_min_bank_group, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank_group, get_max_bank_group, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(bank, get_bank, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank, get_min_bank, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank, get_max_bank, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(rank, get_rank, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_rank, get_min_rank, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_rank, get_max_rank, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(row, get_row, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_row, get_min_row, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_row, get_max_row, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(column, get_column, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_column, get_min_column, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_column, get_max_column, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(channel, get_channel, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_channel, get_min_channel, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_channel, get_max_channel, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(sub_channel, get_sub_channel, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_sub_channel, get_min_sub_channel, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_sub_channel, get_max_sub_channel, u32, "%u\n")
> +
> +#define EDAC_MEM_REPAIR_ATTR_STORE(attrib, cb, type, conv_func)			\
> +static ssize_t attrib##_store(struct device *ras_feat_dev,			\
> +			      struct device_attribute *attr,			\
> +			      const char *buf, size_t len)			\
> +{										\
> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
> +	const struct edac_mem_repair_ops *ops =					\
> +				ctx->mem_repair[inst].mem_repair_ops;		\
> +	type data;								\
> +	int ret;								\
> +										\
> +	ret = conv_func(buf, 0, &data);						\
> +	if (ret < 0)								\
> +		return ret;							\
> +										\
> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
> +		      data);							\
> +	if (ret)								\
> +		return ret;							\
> +										\
> +	return len;								\
> +}
> +
> +EDAC_MEM_REPAIR_ATTR_STORE(persist_mode, set_persist_mode, unsigned long, kstrtoul)
> +EDAC_MEM_REPAIR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64)
> +EDAC_MEM_REPAIR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64)
> +EDAC_MEM_REPAIR_ATTR_STORE(nibble_mask, set_nibble_mask, u64, kstrtou64)
> +EDAC_MEM_REPAIR_ATTR_STORE(bank_group, set_bank_group, unsigned long, kstrtoul)
> +EDAC_MEM_REPAIR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul)
> +EDAC_MEM_REPAIR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul)
> +EDAC_MEM_REPAIR_ATTR_STORE(row, set_row, u64, kstrtou64)
> +EDAC_MEM_REPAIR_ATTR_STORE(column, set_column, unsigned long, kstrtoul)
> +EDAC_MEM_REPAIR_ATTR_STORE(channel, set_channel, unsigned long, kstrtoul)
> +EDAC_MEM_REPAIR_ATTR_STORE(sub_channel, set_sub_channel, unsigned long, kstrtoul)
> +
> +#define EDAC_MEM_REPAIR_DO_OP(attrib, cb)						\
> +static ssize_t attrib##_store(struct device *ras_feat_dev,				\
> +			      struct device_attribute *attr,				\
> +			      const char *buf, size_t len)				\
> +{											\
> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;				\
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);			\
> +	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;	\
> +	unsigned long data;								\
> +	int ret;									\
> +											\
> +	ret = kstrtoul(buf, 0, &data);							\
> +	if (ret < 0)									\
> +		return ret;								\
> +											\
> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, data);	\
> +	if (ret)									\
> +		return ret;								\
> +											\
> +	return len;									\
> +}
> +
> +EDAC_MEM_REPAIR_DO_OP(repair, do_repair)
> +
> +static umode_t mem_repair_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
> +{
> +	struct device *ras_feat_dev = kobj_to_dev(kobj);
> +	struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr);
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(dev_attr)->instance;
> +	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;
> +
> +	switch (attr_id) {
> +	case MEM_REPAIR_FUNCTION:
> +		if (ops->get_repair_function)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_PERSIST_MODE:
> +		if (ops->get_persist_mode) {
> +			if (ops->set_persist_mode)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_DPA_SUPPORT:
> +		if (ops->get_dpa_support)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_SAFE_IN_USE:
> +		if (ops->get_repair_safe_when_in_use)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_HPA:
> +		if (ops->get_hpa) {
> +			if (ops->set_hpa)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_HPA:
> +		if (ops->get_min_hpa)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_HPA:
> +		if (ops->get_max_hpa)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_DPA:
> +		if (ops->get_dpa) {
> +			if (ops->set_dpa)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_DPA:
> +		if (ops->get_min_dpa)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_DPA:
> +		if (ops->get_max_dpa)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_NIBBLE_MASK:
> +		if (ops->get_nibble_mask) {
> +			if (ops->set_nibble_mask)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_NIBBLE_MASK:
> +		if (ops->get_min_nibble_mask)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_NIBBLE_MASK:
> +		if (ops->get_max_nibble_mask)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_BANK_GROUP:
> +		if (ops->get_bank_group) {
> +			if (ops->set_bank_group)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_BANK_GROUP:
> +		if (ops->get_min_bank_group)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_BANK_GROUP:
> +		if (ops->get_max_bank_group)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_BANK:
> +		if (ops->get_bank) {
> +			if (ops->set_bank)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_BANK:
> +		if (ops->get_min_bank)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_BANK:
> +		if (ops->get_max_bank)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_RANK:
> +		if (ops->get_rank) {
> +			if (ops->set_rank)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_RANK:
> +		if (ops->get_min_rank)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_RANK:
> +		if (ops->get_max_rank)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_ROW:
> +		if (ops->get_row) {
> +			if (ops->set_row)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_ROW:
> +		if (ops->get_min_row)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_ROW:
> +		if (ops->get_max_row)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_COLUMN:
> +		if (ops->get_column) {
> +			if (ops->set_column)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_COLUMN:
> +		if (ops->get_min_column)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_COLUMN:
> +		if (ops->get_max_column)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_CHANNEL:
> +		if (ops->get_channel) {
> +			if (ops->set_channel)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_CHANNEL:
> +		if (ops->get_min_channel)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_CHANNEL:
> +		if (ops->get_max_channel)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_SUB_CHANNEL:
> +		if (ops->get_sub_channel) {
> +			if (ops->set_sub_channel)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_SUB_CHANNEL:
> +		if (ops->get_min_sub_channel)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_SUB_CHANNEL:
> +		if (ops->get_max_sub_channel)
> +			return a->mode;
> +		break;
> +	case MEM_DO_REPAIR:
> +		if (ops->do_repair)
> +			return a->mode;
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return 0;
> +}
> +
> +#define EDAC_MEM_REPAIR_ATTR_RO(_name, _instance)       \
> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RO(_name), \
> +					     .instance = _instance })
> +
> +#define EDAC_MEM_REPAIR_ATTR_WO(_name, _instance)       \
> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_WO(_name), \
> +					     .instance = _instance })
> +
> +#define EDAC_MEM_REPAIR_ATTR_RW(_name, _instance)       \
> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RW(_name), \
> +					     .instance = _instance })
> +
> +static int mem_repair_create_desc(struct device *dev,
> +				  const struct attribute_group **attr_groups,
> +				  u8 instance)
> +{
> +	struct edac_mem_repair_context *ctx;
> +	struct attribute_group *group;
> +	int i;
> +	struct edac_mem_repair_dev_attr dev_attr[] = {
> +		[MEM_REPAIR_FUNCTION] = EDAC_MEM_REPAIR_ATTR_RO(repair_function,
> +							    instance),
> +		[MEM_REPAIR_PERSIST_MODE] =
> +				EDAC_MEM_REPAIR_ATTR_RW(persist_mode, instance),
> +		[MEM_REPAIR_DPA_SUPPORT] =
> +				EDAC_MEM_REPAIR_ATTR_RO(dpa_support, instance),
> +		[MEM_REPAIR_SAFE_IN_USE] =
> +				EDAC_MEM_REPAIR_ATTR_RO(repair_safe_when_in_use,
> +							instance),
> +		[MEM_REPAIR_HPA] = EDAC_MEM_REPAIR_ATTR_RW(hpa, instance),
> +		[MEM_REPAIR_MIN_HPA] = EDAC_MEM_REPAIR_ATTR_RO(min_hpa, instance),
> +		[MEM_REPAIR_MAX_HPA] = EDAC_MEM_REPAIR_ATTR_RO(max_hpa, instance),
> +		[MEM_REPAIR_DPA] = EDAC_MEM_REPAIR_ATTR_RW(dpa, instance),
> +		[MEM_REPAIR_MIN_DPA] = EDAC_MEM_REPAIR_ATTR_RO(min_dpa, instance),
> +		[MEM_REPAIR_MAX_DPA] = EDAC_MEM_REPAIR_ATTR_RO(max_dpa, instance),
> +		[MEM_REPAIR_NIBBLE_MASK] =
> +				EDAC_MEM_REPAIR_ATTR_RW(nibble_mask, instance),
> +		[MEM_REPAIR_MIN_NIBBLE_MASK] =
> +				EDAC_MEM_REPAIR_ATTR_RO(min_nibble_mask, instance),
> +		[MEM_REPAIR_MAX_NIBBLE_MASK] =
> +				EDAC_MEM_REPAIR_ATTR_RO(max_nibble_mask, instance),
> +		[MEM_REPAIR_BANK_GROUP] =
> +				EDAC_MEM_REPAIR_ATTR_RW(bank_group, instance),
> +		[MEM_REPAIR_MIN_BANK_GROUP] =
> +				EDAC_MEM_REPAIR_ATTR_RO(min_bank_group, instance),
> +		[MEM_REPAIR_MAX_BANK_GROUP] =
> +				EDAC_MEM_REPAIR_ATTR_RO(max_bank_group, instance),
> +		[MEM_REPAIR_BANK] = EDAC_MEM_REPAIR_ATTR_RW(bank, instance),
> +		[MEM_REPAIR_MIN_BANK] = EDAC_MEM_REPAIR_ATTR_RO(min_bank, instance),
> +		[MEM_REPAIR_MAX_BANK] = EDAC_MEM_REPAIR_ATTR_RO(max_bank, instance),
> +		[MEM_REPAIR_RANK] = EDAC_MEM_REPAIR_ATTR_RW(rank, instance),
> +		[MEM_REPAIR_MIN_RANK] = EDAC_MEM_REPAIR_ATTR_RO(min_rank, instance),
> +		[MEM_REPAIR_MAX_RANK] = EDAC_MEM_REPAIR_ATTR_RO(max_rank, instance),
> +		[MEM_REPAIR_ROW] = EDAC_MEM_REPAIR_ATTR_RW(row, instance),
> +		[MEM_REPAIR_MIN_ROW] = EDAC_MEM_REPAIR_ATTR_RO(min_row, instance),
> +		[MEM_REPAIR_MAX_ROW] = EDAC_MEM_REPAIR_ATTR_RO(max_row, instance),
> +		[MEM_REPAIR_COLUMN] = EDAC_MEM_REPAIR_ATTR_RW(column, instance),
> +		[MEM_REPAIR_MIN_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(min_column, instance),
> +		[MEM_REPAIR_MAX_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(max_column, instance),
> +		[MEM_REPAIR_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RW(channel, instance),
> +		[MEM_REPAIR_MIN_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(min_channel, instance),
> +		[MEM_REPAIR_MAX_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(max_channel, instance),
> +		[MEM_REPAIR_SUB_CHANNEL] =
> +				EDAC_MEM_REPAIR_ATTR_RW(sub_channel, instance),
> +		[MEM_REPAIR_MIN_SUB_CHANNEL] =
> +				EDAC_MEM_REPAIR_ATTR_RO(min_sub_channel, instance),
> +		[MEM_REPAIR_MAX_SUB_CHANNEL] =
> +				EDAC_MEM_REPAIR_ATTR_RO(max_sub_channel, instance),
> +		[MEM_DO_REPAIR] = EDAC_MEM_REPAIR_ATTR_WO(repair, instance)
> +	};
> +
> +	ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < MEM_REPAIR_MAX_ATTRS; i++) {
> +		memcpy(&ctx->mem_repair_dev_attr[i].dev_attr,
> +		       &dev_attr[i], sizeof(dev_attr[i]));
> +		ctx->mem_repair_attrs[i] =
> +				&ctx->mem_repair_dev_attr[i].dev_attr.attr;
> +	}
> +
> +	sprintf(ctx->name, "%s%d", "mem_repair", instance);
> +	group = &ctx->group;
> +	group->name = ctx->name;
> +	group->attrs = ctx->mem_repair_attrs;
> +	group->is_visible  = mem_repair_attr_visible;
> +	attr_groups[0] = group;
> +
> +	return 0;
> +}
> +
> +/**
> + * edac_mem_repair_get_desc - get EDAC memory repair descriptors
> + * @dev: client device with memory repair feature
> + * @attr_groups: pointer to attribute group container
> + * @instance: device's memory repair instance number.
> + *
> + * Return:
> + *  * %0	- Success.
> + *  * %-EINVAL	- Invalid parameters passed.
> + *  * %-ENOMEM	- Dynamic memory allocation failed.
> + */
> +int edac_mem_repair_get_desc(struct device *dev,
> +			     const struct attribute_group **attr_groups, u8 instance)
> +{
> +	if (!dev || !attr_groups)
> +		return -EINVAL;
> +
> +	return mem_repair_create_desc(dev, attr_groups, instance);
> +}
> diff --git a/include/linux/edac.h b/include/linux/edac.h
> index 979e91426701..5d07192bf1a7 100644
> --- a/include/linux/edac.h
> +++ b/include/linux/edac.h
> @@ -668,6 +668,7 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
>  enum edac_dev_feat {
>  	RAS_FEAT_SCRUB,
>  	RAS_FEAT_ECS,
> +	RAS_FEAT_MEM_REPAIR,
>  	RAS_FEAT_MAX
>  };
>  
> @@ -729,11 +730,147 @@ int edac_ecs_get_desc(struct device *ecs_dev,
>  		      const struct attribute_group **attr_groups,
>  		      u16 num_media_frus);
>  
> +enum edac_mem_repair_function {
> +	EDAC_SOFT_PPR,
> +	EDAC_HARD_PPR,
> +	EDAC_CACHELINE_MEM_SPARING,
> +	EDAC_ROW_MEM_SPARING,
> +	EDAC_BANK_MEM_SPARING,
> +	EDAC_RANK_MEM_SPARING,
> +};
> +
> +enum edac_mem_repair_persist_mode {
> +	EDAC_MEM_REPAIR_SOFT, /* soft memory repair */
> +	EDAC_MEM_REPAIR_HARD, /* hard memory repair */
> +};
> +
> +enum edac_mem_repair_cmd {
> +	EDAC_DO_MEM_REPAIR = 1,
> +};
> +
> +/**
> + * struct edac_mem_repair_ops - memory repair operations
> + * (all elements are optional except do_repair and one of set_hpa/set_dpa)
> + * @get_repair_function: get the memory repair function, listed in
> + *			 enum edac_mem_repair_function.
> + * @get_persist_mode: get the current persist mode. Which persist repair modes a
> + *		      device supports depends on the memory repair function; a
> + *		      repair is either temporary (lost with a power cycle) or
> + *		      permanent.
> + *		      EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
> + *		      EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
> + *		      All other values are reserved.
> + * @set_persist_mode: set the persist mode of the memory repair instance.
> + * @get_dpa_support: get dpa support flag. In some states of system configuration
> + *		     (e.g. before address decoders have been configured), memory devices
> + *		     (e.g. CXL) may not have an active mapping in the main host address
> + *		     physical address map. As such, the memory to repair must be identified
> + *		     by a device specific physical addressing scheme using a device physical
> + *		     address(DPA). The DPA and other control attributes to use for the
> + *		     dry_run and repair operations will be presented in related error records.
> + * @get_repair_safe_when_in_use: get whether memory media is accessible and
> + *				 data is retained during repair operation.
> + * @get_hpa: get current host physical address (HPA).
> + * @set_hpa: set host physical address (HPA) of memory to repair.
> + * @get_min_hpa: get the minimum supported host physical address (HPA).
> + * @get_max_hpa: get the maximum supported host physical address (HPA).
> + * @get_dpa: get current device physical address (DPA).
> + * @set_dpa: set device physical address (DPA) of memory to repair.
> + * @get_min_dpa: get the minimum supported device physical address (DPA).
> + * @get_max_dpa: get the maximum supported device physical address (DPA).
> + * @get_nibble_mask: get current nibble mask.
> + * @set_nibble_mask: set nibble mask of memory to repair.
> + * @get_min_nibble_mask: get the minimum supported nibble mask.
> + * @get_max_nibble_mask: get the maximum supported nibble mask.
> + * @get_bank_group: get current bank group.
> + * @set_bank_group: set bank group of memory to repair.
> + * @get_min_bank_group: get the minimum supported bank group.
> + * @get_max_bank_group: get the maximum supported bank group.
> + * @get_bank: get current bank.
> + * @set_bank: set bank of memory to repair.
> + * @get_min_bank: get the minimum supported bank.
> + * @get_max_bank: get the maximum supported bank.
> + * @get_rank: get current rank.
> + * @set_rank: set rank of memory to repair.
> + * @get_min_rank: get the minimum supported rank.
> + * @get_max_rank: get the maximum supported rank.
> + * @get_row: get current row.
> + * @set_row: set row of memory to repair.
> + * @get_min_row: get the minimum supported row.
> + * @get_max_row: get the maximum supported row.
> + * @get_column: get current column.
> + * @set_column: set column of memory to repair.
> + * @get_min_column: get the minimum supported column.
> + * @get_max_column: get the maximum supported column.
> + * @get_channel: get current channel.
> + * @set_channel: set channel of memory to repair.
> + * @get_min_channel: get the minimum supported channel.
> + * @get_max_channel: get the maximum supported channel.
> + * @get_sub_channel: get current sub channel.
> + * @set_sub_channel: set sub channel of memory to repair.
> + * @get_min_sub_channel: get the minimum supported sub channel.
> + * @get_max_sub_channel: get the maximum supported sub channel.
> + * @do_repair: Issue memory repair operation for the HPA/DPA and
> + *	       other control attributes set for the memory to repair.
> + */
> +struct edac_mem_repair_ops {
> +	int (*get_repair_function)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_persist_mode)(struct device *dev, void *drv_data, u32 *mode);
> +	int (*set_persist_mode)(struct device *dev, void *drv_data, u32 mode);
> +	int (*get_dpa_support)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_repair_safe_when_in_use)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_hpa)(struct device *dev, void *drv_data, u64 *hpa);
> +	int (*set_hpa)(struct device *dev, void *drv_data, u64 hpa);
> +	int (*get_min_hpa)(struct device *dev, void *drv_data, u64 *hpa);
> +	int (*get_max_hpa)(struct device *dev, void *drv_data, u64 *hpa);
> +	int (*get_dpa)(struct device *dev, void *drv_data, u64 *dpa);
> +	int (*set_dpa)(struct device *dev, void *drv_data, u64 dpa);
> +	int (*get_min_dpa)(struct device *dev, void *drv_data, u64 *dpa);
> +	int (*get_max_dpa)(struct device *dev, void *drv_data, u64 *dpa);
> +	int (*get_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
> +	int (*set_nibble_mask)(struct device *dev, void *drv_data, u64 val);
> +	int (*get_min_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
> +	int (*get_max_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
> +	int (*get_bank_group)(struct device *dev, void *drv_data, u32 *val);
> +	int (*set_bank_group)(struct device *dev, void *drv_data, u32 val);
> +	int (*get_min_bank_group)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_max_bank_group)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_bank)(struct device *dev, void *drv_data, u32 *val);
> +	int (*set_bank)(struct device *dev, void *drv_data, u32 val);
> +	int (*get_min_bank)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_max_bank)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_rank)(struct device *dev, void *drv_data, u32 *val);
> +	int (*set_rank)(struct device *dev, void *drv_data, u32 val);
> +	int (*get_min_rank)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_max_rank)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_row)(struct device *dev, void *drv_data, u64 *val);
> +	int (*set_row)(struct device *dev, void *drv_data, u64 val);
> +	int (*get_min_row)(struct device *dev, void *drv_data, u64 *val);
> +	int (*get_max_row)(struct device *dev, void *drv_data, u64 *val);
> +	int (*get_column)(struct device *dev, void *drv_data, u32 *val);
> +	int (*set_column)(struct device *dev, void *drv_data, u32 val);
> +	int (*get_min_column)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_max_column)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_channel)(struct device *dev, void *drv_data, u32 *val);
> +	int (*set_channel)(struct device *dev, void *drv_data, u32 val);
> +	int (*get_min_channel)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_max_channel)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_sub_channel)(struct device *dev, void *drv_data, u32 *val);
> +	int (*set_sub_channel)(struct device *dev, void *drv_data, u32 val);
> +	int (*get_min_sub_channel)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_max_sub_channel)(struct device *dev, void *drv_data, u32 *val);
> +	int (*do_repair)(struct device *dev, void *drv_data, u32 val);
> +};
> +
> +int edac_mem_repair_get_desc(struct device *dev,
> +			     const struct attribute_group **attr_groups,
> +			     u8 instance);
> +
>  /* EDAC device feature information structure */
>  struct edac_dev_data {
>  	union {
>  		const struct edac_scrub_ops *scrub_ops;
>  		const struct edac_ecs_ops *ecs_ops;
> +		const struct edac_mem_repair_ops *mem_repair_ops;
>  	};
>  	u8 instance;
>  	void *private;
> @@ -744,6 +881,7 @@ struct edac_dev_feat_ctx {
>  	void *private;
>  	struct edac_dev_data *scrub;
>  	struct edac_dev_data ecs;
> +	struct edac_dev_data *mem_repair;
>  };
>  
>  struct edac_dev_feature {
> @@ -752,6 +890,7 @@ struct edac_dev_feature {
>  	union {
>  		const struct edac_scrub_ops *scrub_ops;
>  		const struct edac_ecs_ops *ecs_ops;
> +		const struct edac_mem_repair_ops *mem_repair_ops;
>  	};
>  	void *ctx;
>  	struct edac_ecs_ex_info ecs_info;

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-14 11:47   ` Mauro Carvalho Chehab
@ 2025-01-14 12:31     ` Shiju Jose
  2025-01-14 14:26       ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 87+ messages in thread
From: Shiju Jose @ 2025-01-14 12:31 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, bp@alien8.de, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

Hi Mauro,

Thanks for the comments.

>-----Original Message-----
>From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
>Sent: 14 January 2025 11:48
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org; linux-
>acpi@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org;
>bp@alien8.de; tony.luck@intel.com; rafael@kernel.org; lenb@kernel.org;
>mchehab@kernel.org; dan.j.williams@intel.com; dave@stgolabs.net; Jonathan
>Cameron <jonathan.cameron@huawei.com>; dave.jiang@intel.com;
>alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
>david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com;
>Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com;
>Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
>naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com;
>somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com;
>duenwen@google.com; gthelen@google.com;
>wschwartz@amperecomputing.com; dferguson@amperecomputing.com;
>wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei
><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto
>Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com;
>wanghuiqiang <wanghuiqiang@huawei.com>; Linuxarm
><linuxarm@huawei.com>
>Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
>
>Em Mon, 6 Jan 2025 12:10:00 +0000
><shiju.jose@huawei.com> escreveu:
>
>> From: Shiju Jose <shiju.jose@huawei.com>
>>
>> Add a generic EDAC memory repair control driver to manage memory repairs
>> in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing
>> features.
>>
>> For example, a CXL device with DRAM components that support PPR features
>> may implement PPR maintenance operations. DRAM components may support
>two
>> types of PPR, hard PPR, for a permanent row repair, and soft PPR,  for a
>> temporary row repair. Soft PPR is much faster than hard PPR, but the repair
>> is lost with a power cycle.
>> Similarly a CXL memory device may support soft and hard memory sparing at
>> cacheline, row, bank and rank granularities. Memory sparing is defined as
>> a repair function that replaces a portion of memory with a portion of
>> functional memory at that same granularity.
>> When a CXL device detects an error in a memory, it may report the host of
>> the need for a repair maintenance operation by using an event record where
>> the "maintenance needed" flag is set. The event records contains the device
>> physical address(DPA) and other attributes of the memory to repair (such as
>> channel, sub-channel, bank group, bank, rank, row, column etc). The kernel
>> will report the corresponding CXL general media or DRAM trace event to
>> userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
>> operation in response to the device request via the sysfs repair control.
>>
>> Device with memory repair features registers with EDAC device driver,
>> which retrieves memory repair descriptor from EDAC memory repair driver
>> and exposes the sysfs repair control attributes to userspace in
>> /sys/bus/edac/devices/<dev-name>/mem_repairX/.
>>
>> The common memory repair control interface abstracts the control of
>> arbitrary memory repair functionality into a standardized set of functions.
>> The sysfs memory repair attribute nodes are only available if the client
>> driver has implemented the corresponding attribute callback function and
>> provided operations to the EDAC device driver during registration.
>>
>> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
>> ---
>>  .../ABI/testing/sysfs-edac-memory-repair      | 244 +++++++++
>>  Documentation/edac/features.rst               |   3 +
>>  Documentation/edac/index.rst                  |   1 +
>>  Documentation/edac/memory_repair.rst          | 101 ++++
>>  drivers/edac/Makefile                         |   2 +-
>>  drivers/edac/edac_device.c                    |  33 ++
>>  drivers/edac/mem_repair.c                     | 492 ++++++++++++++++++
>>  include/linux/edac.h                          | 139 +++++
>>  8 files changed, 1014 insertions(+), 1 deletion(-)
>>  create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
>>  create mode 100644 Documentation/edac/memory_repair.rst
>>  create mode 100755 drivers/edac/mem_repair.c
>>
>> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair
>b/Documentation/ABI/testing/sysfs-edac-memory-repair
>> new file mode 100644
>> index 000000000000..e9268f3780ed
>> --- /dev/null
>> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
>> @@ -0,0 +1,244 @@
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
>> +		pertains to the memory media repair features control, such as
>> +		PPR (Post Package Repair), memory sparing etc, where <dev-name>
>> +		directory corresponds to a device registered with the EDAC
>> +		device driver for the memory repair features.
>> +
>> +		Post Package Repair is a maintenance operation that requests the
>> +		memory device to perform a repair operation on its media. In
>> +		detail, it is a memory self-healing feature that fixes a failing
>> +		memory location by replacing it with a spare row in a DRAM
>> +		device. For example, a CXL memory device with DRAM components
>> +		that support PPR features may implement PPR maintenance
>> +		operations. DRAM components may support two types of PPR
>> +		functions: hard PPR, for a permanent row repair, and soft PPR,
>> +		for a temporary row repair. Soft PPR is much faster than hard
>> +		PPR, but the repair is lost with a power cycle.
>> +
>> +		Memory sparing is a repair function that replaces a portion
>> +		of memory with a portion of functional memory at that same
>> +		sparing granularity. Memory sparing has cacheline/row/bank/rank
>> +		sparing granularities. For example, in memory-sparing mode,
>> +		one memory rank serves as a spare for other ranks on the same
>> +		channel in case they fail. The spare rank is held in reserve and
>> +		not used as active memory until a failure is indicated, with
>> +		reserved capacity subtracted from the total available memory
>> +		in the system. The DIMM installation order for memory sparing
>> +		varies based on the number of processors and memory modules
>> +		installed in the server. After an error threshold is surpassed
>> +		in a system protected by memory sparing, the content of a failing
>> +		rank of DIMMs is copied to the spare rank. The failing rank is
>> +		then taken offline and the spare rank placed online for use as
>> +		active memory in place of the failed rank.
>> +
>> +		The sysfs attribute nodes for a repair feature are only
>> +		present if the parent driver has implemented the corresponding
>> +		attr callback function and provided the necessary operations
>> +		to the EDAC device driver during registration.
>> +
>> +		In some states of system configuration (e.g. before address
>> +		decoders have been configured), memory devices (e.g. CXL)
>> +		may not have an active mapping in the main host address
>> +		physical address map. As such, the memory to repair must be
>> +		identified by a device specific physical addressing scheme
>> +		using a device physical address(DPA). The DPA and other control
>> +		attributes to use will be presented in related error records.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RO) Memory repair function type. For eg. post package repair,
>> +		memory sparing etc.
>> +		EDAC_SOFT_PPR - Soft post package repair
>> +		EDAC_HARD_PPR - Hard post package repair
>> +		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
>> +		EDAC_ROW_MEM_SPARING - Row memory sparing
>> +		EDAC_BANK_MEM_SPARING - Bank memory sparing
>> +		EDAC_RANK_MEM_SPARING - Rank memory sparing
>> +		All other values are reserved.
>
>Too big strings. Why are they in upper case? IMO:
>
>	soft-ppr, hard-ppr, ... would be enough.
>
Here the repair type of the memory repair instance is returned as a single value
(such as 0, 1, or 2), not as a decoded string (e.g. "EDAC_SOFT_PPR"). The values
correspond to the enums (EDAC_SOFT_PPR, EDAC_HARD_PPR, ... etc) defined for the
memory repair interface in include/linux/edac.h.

enum edac_mem_repair_function {
	EDAC_SOFT_PPR,
	EDAC_HARD_PPR,
	EDAC_CACHELINE_MEM_SPARING,
	EDAC_ROW_MEM_SPARING,
	EDAC_BANK_MEM_SPARING,
	EDAC_RANK_MEM_SPARING,
};
  
I documented return value in terms of the above enums.

>Also, is it mandatory that all types are supported? If not, you need a
>way to report to userspace which of them are supported. One option
>would be that reading /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
>would return something like:
>
>	soft-ppr [hard-ppr] row-mem-sparing
>
Same as above. It is not returned in the decoded string format.
 
>Also, as this will be parsed in ReST format, you need to change the
>description to use bullets, otherwise the html/pdf version of the
>document will place everything on a single line. E.g. something like:
Sure.

>
>Description:
>		(RO) Memory repair function type. For eg. post package repair,
>		memory sparing etc. Can be:
>
>		- EDAC_SOFT_PPR - Soft post package repair
>		- EDAC_HARD_PPR - Hard post package repair
>		- EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
>		- EDAC_ROW_MEM_SPARING - Row memory sparing
>		- EDAC_BANK_MEM_SPARING - Bank memory sparing
>		- EDAC_RANK_MEM_SPARING - Rank memory sparing
>		- All other values are reserved.
>
>Same applies to other sysfs nodes. See for instance:
>
>	Documentation/ABI/stable/sysfs-class-backlight
>
>And see how it is formatted after Sphinx processing at the Kernel
>Admin guide:
>
>	https://www.kernel.org/doc/html/latest/admin-guide/abi-stable.html#symbols-under-sys-class
>
>Please fix it on all places you have a list of values.
Sure.
>
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RW) Read/Write the current persist repair mode set for a
>> +		repair function. Which persist repair modes the device supports
>> +		depends on the memory repair function; a repair is either
>> +		temporary (lost with a power cycle) or permanent.
>> +		EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
>> +		EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
>> +		All other values are reserved.
>
>Same here: edac/ is already in the path. No need to place EDAC_ at the name.
>
Same as above. A single value is returned, not a decoded string, but it is
documented in terms of the enums defined for the interface in include/linux/edac.h.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa_support
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RO) True if the memory device requires the device physical
>> +		address (DPA) of the memory to repair.
>> +		False if the memory device requires the host specific physical
>> +                address (HPA) of memory to repair.
>
>Please remove the extra spaces before "address", as otherwise conversion to
>ReST may do the wrong thing or may produce doc warnings.
Will fix.
>
>> +		In some states of system configuration (e.g. before address
>> +		decoders have been configured), memory devices (e.g. CXL)
>> +		may not have an active mapping in the main host address
>> +		physical address map. As such, the memory to repair must be
>> +		identified by a device specific physical addressing scheme
>> +		using a DPA. The device physical address(DPA) to use will be
>> +		presented in related error records.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RO) True if memory media is accessible and data is retained
>> +		during the memory repair operation.
>> +		The data may not be retained and memory requests may not be
>> +		correctly processed during a repair operation. In such case
>> +		the repair operation should not be executed at runtime.
>
>Please add an extra line before "The data" to ensure that the output at
>the admin-guide won't merge the two paragraphs. Same on other places along
>this patch series: paragraphs need a blank line at the description.
>
Sure.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/hpa
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RW) Host Physical Address (HPA) of the memory to repair.
>> +		See attribute 'dpa_support' for more details.
>> +		The HPA to use will be provided in related error records.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RW) Device Physical Address (DPA) of the memory to repair.
>> +		See attribute 'dpa_support' for more details.
>> +		The specific DPA to use will be provided in related error
>> +		records.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RW) Read/Write Nibble mask of the memory to repair.
>> +		Nibble mask identifies one or more nibbles in error on the
>> +		memory bus that produced the error event. Nibble Mask bit 0
>> +		shall be set if nibble 0 on the memory bus produced the
>> +		event, etc. For example, CXL PPR and sparing, a nibble mask
>> +		bit set to 1 indicates the request to perform repair
>> +		operation in the specific device. All nibble mask bits set
>> +		to 1 indicates the request to perform the operation in all
>> +		devices. For CXL memory to repair, the specific value of
>> +		nibble mask to use will be provided in related error records.
>> +		For more details, See nibble mask field in CXL spec ver 3.1,
>> +		section 8.2.9.7.1.2 Table 8-103 soft PPR and section
>> +		8.2.9.7.1.3 Table 8-104 hard PPR, section 8.2.9.7.1.4
>> +		Table 8-105 memory sparing.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/rank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/row
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/column
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RW) The control attributes associated with memory address
>> +		that is to be repaired. The specific value of attributes to
>> +		use depends on the portion of memory to repair and may be
>> +		reported to host in related error records and may be
>> +		available to userspace in trace events, such as in CXL
>> +		memory devices.
>> +
>> +		channel - The channel of the memory to repair. Channel is
>> +		defined as an interface that can be independently accessed
>> +		for a transaction.
>> +		rank - The rank of the memory to repair. Rank is defined as a
>> +		set of memory devices on a channel that together execute a
>> +		transaction.
>> +		bank_group - The bank group of the memory to repair.
>> +		bank - The bank number of the memory to repair.
>> +		row - The row number of the memory to repair.
>> +		column - The column number of the memory to repair.
>> +		sub_channel - The sub-channel of the memory to repair.
>
>Same problem here with regards to bad ReST input. I would do:
>
>	channel
>		The channel of the memory to repair. Channel is
>		defined as an interface that can be independently accessed
>		for a transaction.
>
>	rank
>		The rank of the memory to repair. Rank is defined as a
>		set of memory devices on a channel that together execute a
>		transaction.
>
Sure. Will fix.
>as this would provide a better output at admin-guide while still being
>nicer to read as text.
>
>> +
>> +		The requirement to set these attributes varies based on the
>> +		repair function. The attributes in sysfs are not present
>> +		unless required for a repair function.
>> +		For example, for CXL spec ver 3.1, Section 8.2.9.7.1.2 Table
>> +		8-103 soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard PPR
>> +		operations, these attributes are not required to be set.
>> +		For example, for CXL spec ver 3.1, Section 8.2.9.7.1.4 Table
>> +		8-105 memory sparing, these attributes are required to be set
>> +		based on the memory sparing granularity as follows.
>> +		Channel: Channel associated with the DPA that is to be spared
>> +		and applies to all subclasses of sparing (cacheline, bank,
>> +		row and rank sparing).
>> +		Rank: Rank associated with the DPA that is to be spared and
>> +		applies to all subclasses of sparing.
>> +		Bank & Bank Group: Bank & bank group are associated with
>> +		the DPA that is to be spared and apply to the cacheline
>> +		sparing, row sparing and bank sparing subclasses.
>> +		Row: Row associated with the DPA that is to be spared and
>> +		applies to the cacheline sparing and row sparing subclasses.
>> +		Column: Column associated with the DPA that is to be spared
>> +		and applies to cacheline sparing only.
>> +		Sub-channel: Sub-channel associated with the DPA that is to
>> +		be spared and applies to cacheline sparing only.
>
>Same here: this will all be on a single paragraph which would be really
>weird.
Will fix.
>
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RW) The supported range of control attributes (optional)
>> +		associated with a memory address that is to be repaired.
>> +		The memory device may give the supported ranges of the
>> +		attributes to use, which depend on the memory device
>> +		and the portion of memory to repair.
>> +		Userspace may receive the specific values of the attributes
>> +		to use for a repair operation from the memory device via
>> +		related error records and trace events, such as in CXL
>> +		memory devices.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(WO) Issue the memory repair operation for the specified
>> +		memory repair attributes. The operation may fail if resources
>> +		are insufficient based on the requirements of the memory
>> +		device and repair function.
>> +		EDAC_DO_MEM_REPAIR - issue repair operation.
>> +		All other values are reserved.
>> diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
>> index ba3ab993ee4f..bfd5533b81b7 100644
>> --- a/Documentation/edac/features.rst
>> +++ b/Documentation/edac/features.rst
>> @@ -97,3 +97,6 @@ RAS features
>>  ------------
>>  1. Memory Scrub
>>  Memory scrub features are documented in `Documentation/edac/scrub.rst`.
>> +
>> +2. Memory Repair
>> +Memory repair features are documented in `Documentation/edac/memory_repair.rst`.
>> diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
>> index dfb0c9fb9ab1..d6778f4562dd 100644
>> --- a/Documentation/edac/index.rst
>> +++ b/Documentation/edac/index.rst
>> @@ -8,4 +8,5 @@ EDAC Subsystem
>>     :maxdepth: 1
>>
>>     features
>> +   memory_repair
>>     scrub
>> diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst
>> new file mode 100644
>> index 000000000000..2787a8a2d6ba
>> --- /dev/null
>> +++ b/Documentation/edac/memory_repair.rst
>> @@ -0,0 +1,101 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +==========================
>> +EDAC Memory Repair Control
>> +==========================
>> +
>> +Copyright (c) 2024 HiSilicon Limited.
>> +
>> +:Author:   Shiju Jose <shiju.jose@huawei.com>
>> +:License:  The GNU Free Documentation License, Version 1.2
>> +          (dual licensed under the GPL v2)
>> +:Original Reviewers:
>> +
>> +- Written for: 6.14
>
>See my comments with regards to license on the previous patches.
Ok.
>
>> +
>> +Introduction
>> +------------
>> +Memory devices may support repair operations to address issues in their
>> +memory media. Post Package Repair (PPR) and memory sparing are examples
>> +of such features.
>> +
>> +Post Package Repair(PPR)
>> +~~~~~~~~~~~~~~~~~~~~~~~~
>> +Post Package Repair is a maintenance operation that requests the memory
>> +device to perform a repair operation on its media. It is a memory
>> +self-healing feature that fixes a failing memory location by replacing it
>> +with a spare row in a DRAM device. For example, a CXL memory device with
>> +DRAM components that support PPR features may implement PPR maintenance
>> +operations. DRAM components may support two types of PPR functions:
>> +hard PPR, for a permanent row repair, and soft PPR, for a temporary row
>> +repair. Soft PPR is much faster than hard PPR, but the repair is lost
>> +with a power cycle. The data may not be retained and memory requests may
>> +not be correctly processed during a repair operation. In such cases, the
>> +repair operation should not be executed at runtime.
>> +In CXL memory devices, for example, soft PPR and hard PPR repair
>> +operations may be supported. See CXL spec rev 3.1 sections 8.2.9.7.1.1
>> +PPR Maintenance Operations, 8.2.9.7.1.2 sPPR Maintenance Operation and
>> +8.2.9.7.1.3 hPPR Maintenance Operation for more details.
>
>Paragraphs require blank lines in ReST. Also, please place a link to the
>specs.
>
>I strongly suggest looking at the output of all docs with make htmldocs
>and make pdfdocs to be sure that the paragraphs and the final document
>will be properly handled. You may use:
>
>	SPHINXDIRS="<book name(s)>"
>
>to speed-up documentation builds.
>
>Please see Sphinx documentation for more details about what it is expected
>there:
>
>	https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html
Thanks for the information. I will check and fix.
I had fixed the blank line requirements in most of the main documentation,
but was not aware of the location of the output for the ABI docs and missed it.
>
>> +
>> +Memory Sparing
>> +~~~~~~~~~~~~~~
>> +Memory sparing is a repair function that replaces a portion of memory with
>> +a portion of functional memory at that same sparing granularity. Memory
>> +sparing has cacheline/row/bank/rank sparing granularities. For example, in
>> +memory-sparing mode, one memory rank serves as a spare for other ranks on
>> +the same channel in case they fail. The spare rank is held in reserve and
>> +not used as active memory until a failure is indicated, with reserved
>> +capacity subtracted from the total available memory in the system. The DIMM
>> +installation order for memory sparing varies based on the number of
>> +processors and memory modules installed in the server. After an error
>> +threshold is surpassed in a system protected by memory sparing, the content
>> +of a failing rank of DIMMs is copied to the spare rank. The failing rank is
>> +then taken offline and the spare rank placed online for use as active
>> +memory in place of the failed rank.
>> +
>> +For example, CXL memory devices may support various subclasses of the
>> +sparing operation that vary in terms of the scope of the sparing being
>> +performed. The cacheline sparing subclass refers to a sparing action that
>> +can replace a full cacheline. Row sparing is provided as an alternative to
>> +PPR sparing functions and its scope is that of a single DDR row. Bank
>> +sparing allows an entire bank to be replaced. Rank sparing is defined as an
>> +operation in which an entire DDR rank is replaced. See CXL spec 3.1 section
>> +8.2.9.7.1.4 Memory Sparing Maintenance Operations for more details.
>> +
>> +Use cases of generic memory repair features control
>> +---------------------------------------------------
>> +
>> +1. The soft PPR, hard PPR and memory-sparing features share similar
>> +control attributes. Therefore, there is a need for a standardized, generic
>> +sysfs repair control that is exposed to userspace and used by
>> +administrators, scripts and tools.
>> +
>> +2. When a CXL device detects an error in a memory component, it may inform
>> +the host of the need for a repair maintenance operation by using an event
>> +record where the "maintenance needed" flag is set. The event record
>> +specifies the device physical address (DPA) and attributes of the memory
>> +that requires repair. The kernel reports the corresponding CXL general
>> +media or DRAM trace event to userspace, and userspace tools (e.g.
>> +rasdaemon) initiate a repair maintenance operation in response to the
>> +device request using the sysfs repair control.
>> +
>> +3. Userspace tools, such as rasdaemon, may request a PPR/sparing on a
>> +memory region when an uncorrected memory error or an excess of corrected
>> +memory errors is reported on that memory.
>> +
>> +4. Multiple PPR/sparing instances may be present per memory device.
>> +
>> +The File System
>> +---------------
>> +
>> +The control attributes of a registered memory repair instance can be
>> +accessed under
>> +
>> +/sys/bus/edac/devices/<dev-name>/mem_repairX/
>> +
>> +sysfs
>> +-----
>> +
>> +Sysfs files are documented in
>> +
>> +`Documentation/ABI/testing/sysfs-edac-memory-repair`.
>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>> index 3a49304860f0..1de9fe66ac6b 100644
>> --- a/drivers/edac/Makefile
>> +++ b/drivers/edac/Makefile
>> @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC)			:= edac_core.o
>>
>>  edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
>>  edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
>> -edac_core-y	+= scrub.o ecs.o
>> +edac_core-y	+= scrub.o ecs.o mem_repair.o
>>
>>  edac_core-$(CONFIG_EDAC_DEBUG)		+= debugfs.o
>>
>> diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
>> index 1c1142a2e4e4..a401d81dad8a 100644
>> --- a/drivers/edac/edac_device.c
>> +++ b/drivers/edac/edac_device.c
>> @@ -575,6 +575,7 @@ static void edac_dev_release(struct device *dev)
>>  {
>>  	struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev);
>>
>> +	kfree(ctx->mem_repair);
>>  	kfree(ctx->scrub);
>>  	kfree(ctx->dev.groups);
>>  	kfree(ctx);
>> @@ -611,6 +612,7 @@ int edac_dev_register(struct device *parent, char *name,
>>  	const struct attribute_group **ras_attr_groups;
>>  	struct edac_dev_data *dev_data;
>>  	struct edac_dev_feat_ctx *ctx;
>> +	int mem_repair_cnt = 0;
>>  	int attr_gcnt = 0;
>>  	int scrub_cnt = 0;
>>  	int ret, feat;
>> @@ -628,6 +630,10 @@ int edac_dev_register(struct device *parent, char *name,
>>  		case RAS_FEAT_ECS:
>>  			attr_gcnt += ras_features[feat].ecs_info.num_media_frus;
>>  			break;
>> +		case RAS_FEAT_MEM_REPAIR:
>> +			attr_gcnt++;
>> +			mem_repair_cnt++;
>> +			break;
>>  		default:
>>  			return -EINVAL;
>>  		}
>> @@ -651,8 +657,17 @@ int edac_dev_register(struct device *parent, char *name,
>>  		}
>>  	}
>>
>> +	if (mem_repair_cnt) {
>> +		ctx->mem_repair = kcalloc(mem_repair_cnt, sizeof(*ctx->mem_repair), GFP_KERNEL);
>> +		if (!ctx->mem_repair) {
>> +			ret = -ENOMEM;
>> +			goto data_mem_free;
>> +		}
>> +	}
>> +
>>  	attr_gcnt = 0;
>>  	scrub_cnt = 0;
>> +	mem_repair_cnt = 0;
>>  	for (feat = 0; feat < num_features; feat++, ras_features++) {
>>  		switch (ras_features->ft_type) {
>>  		case RAS_FEAT_SCRUB:
>> @@ -686,6 +701,23 @@ int edac_dev_register(struct device *parent, char *name,
>>
>>  			attr_gcnt += ras_features->ecs_info.num_media_frus;
>>  			break;
>> +		case RAS_FEAT_MEM_REPAIR:
>> +			if (!ras_features->mem_repair_ops ||
>> +			    mem_repair_cnt != ras_features->instance)
>> +				goto data_mem_free;
>> +
>> +			dev_data = &ctx->mem_repair[mem_repair_cnt];
>> +			dev_data->instance = mem_repair_cnt;
>> +			dev_data->mem_repair_ops = ras_features->mem_repair_ops;
>> +			dev_data->private = ras_features->ctx;
>> +			ret = edac_mem_repair_get_desc(parent, &ras_attr_groups[attr_gcnt],
>> +						       ras_features->instance);
>> +			if (ret)
>> +				goto data_mem_free;
>> +
>> +			mem_repair_cnt++;
>> +			attr_gcnt++;
>> +			break;
>>  		default:
>>  			ret = -EINVAL;
>>  			goto data_mem_free;
>> @@ -712,6 +744,7 @@ int edac_dev_register(struct device *parent, char *name,
>>  	return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev);
>>
>>  data_mem_free:
>> +	kfree(ctx->mem_repair);
>>  	kfree(ctx->scrub);
>>  groups_free:
>>  	kfree(ras_attr_groups);
>> diff --git a/drivers/edac/mem_repair.c b/drivers/edac/mem_repair.c
>> new file mode 100755
>> index 000000000000..e7439fd26c41
>> --- /dev/null
>> +++ b/drivers/edac/mem_repair.c
>> @@ -0,0 +1,492 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * The generic EDAC memory repair driver is designed to control the memory
>> + * devices with memory repair features, such as Post Package Repair (PPR),
>> + * memory sparing etc. The common sysfs memory repair interface abstracts
>> + * the control of various arbitrary memory repair functionalities into a
>> + * unified set of functions.
>> + *
>> + * Copyright (c) 2024 HiSilicon Limited.
>> + */
>> +
>> +#include <linux/edac.h>
>> +
>> +enum edac_mem_repair_attributes {
>> +	MEM_REPAIR_FUNCTION,
>> +	MEM_REPAIR_PERSIST_MODE,
>> +	MEM_REPAIR_DPA_SUPPORT,
>> +	MEM_REPAIR_SAFE_IN_USE,
>> +	MEM_REPAIR_HPA,
>> +	MEM_REPAIR_MIN_HPA,
>> +	MEM_REPAIR_MAX_HPA,
>> +	MEM_REPAIR_DPA,
>> +	MEM_REPAIR_MIN_DPA,
>> +	MEM_REPAIR_MAX_DPA,
>> +	MEM_REPAIR_NIBBLE_MASK,
>> +	MEM_REPAIR_MIN_NIBBLE_MASK,
>> +	MEM_REPAIR_MAX_NIBBLE_MASK,
>> +	MEM_REPAIR_BANK_GROUP,
>> +	MEM_REPAIR_MIN_BANK_GROUP,
>> +	MEM_REPAIR_MAX_BANK_GROUP,
>> +	MEM_REPAIR_BANK,
>> +	MEM_REPAIR_MIN_BANK,
>> +	MEM_REPAIR_MAX_BANK,
>> +	MEM_REPAIR_RANK,
>> +	MEM_REPAIR_MIN_RANK,
>> +	MEM_REPAIR_MAX_RANK,
>> +	MEM_REPAIR_ROW,
>> +	MEM_REPAIR_MIN_ROW,
>> +	MEM_REPAIR_MAX_ROW,
>> +	MEM_REPAIR_COLUMN,
>> +	MEM_REPAIR_MIN_COLUMN,
>> +	MEM_REPAIR_MAX_COLUMN,
>> +	MEM_REPAIR_CHANNEL,
>> +	MEM_REPAIR_MIN_CHANNEL,
>> +	MEM_REPAIR_MAX_CHANNEL,
>> +	MEM_REPAIR_SUB_CHANNEL,
>> +	MEM_REPAIR_MIN_SUB_CHANNEL,
>> +	MEM_REPAIR_MAX_SUB_CHANNEL,
>> +	MEM_DO_REPAIR,
>> +	MEM_REPAIR_MAX_ATTRS
>> +};
>> +
>> +struct edac_mem_repair_dev_attr {
>> +	struct device_attribute dev_attr;
>> +	u8 instance;
>> +};
>> +
>> +struct edac_mem_repair_context {
>> +	char name[EDAC_FEAT_NAME_LEN];
>> +	struct edac_mem_repair_dev_attr mem_repair_dev_attr[MEM_REPAIR_MAX_ATTRS];
>> +	struct attribute *mem_repair_attrs[MEM_REPAIR_MAX_ATTRS + 1];
>> +	struct attribute_group group;
>> +};
>> +
>> +#define TO_MEM_REPAIR_DEV_ATTR(_dev_attr)      \
>> +		container_of(_dev_attr, struct edac_mem_repair_dev_attr, dev_attr)
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_SHOW(attrib, cb, type, format)			\
>> +static ssize_t attrib##_show(struct device *ras_feat_dev,			\
>> +			     struct device_attribute *attr, char *buf)		\
>> +{										\
>> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
>> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
>> +	const struct edac_mem_repair_ops *ops =					\
>> +				ctx->mem_repair[inst].mem_repair_ops;		\
>> +	type data;								\
>> +	int ret;								\
>> +										\
>> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
>> +		      &data);							\
>> +	if (ret)								\
>> +		return ret;							\
>> +										\
>> +	return sysfs_emit(buf, format, data);					\
>> +}
>> +
>> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_function, get_repair_function, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(persist_mode, get_persist_mode, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa_support, get_dpa_support, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_safe_when_in_use, get_repair_safe_when_in_use, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(hpa, get_hpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_hpa, get_min_hpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_hpa, get_max_hpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(nibble_mask, get_nibble_mask, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_nibble_mask, get_min_nibble_mask, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_nibble_mask, get_max_nibble_mask, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(bank_group, get_bank_group, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank_group, get_min_bank_group, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank_group, get_max_bank_group, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(bank, get_bank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank, get_min_bank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank, get_max_bank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(rank, get_rank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_rank, get_min_rank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_rank, get_max_rank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(row, get_row, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_row, get_min_row, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_row, get_max_row, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(column, get_column, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_column, get_min_column, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_column, get_max_column, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(channel, get_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_channel, get_min_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_channel, get_max_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(sub_channel, get_sub_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_sub_channel, get_min_sub_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_sub_channel, get_max_sub_channel, u32, "%u\n")
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_STORE(attrib, cb, type, conv_func)			\
>> +static ssize_t attrib##_store(struct device *ras_feat_dev,			\
>> +			      struct device_attribute *attr,			\
>> +			      const char *buf, size_t len)			\
>> +{										\
>> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
>> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
>> +	const struct edac_mem_repair_ops *ops =					\
>> +				ctx->mem_repair[inst].mem_repair_ops;		\
>> +	type data;								\
>> +	int ret;								\
>> +										\
>> +	ret = conv_func(buf, 0, &data);						\
>> +	if (ret < 0)								\
>> +		return ret;							\
>> +										\
>> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
>> +		      data);							\
>> +	if (ret)								\
>> +		return ret;							\
>> +										\
>> +	return len;								\
>> +}
>> +
>> +EDAC_MEM_REPAIR_ATTR_STORE(persist_mode, set_persist_mode, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(nibble_mask, set_nibble_mask, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(bank_group, set_bank_group, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(row, set_row, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(column, set_column, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(channel, set_channel, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(sub_channel, set_sub_channel, unsigned long, kstrtoul)
>> +
>> +#define EDAC_MEM_REPAIR_DO_OP(attrib, cb)					\
>> +static ssize_t attrib##_store(struct device *ras_feat_dev,			\
>> +			      struct device_attribute *attr,			\
>> +			      const char *buf, size_t len)			\
>> +{										\
>> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
>> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
>> +	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;	\
>> +	unsigned long data;							\
>> +	int ret;								\
>> +										\
>> +	ret = kstrtoul(buf, 0, &data);						\
>> +	if (ret < 0)								\
>> +		return ret;							\
>> +										\
>> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, data);	\
>> +	if (ret)								\
>> +		return ret;							\
>> +										\
>> +	return len;								\
>> +}
>> +
>> +EDAC_MEM_REPAIR_DO_OP(repair, do_repair)
>> +
>> +static umode_t mem_repair_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
>> +{
>> +	struct device *ras_feat_dev = kobj_to_dev(kobj);
>> +	struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr);
>> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
>> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(dev_attr)->instance;
>> +	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;
>> +
>> +	switch (attr_id) {
>> +	case MEM_REPAIR_FUNCTION:
>> +		if (ops->get_repair_function)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_PERSIST_MODE:
>> +		if (ops->get_persist_mode) {
>> +			if (ops->set_persist_mode)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_DPA_SUPPORT:
>> +		if (ops->get_dpa_support)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_SAFE_IN_USE:
>> +		if (ops->get_repair_safe_when_in_use)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_HPA:
>> +		if (ops->get_hpa) {
>> +			if (ops->set_hpa)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_HPA:
>> +		if (ops->get_min_hpa)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_HPA:
>> +		if (ops->get_max_hpa)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_DPA:
>> +		if (ops->get_dpa) {
>> +			if (ops->set_dpa)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_DPA:
>> +		if (ops->get_min_dpa)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_DPA:
>> +		if (ops->get_max_dpa)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_NIBBLE_MASK:
>> +		if (ops->get_nibble_mask) {
>> +			if (ops->set_nibble_mask)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_NIBBLE_MASK:
>> +		if (ops->get_min_nibble_mask)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_NIBBLE_MASK:
>> +		if (ops->get_max_nibble_mask)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_BANK_GROUP:
>> +		if (ops->get_bank_group) {
>> +			if (ops->set_bank_group)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_BANK_GROUP:
>> +		if (ops->get_min_bank_group)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_BANK_GROUP:
>> +		if (ops->get_max_bank_group)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_BANK:
>> +		if (ops->get_bank) {
>> +			if (ops->set_bank)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_BANK:
>> +		if (ops->get_min_bank)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_BANK:
>> +		if (ops->get_max_bank)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_RANK:
>> +		if (ops->get_rank) {
>> +			if (ops->set_rank)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_RANK:
>> +		if (ops->get_min_rank)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_RANK:
>> +		if (ops->get_max_rank)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_ROW:
>> +		if (ops->get_row) {
>> +			if (ops->set_row)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_ROW:
>> +		if (ops->get_min_row)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_ROW:
>> +		if (ops->get_max_row)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_COLUMN:
>> +		if (ops->get_column) {
>> +			if (ops->set_column)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_COLUMN:
>> +		if (ops->get_min_column)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_COLUMN:
>> +		if (ops->get_max_column)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_CHANNEL:
>> +		if (ops->get_channel) {
>> +			if (ops->set_channel)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_CHANNEL:
>> +		if (ops->get_min_channel)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_CHANNEL:
>> +		if (ops->get_max_channel)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_SUB_CHANNEL:
>> +		if (ops->get_sub_channel) {
>> +			if (ops->set_sub_channel)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_SUB_CHANNEL:
>> +		if (ops->get_min_sub_channel)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_SUB_CHANNEL:
>> +		if (ops->get_max_sub_channel)
>> +			return a->mode;
>> +		break;
>> +	case MEM_DO_REPAIR:
>> +		if (ops->do_repair)
>> +			return a->mode;
>> +		break;
>> +	default:
>> +		break;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_RO(_name, _instance)       \
>> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RO(_name),	\
>> +					     .instance = _instance })
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_WO(_name, _instance)       \
>> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_WO(_name),	\
>> +					     .instance = _instance })
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_RW(_name, _instance)       \
>> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RW(_name),	\
>> +					     .instance = _instance })
>> +
>> +static int mem_repair_create_desc(struct device *dev,
>> +				  const struct attribute_group **attr_groups,
>> +				  u8 instance)
>> +{
>> +	struct edac_mem_repair_context *ctx;
>> +	struct attribute_group *group;
>> +	int i;
>> +	struct edac_mem_repair_dev_attr dev_attr[] = {
>> +		[MEM_REPAIR_FUNCTION] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(repair_function, instance),
>> +		[MEM_REPAIR_PERSIST_MODE] =
>> +				EDAC_MEM_REPAIR_ATTR_RW(persist_mode, instance),
>> +		[MEM_REPAIR_DPA_SUPPORT] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(dpa_support, instance),
>> +		[MEM_REPAIR_SAFE_IN_USE] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(repair_safe_when_in_use, instance),
>> +		[MEM_REPAIR_HPA] = EDAC_MEM_REPAIR_ATTR_RW(hpa, instance),
>> +		[MEM_REPAIR_MIN_HPA] = EDAC_MEM_REPAIR_ATTR_RO(min_hpa, instance),
>> +		[MEM_REPAIR_MAX_HPA] = EDAC_MEM_REPAIR_ATTR_RO(max_hpa, instance),
>> +		[MEM_REPAIR_DPA] = EDAC_MEM_REPAIR_ATTR_RW(dpa, instance),
>> +		[MEM_REPAIR_MIN_DPA] = EDAC_MEM_REPAIR_ATTR_RO(min_dpa, instance),
>> +		[MEM_REPAIR_MAX_DPA] = EDAC_MEM_REPAIR_ATTR_RO(max_dpa, instance),
>> +		[MEM_REPAIR_NIBBLE_MASK] =
>> +				EDAC_MEM_REPAIR_ATTR_RW(nibble_mask, instance),
>> +		[MEM_REPAIR_MIN_NIBBLE_MASK] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(min_nibble_mask, instance),
>> +		[MEM_REPAIR_MAX_NIBBLE_MASK] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(max_nibble_mask, instance),
>> +		[MEM_REPAIR_BANK_GROUP] =
>> +				EDAC_MEM_REPAIR_ATTR_RW(bank_group, instance),
>> +		[MEM_REPAIR_MIN_BANK_GROUP] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(min_bank_group, instance),
>> +		[MEM_REPAIR_MAX_BANK_GROUP] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(max_bank_group, instance),
>> +		[MEM_REPAIR_BANK] = EDAC_MEM_REPAIR_ATTR_RW(bank, instance),
>> +		[MEM_REPAIR_MIN_BANK] = EDAC_MEM_REPAIR_ATTR_RO(min_bank, instance),
>> +		[MEM_REPAIR_MAX_BANK] = EDAC_MEM_REPAIR_ATTR_RO(max_bank, instance),
>> +		[MEM_REPAIR_RANK] = EDAC_MEM_REPAIR_ATTR_RW(rank, instance),
>> +		[MEM_REPAIR_MIN_RANK] = EDAC_MEM_REPAIR_ATTR_RO(min_rank, instance),
>> +		[MEM_REPAIR_MAX_RANK] = EDAC_MEM_REPAIR_ATTR_RO(max_rank, instance),
>> +		[MEM_REPAIR_ROW] = EDAC_MEM_REPAIR_ATTR_RW(row, instance),
>> +		[MEM_REPAIR_MIN_ROW] = EDAC_MEM_REPAIR_ATTR_RO(min_row, instance),
>> +		[MEM_REPAIR_MAX_ROW] = EDAC_MEM_REPAIR_ATTR_RO(max_row, instance),
>> +		[MEM_REPAIR_COLUMN] = EDAC_MEM_REPAIR_ATTR_RW(column, instance),
>> +		[MEM_REPAIR_MIN_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(min_column, instance),
>> +		[MEM_REPAIR_MAX_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(max_column, instance),
>> +		[MEM_REPAIR_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RW(channel, instance),
>> +		[MEM_REPAIR_MIN_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(min_channel, instance),
>> +		[MEM_REPAIR_MAX_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(max_channel, instance),
>> +		[MEM_REPAIR_SUB_CHANNEL] =
>> +				EDAC_MEM_REPAIR_ATTR_RW(sub_channel, instance),
>> +		[MEM_REPAIR_MIN_SUB_CHANNEL] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(min_sub_channel, instance),
>> +		[MEM_REPAIR_MAX_SUB_CHANNEL] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(max_sub_channel, instance),
>> +		[MEM_DO_REPAIR] = EDAC_MEM_REPAIR_ATTR_WO(repair, instance)
>> +	};
>> +
>> +	ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL);
>> +	if (!ctx)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < MEM_REPAIR_MAX_ATTRS; i++) {
>> +		memcpy(&ctx->mem_repair_dev_attr[i].dev_attr,
>> +		       &dev_attr[i], sizeof(dev_attr[i]));
>> +		ctx->mem_repair_attrs[i] =
>> +				&ctx->mem_repair_dev_attr[i].dev_attr.attr;
>> +	}
>> +
>> +	sprintf(ctx->name, "%s%d", "mem_repair", instance);
>> +	group = &ctx->group;
>> +	group->name = ctx->name;
>> +	group->attrs = ctx->mem_repair_attrs;
>> +	group->is_visible  = mem_repair_attr_visible;
>> +	attr_groups[0] = group;
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * edac_mem_repair_get_desc - get EDAC memory repair descriptors
>> + * @dev: client device with memory repair feature
>> + * @attr_groups: pointer to attribute group container
>> + * @instance: device's memory repair instance number.
>> + *
>> + * Return:
>> + *  * %0	- Success.
>> + *  * %-EINVAL	- Invalid parameters passed.
>> + *  * %-ENOMEM	- Dynamic memory allocation failed.
>> + */
>> +int edac_mem_repair_get_desc(struct device *dev,
>> +			     const struct attribute_group **attr_groups, u8 instance)
>> +{
>> +	if (!dev || !attr_groups)
>> +		return -EINVAL;
>> +
>> +	return mem_repair_create_desc(dev, attr_groups, instance);
>> +}
>> diff --git a/include/linux/edac.h b/include/linux/edac.h
>> index 979e91426701..5d07192bf1a7 100644
>> --- a/include/linux/edac.h
>> +++ b/include/linux/edac.h
>> @@ -668,6 +668,7 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
>>  enum edac_dev_feat {
>>  	RAS_FEAT_SCRUB,
>>  	RAS_FEAT_ECS,
>> +	RAS_FEAT_MEM_REPAIR,
>>  	RAS_FEAT_MAX
>>  };
>>
>> @@ -729,11 +730,147 @@ int edac_ecs_get_desc(struct device *ecs_dev,
>>  		      const struct attribute_group **attr_groups,
>>  		      u16 num_media_frus);
>>
>> +enum edac_mem_repair_function {
>> +	EDAC_SOFT_PPR,
>> +	EDAC_HARD_PPR,
>> +	EDAC_CACHELINE_MEM_SPARING,
>> +	EDAC_ROW_MEM_SPARING,
>> +	EDAC_BANK_MEM_SPARING,
>> +	EDAC_RANK_MEM_SPARING,
>> +};
>> +
>> +enum edac_mem_repair_persist_mode {
>> +	EDAC_MEM_REPAIR_SOFT, /* soft memory repair */
>> +	EDAC_MEM_REPAIR_HARD, /* hard memory repair */
>> +};
>> +
>> +enum edac_mem_repair_cmd {
>> +	EDAC_DO_MEM_REPAIR = 1,
>> +};
>> +
>> +/**
>> + * struct edac_mem_repair_ops - memory repair operations
>> + * (all elements are optional except do_repair, set_hpa/set_dpa)
>> + * @get_repair_function: get the memory repair function, listed in
>> + *			 enum edac_mem_repair_function.
>> + * @get_persist_mode: get the current persist mode. Persist repair modes supported
>> + *		      in the device is based on the memory repair function which is
>> + *		      temporary or permanent and is lost with a power cycle.
>> + *		      EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
>> + *		      EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
>> + * All other values are reserved.
>> + * @set_persist_mode: set the persist mode of the memory repair instance.
>> + * @get_dpa_support: get dpa support flag. In some states of system configuration
>> + *		     (e.g. before address decoders have been configured), memory devices
>> + *		     (e.g. CXL) may not have an active mapping in the main host physical
>> + *		     address map. As such, the memory to repair must be identified
>> + *		     by a device specific physical addressing scheme using a device physical
>> + *		     address(DPA). The DPA and other control attributes to use for the
>> + *		     dry_run and repair operations will be presented in related error records.
>> + * @get_repair_safe_when_in_use: get whether memory media is accessible and
>> + *				 data is retained during repair operation.
>> + * @get_hpa: get current host physical address (HPA).
>> + * @set_hpa: set host physical address (HPA) of memory to repair.
>> + * @get_min_hpa: get the minimum supported host physical address (HPA).
>> + * @get_max_hpa: get the maximum supported host physical address (HPA).
>> + * @get_dpa: get current device physical address (DPA).
>> + * @set_dpa: set device physical address (DPA) of memory to repair.
>> + * @get_min_dpa: get the minimum supported device physical address
>(DPA).
>> + * @get_max_dpa: get the maximum supported device physical address
>(DPA).
>> + * @get_nibble_mask: get current nibble mask.
>> + * @set_nibble_mask: set nibble mask of memory to repair.
>> + * @get_min_nibble_mask: get the minimum supported nibble mask.
>> + * @get_max_nibble_mask: get the maximum supported nibble mask.
>> + * @get_bank_group: get current bank group.
>> + * @set_bank_group: set bank group of memory to repair.
>> + * @get_min_bank_group: get the minimum supported bank group.
>> + * @get_max_bank_group: get the maximum supported bank group.
>> + * @get_bank: get current bank.
>> + * @set_bank: set bank of memory to repair.
>> + * @get_min_bank: get the minimum supported bank.
>> + * @get_max_bank: get the maximum supported bank.
>> + * @get_rank: get current rank.
>> + * @set_rank: set rank of memory to repair.
>> + * @get_min_rank: get the minimum supported rank.
>> + * @get_max_rank: get the maximum supported rank.
>> + * @get_row: get current row.
>> + * @set_row: set row of memory to repair.
>> + * @get_min_row: get the minimum supported row.
>> + * @get_max_row: get the maximum supported row.
>> + * @get_column: get current column.
>> + * @set_column: set column of memory to repair.
>> + * @get_min_column: get the minimum supported column.
>> + * @get_max_column: get the maximum supported column.
>> + * @get_channel: get current channel.
>> + * @set_channel: set channel of memory to repair.
>> + * @get_min_channel: get the minimum supported channel.
>> + * @get_max_channel: get the maximum supported channel.
>> + * @get_sub_channel: get current sub channel.
>> + * @set_sub_channel: set sub channel of memory to repair.
>> + * @get_min_sub_channel: get the minimum supported sub channel.
>> + * @get_max_sub_channel: get the maximum supported sub channel.
>> + * @do_repair: Issue memory repair operation for the HPA/DPA and
>> + *	       other control attributes set for the memory to repair.
>> + */
>> +struct edac_mem_repair_ops {
>> +	int (*get_repair_function)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_persist_mode)(struct device *dev, void *drv_data, u32 *mode);
>> +	int (*set_persist_mode)(struct device *dev, void *drv_data, u32 mode);
>> +	int (*get_dpa_support)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_repair_safe_when_in_use)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_hpa)(struct device *dev, void *drv_data, u64 *hpa);
>> +	int (*set_hpa)(struct device *dev, void *drv_data, u64 hpa);
>> +	int (*get_min_hpa)(struct device *dev, void *drv_data, u64 *hpa);
>> +	int (*get_max_hpa)(struct device *dev, void *drv_data, u64 *hpa);
>> +	int (*get_dpa)(struct device *dev, void *drv_data, u64 *dpa);
>> +	int (*set_dpa)(struct device *dev, void *drv_data, u64 dpa);
>> +	int (*get_min_dpa)(struct device *dev, void *drv_data, u64 *dpa);
>> +	int (*get_max_dpa)(struct device *dev, void *drv_data, u64 *dpa);
>> +	int (*get_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*set_nibble_mask)(struct device *dev, void *drv_data, u64 val);
>> +	int (*get_min_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*get_max_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*get_bank_group)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_bank_group)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_bank_group)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_bank_group)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_bank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_bank)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_bank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_bank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_rank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_rank)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_rank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_rank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_row)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*set_row)(struct device *dev, void *drv_data, u64 val);
>> +	int (*get_min_row)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*get_max_row)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*get_column)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_column)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_column)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_column)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_channel)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_sub_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_sub_channel)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_sub_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_sub_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*do_repair)(struct device *dev, void *drv_data, u32 val);
>> +};
>> +
>> +int edac_mem_repair_get_desc(struct device *dev,
>> +			     const struct attribute_group **attr_groups,
>> +			     u8 instance);
>> +
>>  /* EDAC device feature information structure */
>>  struct edac_dev_data {
>>  	union {
>>  		const struct edac_scrub_ops *scrub_ops;
>>  		const struct edac_ecs_ops *ecs_ops;
>> +		const struct edac_mem_repair_ops *mem_repair_ops;
>>  	};
>>  	u8 instance;
>>  	void *private;
>> @@ -744,6 +881,7 @@ struct edac_dev_feat_ctx {
>>  	void *private;
>>  	struct edac_dev_data *scrub;
>>  	struct edac_dev_data ecs;
>> +	struct edac_dev_data *mem_repair;
>>  };
>>
>>  struct edac_dev_feature {
>> @@ -752,6 +890,7 @@ struct edac_dev_feature {
>>  	union {
>>  		const struct edac_scrub_ops *scrub_ops;
>>  		const struct edac_ecs_ops *ecs_ops;
>> +		const struct edac_mem_repair_ops *mem_repair_ops;
>>  	};
>>  	void *ctx;
>>  	struct edac_ecs_ex_info ecs_info;
>
>Thanks,
>Mauro

Thanks,
Shiju

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-09 14:24         ` Jonathan Cameron
  2025-01-09 15:18           ` Borislav Petkov
@ 2025-01-14 12:38           ` Mauro Carvalho Chehab
  2025-01-14 13:05             ` Jonathan Cameron
  1 sibling, 1 reply; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-14 12:38 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Borislav Petkov, Shiju Jose, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	tony.luck@intel.com, rafael@kernel.org, lenb@kernel.org,
	mchehab@kernel.org, dan.j.williams@intel.com, dave@stgolabs.net,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

Em Thu, 9 Jan 2025 14:24:33 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:

> On Thu, 9 Jan 2025 13:32:22 +0100
> Borislav Petkov <bp@alien8.de> wrote:
> 
> Hi Boris,
> 
> > On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote:  
> > > The min_ and max_ attributes of the control attributes are added  for your
> > > feedback on V15 to expose supported ranges of these control attributes to the user, 
> > > in the following links.      
> > 
> > Sure, but you can make that differently:
> > 
> > cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
> > [x:y]
> > 
> > which is the allowed range.  
> 
> To my thinking that would fail the test of being an intuitive interface.
> To issue a repair command requires that multiple attributes be configured
> before triggering the actual repair.
> 
> Think of it as setting the coordinates of the repair in a high dimensional
> space.
> 
> In the extreme case of fine grained repair (Cacheline), to identify the
> relevant subunit of memory (obtained from the error record that we are
> basing the decision to repair on) we need to specify all of:
> 
> Channel, sub-channel, rank,  bank group, row, column and nibble mask.
> For coarser granularity repair only a subset of these applies, and
> only the relevant controls are exposed to userspace.
> 
> They are broken out as specific attributes to enable each to be set before
> triggering the action with a write to the repair attribute.
> 
> There are several possible alternatives:
> 
> Option 1
> 
> "A:B:C:D:E:F:G:H:I:J" opaque single write to trigger the repair where
> each number is providing one of those coordinates and where a readback
> lets us know what each number is.
> 
> That single attribute interface is very hard to extend in an intuitive way.
> 
> History tells us more levels will be introduced in the middle, not just
> at the finest granularity, making such an interface hard to extend in
> a backwards compatible way.
> 
> Another alternative of a key value list would make for a nasty sysfs
> interface.
> 
> Option 2 
> There are sysfs interfaces that use a selection type presentation.
> 
> Write: "C", Read: "A, B, [C], D" but that only works well for discrete sets
> of options and is a pain to parse if read back is necessary.

Writing it as:

	a b [c] d

or even:
	a, b, [c], d

doesn't make it hard to parse in userspace. Adding a comma makes the
kernel code a little bigger, as it needs an extra check in the loop
to see whether the line is empty or not:

	if (*tmp != '\0')
		tmp += sprintf(tmp, ", ");

Btw, we have an implementation like that in kernelspace/userspace for
the RC API:

- Kernelspace:
  https://github.com/torvalds/linux/blob/master/drivers/media/rc/rc-main.c#L1125
  6 lines of code + a const table with names/values, if we use the same example
  for EDAC:

	const struct { const char *name; u64 type; } names[] = {
		{ "foo", EDAC_FOO }, { "bar", EDAC_BAR },
	};

	for (i = 0; i < ARRAY_SIZE(names); i++) {
		if (enabled & names[i].type)
			tmp += sprintf(tmp, "[%s] ", names[i].name);
		else if (allowed & names[i].type)
			tmp += sprintf(tmp, "%s ", names[i].name);
	}


- Userspace:
  https://git.linuxtv.org/v4l-utils.git/tree/utils/keytable/keytable.c#n197
  5 lines of code + a const table, if we use the same example
  for ras-daemon:

		const char *name[] = {
			[EDAC_FOO] = "[foo]",
			[EDAC_BAR] = "[bar]",
		};

		for (p = strtok(arg, " ,"); p; p = strtok(NULL, " ,"))
			for (i = 0; i < ARRAY_SIZE(name); i++)
				if (!strcasecmp(p, name[i]))
					return i;
		return -1;

	(strtok handles both spaces and commas in the above example)

IMO, this is a lot better, as the alternative would be to have separate
sysfs nodes to describe what values are valid for a given edac devnode.

See, userspace needs to know what values are valid for a given
device, and support for it may vary depending on the kernel and
device version. So, we need to have the information about what values
are valid stored in some sysfs devnode, to allow backward compatibility.

> 
> So in conclusion, I think the proposed multiple sysfs attribute style
> with them reading back the most recent value written is the least bad
> solution to a complex control interface.
> 
> > 
> > echo ... 
> > 
> > then writes in the bank.
> >   
> > > ... so we would propose we do not add max_ and min_ for now and see how the
> > > use cases evolve.    
> > 
> > Yes, you should apply that same methodology to the rest of the new features
> > you're adding: only add functionality for the stuff that is actually being
> > used now. You can always extend it later.
> > 
> > Changing an already user-visible API is a whole different story and a lot lot
> > harder, even impossible.
> > 
> > So I'd suggest you prune the EDAC patches from all the hypothetical usage and
> > then send only what remains so that I can try to queue them.  
> 
> Sure. In this case the addition of min/max was perhaps a wrong response to
> your request for a way to discover those ranges rather than just rejecting a
> write of something out of range as earlier versions did.
> 
> We can revisit in future if range discovery becomes necessary.  Personally
> I don't think it is, given we are only taking these actions in response to
> error records that give us precisely what to write and hence are always in range.

For RO devnodes, there's no need for ranges, but those are likely needed for
RW, as otherwise userspace may try to write invalid requests and/or have
backward-compatibility issues.


Thanks,
Mauro


* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-09 16:01             ` Jonathan Cameron
  2025-01-09 16:19               ` Borislav Petkov
@ 2025-01-14 12:57               ` Mauro Carvalho Chehab
  1 sibling, 0 replies; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-14 12:57 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Borislav Petkov, Shiju Jose, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	tony.luck@intel.com, rafael@kernel.org, lenb@kernel.org,
	mchehab@kernel.org, dan.j.williams@intel.com, dave@stgolabs.net,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

Em Thu, 9 Jan 2025 16:01:59 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:


> > My goal here was to make this user-friendly. Because you need some way of
> > knowing what valid ranges are and in order to trigger the repair, if it needs
> > to happen for a range.  

IMO, user-friendliness is important, as it allows people to manually use the
feature. This is interesting for debugging purposes and also to test whether
some hardware is doing the right thing.

Ok, in practice, production will use a userspace tool like rasdaemon,
and/or some scripts [1].

[1] I'd say that rasdaemon should have an initialization phase to
discover capabilities that can be discovered.

As an example, rasdaemon could reserve some sparing memory at init
time, if the hardware (partially) supports it. For instance, a CXL
device might not be able to handle rank memory sparing, but could
still handle bank memory sparing.

> In at least the CXL case I'm fairly sure most of them are not discoverable.
> Until you see errors you have no idea what the memory topology is.

Sure, but some things can be discovered in advance, like which CXL
scrubbing features are supported by a given device.

If the hardware supports detecting ranges for row/bank/rank sparing,
it would be nice to have this reported in a way that userspace can
properly set it at OS init time, if desired by the sysadmins.

> > Or, you can teach the repair logic to ignore invalid ranges and "clamp" things
> > to whatever makes sense.  
> 
> For that you'd need to have a path to read back what happened.

If sysfs is RW, you have it there already after committing the value set.

> > Again, I'm looking at it from the usability perspective. I haven't actually
> > needed this scrub+repair functionality yet to know whether the UI makes sense.
> > So yeah, collecting some feedback from real-life use cases would probably give
> > you a lot better understanding of how that UI should be designed... perhaps
> > you won't ever need the ranges, who knows.
> > 
> > So yes, preemptively designing stuff like that "in the dark" is kinda hard.
> > :-)  
> 
> The discoverability is unnecessary for any known use case.
> 
> Ok. Then can we just drop the range discoverability entirely, or we go with
> your suggestion and do not support read back of what has been
> requested, but instead have the reads return a range if known or "" /
> return -EOPNOTSUPP if simply not known?

It sounds to me that ranges are needed at least to set up memory sparing.

> I can live with that, though to me we are heading in the direction of
> a less intuitive interface just to save a small number of additional files.
> 
> Jonathan
> 
> >   
> 



Thanks,
Mauro


* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-14 12:38           ` Mauro Carvalho Chehab
@ 2025-01-14 13:05             ` Jonathan Cameron
  2025-01-14 14:39               ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 87+ messages in thread
From: Jonathan Cameron @ 2025-01-14 13:05 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Borislav Petkov, Shiju Jose, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	tony.luck@intel.com, rafael@kernel.org, lenb@kernel.org,
	mchehab@kernel.org, dan.j.williams@intel.com, dave@stgolabs.net,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Tue, 14 Jan 2025 13:38:31 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Em Thu, 9 Jan 2025 14:24:33 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:
> 
> > On Thu, 9 Jan 2025 13:32:22 +0100
> > Borislav Petkov <bp@alien8.de> wrote:
> > 
> > Hi Boris,
> >   
> > > On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote:    
> > > > The min_ and max_ attributes of the control attributes are added  for your
> > > > feedback on V15 to expose supported ranges of these control attributes to the user, 
> > > > in the following links.        
> > > 
> > > Sure, but you can make that differently:
> > > 
> > > cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
> > > [x:y]
> > > 
> > > which is the allowed range.    
> > 
> > To my thinking that would fail the test of being an intuitive interface.
> > To issue a repair command requires that multiple attributes be configured
> > before triggering the actual repair.
> > 
> > Think of it as setting the coordinates of the repair in a high dimensional
> > space.
> > 
> > In the extreme case of fine grained repair (Cacheline), to identify the
> > relevant subunit of memory (obtained from the error record that we are
> > basing the decision to repair on) we need to specify all of:
> > 
> > Channel, sub-channel, rank,  bank group, row, column and nibble mask.
> > For coarser granularity repair only a subset of these applies, and
> > only the relevant controls are exposed to userspace.
> > 
> > They are broken out as specific attributes to enable each to be set before
> > triggering the action with a write to the repair attribute.
> > 
> > There are several possible alternatives:
> > 
> > Option 1
> > 
> > "A:B:C:D:E:F:G:H:I:J" opaque single write to trigger the repair where
> > each number is providing one of those coordinates and where a readback
> > lets us know what each number is.
> > 
> > That single attribute interface is very hard to extend in an intuitive way.
> > 
> > History tells us more levels will be introduced in the middle, not just
> > at the finest granularity, making such an interface hard to extend in
> > a backwards compatible way.
> > 
> > Another alternative of a key value list would make for a nasty sysfs
> > interface.
> > 
> > Option 2 
> > There are sysfs interfaces that use a selection type presentation.
> > 
> > Write: "C", Read: "A, B, [C], D" but that only works well for discrete sets
> > of options and is a pain to parse if read back is necessary.  
> 
> Writing it as:
> 
> 	a b [c] d
> 
> or even:
> 	a, b, [c], d
> 
> doesn't make it hard to parse in userspace. Adding a comma makes the
> kernel code a little bigger, as it needs an extra check in the loop
> to see whether the line is empty or not:
> 
> 	if (*tmp != '\0')
> 		tmp += sprintf(tmp, ", ");
> 
> Btw, we have an implementation like that in kernelspace/userspace for
> the RC API:
> 
> - Kernelspace:
>   https://github.com/torvalds/linux/blob/master/drivers/media/rc/rc-main.c#L1125
>   6 lines of code + a const table with names/values, if we use the same example
>   for EDAC:
> 
> 	const struct { const char *name; u64 type; } names[] = {
> 		{ "foo", EDAC_FOO }, { "bar", EDAC_BAR },
> 	};
> 
> 	for (i = 0; i < ARRAY_SIZE(names); i++) {
> 		if (enabled & names[i].type)
> 			tmp += sprintf(tmp, "[%s] ", names[i].name);
> 		else if (allowed & names[i].type)
> 			tmp += sprintf(tmp, "%s ", names[i].name);
> 	}
> 
> 
> - Userspace:
>   https://git.linuxtv.org/v4l-utils.git/tree/utils/keytable/keytable.c#n197
>   5 lines of code + a const table, if we use the same example
>   for ras-daemon:
> 
> 		const char *name[] = {
> 			[EDAC_FOO] = "[foo]",
> 			[EDAC_BAR] = "[bar]",
> 		};
> 
> 		for (p = strtok(arg, " ,"); p; p = strtok(NULL, " ,"))
> 			for (i = 0; i < ARRAY_SIZE(name); i++)
> 				if (!strcasecmp(p, name[i]))
> 					return i;
> 		return -1;
> 
> 	(strtok handles both spaces and commas in the above example)
> 
> IMO, this is a lot better, as the alternative would be to have separate
> sysfs nodes to describe what values are valid for a given edac devnode.
> 
> See, userspace needs to know what values are valid for a given
> device, and support for it may vary depending on the kernel and
> device version. So, we need to have the information about what values
> are valid stored in some sysfs devnode, to allow backward compatibility.

These aren't selectors from a discrete list, so the question is more
whether a syntax of
	<min> value <max>
is intuitive or not.  I'm not aware of precedent for this one.

There was another branch of the thread where Boris mentioned this as an
option. It isn't bad to deal with and is an easy change to the code,
but I have an open question on what choice we make for representing
unknown min / max.  For separate files the absence of the file
indicates we don't have any information.


> 
> > 
> > So in conclusion, I think the proposed multiple sysfs attribute style
> > with them reading back the most recent value written is the least bad
> > solution to a complex control interface.
> >   
> > > 
> > > echo ... 
> > > 
> > > then writes in the bank.
> > >     
> > > > ... so we would propose we do not add max_ and min_ for now and see how the
> > > > use cases evolve.      
> > > 
> > > Yes, you should apply that same methodology to the rest of the new features
> > > you're adding: only add functionality for the stuff that is actually being
> > > used now. You can always extend it later.
> > > 
> > > Changing an already user-visible API is a whole different story and a lot lot
> > > harder, even impossible.
> > > 
> > > So I'd suggest you prune the EDAC patches from all the hypothetical usage and
> > > then send only what remains so that I can try to queue them.    
> > 
> > Sure. In this case the addition of min/max was perhaps a wrong response to
> > your request for a way to discover those ranges rather than just rejecting a
> > write of something out of range as earlier versions did.
> > 
> > We can revisit in future if range discovery becomes necessary.  Personally
> > I don't think it is, given we are only taking these actions in response to
> > error records that give us precisely what to write and hence are always in range.
> 
> For RO devnodes, there's no need for ranges, but those are likely needed for
> RW, as otherwise userspace may try to write invalid requests and/or have
> backward-compatibility issues.

Given these parameters are only meaningfully written with values coming
ultimately from error records, userspace should never consider writing
something that is out of range except during testing.

I don't mind presenting the range where known (in the CXL case it is not
discoverable for most of them) but I wouldn't expect tooling to ever
read it, as the known correct values to write come from the error records.
Checking those values against provided limits seems an unnecessary step
given an invalid parameter that slips through will be rejected by the
hardware anyway.

Jonathan

> 
> 
> Thanks,
> Mauro



* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-09 18:34                 ` Jonathan Cameron
  2025-01-09 23:51                   ` Dan Williams
  2025-01-11 17:12                   ` Borislav Petkov
@ 2025-01-14 13:10                   ` Mauro Carvalho Chehab
  2 siblings, 0 replies; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-14 13:10 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Borislav Petkov, Shiju Jose, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	tony.luck@intel.com, rafael@kernel.org, lenb@kernel.org,
	mchehab@kernel.org, dan.j.williams@intel.com, dave@stgolabs.net,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

Em Thu, 9 Jan 2025 18:34:48 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:

> On Thu, 9 Jan 2025 17:19:02 +0100
> Borislav Petkov <bp@alien8.de> wrote:

> > 
> > But then why do you even need the interface at all?
> > 
> > Why can't the kernel automatically collect all those attributes and start the
> > scrubbing automatically - no need for any user interaction...?

Implementing policies in kernelspace is a very bad idea.

See, to properly implement scrubbing and memory sparing policies, one
needs to have knowledge not only of the current kernel lifetime (which
may be a recent boot due to a kernel upgrade), but also of events that
happened in the past months/years.

In other words, a database is needed to know if a memory error
was just a random issue due to high cosmic ray activity (so a
soft PPR would be used, just in case), or if it is due to some memory
region that is known to have had problems in the past, probably an
indication of a hardware issue, in which case a hard PPR would be
used instead.

If this were meant to be done automatically, CXL wouldn't need to send
events about that to the OSPM.

Also, different use cases may require different policies. So, it is
better to let a userspace daemon handle policies, and use sysfs for
such a daemon to set up the hardware.

Thanks,
Mauro


* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-06 12:10 ` [PATCH v18 04/19] EDAC: Add memory repair " shiju.jose
  2025-01-09  9:19   ` Borislav Petkov
  2025-01-14 11:47   ` Mauro Carvalho Chehab
@ 2025-01-14 13:47   ` Mauro Carvalho Chehab
  2025-01-14 14:30     ` Shiju Jose
  2 siblings, 1 reply; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-14 13:47 UTC (permalink / raw)
  To: shiju.jose
  Cc: linux-edac, linux-cxl, linux-acpi, linux-mm, linux-kernel, bp,
	tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

Em Mon, 6 Jan 2025 12:10:00 +0000
<shiju.jose@huawei.com> escreveu:

> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RO) Memory repair function type. For eg. post package repair,
> +		memory sparing etc.
> +		EDAC_SOFT_PPR - Soft post package repair
> +		EDAC_HARD_PPR - Hard post package repair
> +		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
> +		EDAC_ROW_MEM_SPARING - Row memory sparing
> +		EDAC_BANK_MEM_SPARING - Bank memory sparing
> +		EDAC_RANK_MEM_SPARING - Rank memory sparing
> +		All other values are reserved.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) Read/Write the current persist repair mode set for a
> +		repair function. Persist repair modes supported in the
> +		device, based on the memory repair function is temporary
> +		or permanent and is lost with a power cycle.
> +		EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
> +		EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
> +		All other values are reserved.
> +

After re-reading some things, I suspect that the above can be simplified
a little bit by folding soft/hard PPR into a single element at
/repair_function, and making it clearer that persist_mode is valid only
for PPR (I think this is the case, right?), e.g. something like:

	What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
	...
	Description:
			(RO) Memory repair function type, e.g. post package repair,
			memory sparing etc. Valid values are:

			- ppr - post package repair.
			  Please define its mode via
			  /sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
			- cacheline-sparing - Cacheline memory sparing
			- row-sparing - Row memory sparing
			- bank-sparing - Bank memory sparing
			- rank-sparing - Rank memory sparing
			- All other values are reserved.

and define persist_mode in a different way:

	What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/ppr_persist_mode
	...
	Description:
		(RW) Read/Write the current persist repair (PPR) mode set for a
		post package repair function. Depending on the persist mode
		supported by the device, the repair is either temporary (lost
		with a power cycle) or permanent. Valid values are:

		- repair-soft - Soft PPR function (temporary repair).
		- repair-hard - Hard memory repair function (permanent repair).
		- All other values are reserved.
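For reference, a small userspace sketch that validates the string values proposed above (the value list mirrors this proposal and is otherwise hypothetical):

```c
#include <assert.h>
#include <string.h>

/* Proposed repair_function values, with soft/hard PPR folded into "ppr". */
static const char *const repair_functions[] = {
	"ppr", "cacheline-sparing", "row-sparing", "bank-sparing", "rank-sparing",
};

/* Map a repair_function string read from sysfs to an index; -1 = reserved. */
static int repair_function_index(const char *s)
{
	for (unsigned int i = 0;
	     i < sizeof(repair_functions) / sizeof(repair_functions[0]); i++)
		if (!strcmp(s, repair_functions[i]))
			return (int)i;
	return -1;
}
```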

Thanks,
Mauro


* Re: [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers
  2025-01-13 15:36   ` Jonathan Cameron
@ 2025-01-14 14:06     ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-14 14:06 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-acpi, linux-mm, vishal.l.verma, shiju.jose, linux-edac,
	linux-cxl, linux-kernel, bp, tony.luck, rafael, lenb, mchehab,
	dan.j.williams, dave, dave.jiang, alison.schofield, ira.weiny,
	david, Vilas.Sridharan, leo.duran, Yazen.Ghannam, rientjes,
	jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi, james.morse,
	jthoughton, somasundaram.a, erdemaktas, pgonda, duenwen, gthelen,
	wschwartz, dferguson, wbs, nifan.cxl, tanxiaofei, prime.zeng,
	roberto.sassu, kangkang.shen, wanghuiqiang, linuxarm

Em Mon, 13 Jan 2025 15:36:39 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:

> > >    
> > > 5. CXL features driver supporting ECS control feature.
> > > 6. ACPI RAS2 driver adds OS interface for RAS2 communication through
> > >    PCC mailbox and extracts ACPI RAS2 feature table (RAS2) and
> > >    create platform device for the RAS memory features, which binds
> > >    to the memory ACPI RAS2 driver.
> > > 7. Memory ACPI RAS2 driver gets the PCC subspace for communicating
> > >    with the ACPI compliant platform supports ACPI RAS2. Add callback
> > >    functions and registers with EDAC device to support user to
> > >    control the HW patrol scrubbers exposed to the kernel via the
> > >    ACPI RAS2 table.
> > > 8. Support for CXL maintenance mailbox command, which is used by
> > >    CXL device memory repair feature.   
> > > 9. CXL features driver supporting PPR control feature.
> > > 10. CXL features driver supporting memory sparing control feature.
> > >     Note: There are other PPR, memory sparing drivers to come.    
> > 
> > The text above should be inside Documentation, and not on patch 0.
> > 
> > A big description like that makes hard to review this series. It is
> > also easier to review the text after having it parsed by kernel doc
> > build specially for summary tables like the "Comparison of scrubbing 
> > features", which deserves ReST links processed by Sphinx to the 
> > corresponding definitions of the terms that are be compared there.  
> 
> Whilst I fully agree that having a huge cover letter makes for a burden
> for any reviewer coming to the series, this is here at specific request
> of reviewers. 

Ok, then. Yet, even for them it would be very hard to track what
changed from v19 to the next versions if you change something in
patch 00.

> We can look at keeping more of it in documentation though
> it's a bit white paper like in comparison with what I'd normally expect
> to see in kernel documentation.

Personally, I like having comprehensive documentation in the kernel.

> >   
> > > Open Questions based on feedbacks from the community:
> > > 1. Leo: Standardize unit for scrub rate, for example ACPI RAS2 does not define
> > >    unit for the scrub rate. RAS2 clarification needed.     
> > 
> > I noticed the same when reviewing a patch series for rasdaemon. Ideally,
> > ACPI requires an errata defining what units are expected for scrub rate.  
> 
> There is a code first ACPI ECN that indeed adds units.  That is accepted
> for next ACPI specification release.
> 
> Seems the tianocore bugzilla is unhelpfully down for a migration
> but it should be id 1013 at bugzilla.tianocore.com
> 
> That adds a detailed description of what the scrub rate settings mean but
> we may well still have older platforms where the scaling is arbitrary.
> The units defined are sufficient to map to whatever presentation we like.
>
> > While ACPI doesn't define it, better to not add support for it - or be
> > conservative using a low granularity for it (like using minutes instead 
> > of hours).  
> 
> I don't mind changing this, though for systems we are aware of default scrub
> is typically once or twice in 24 hours.

Yes, I noticed that we're using seconds after reading other patches.
It sounds OK to me to keep it as-is. 

It is really unlikely that we would ever have scrubbing finishing in less
than a second.

Thanks,
Mauro


* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-14 12:31     ` Shiju Jose
@ 2025-01-14 14:26       ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-14 14:26 UTC (permalink / raw)
  To: Shiju Jose
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, bp@alien8.de, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

Em Tue, 14 Jan 2025 12:31:44 +0000
Shiju Jose <shiju.jose@huawei.com> escreveu:

> Hi Mauro,
> 
> Thanks for the comments.
> 
> >-----Original Message-----
> >From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> >Sent: 14 January 2025 11:48
> >To: Shiju Jose <shiju.jose@huawei.com>
> >Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
> >
> >Em Mon, 6 Jan 2025 12:10:00 +0000
> ><shiju.jose@huawei.com> escreveu:
> >  
> >> From: Shiju Jose <shiju.jose@huawei.com>
> >>
> >> Add a generic EDAC memory repair control driver to manage memory repairs
> >> in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing
> >> features.
> >>
> >> For example, a CXL device with DRAM components that support PPR features
> >> may implement PPR maintenance operations. DRAM components may support  
> >two  
> >> types of PPR, hard PPR, for a permanent row repair, and soft PPR,  for a
> >> temporary row repair. Soft PPR is much faster than hard PPR, but the repair
> >> is lost with a power cycle.
> >> Similarly a CXL memory device may support soft and hard memory sparing at
> >> cacheline, row, bank and rank granularities. Memory sparing is defined as
> >> a repair function that replaces a portion of memory with a portion of
> >> functional memory at that same granularity.
> >> When a CXL device detects an error in a memory, it may report the host of
> >> the need for a repair maintenance operation by using an event record where
> >> the "maintenance needed" flag is set. The event records contains the device
> >> physical address(DPA) and other attributes of the memory to repair (such as
> >> channel, sub-channel, bank group, bank, rank, row, column etc). The kernel
> >> will report the corresponding CXL general media or DRAM trace event to
> >> userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
> >> operation in response to the device request via the sysfs repair control.
> >>
> >> Device with memory repair features registers with EDAC device driver,
> >> which retrieves memory repair descriptor from EDAC memory repair driver
> >> and exposes the sysfs repair control attributes to userspace in
> >> /sys/bus/edac/devices/<dev-name>/mem_repairX/.
> >>
> >> The common memory repair control interface abstracts the control of
> >> arbitrary memory repair functionality into a standardized set of functions.
> >> The sysfs memory repair attribute nodes are only available if the client
> >> driver has implemented the corresponding attribute callback function and
> >> provided operations to the EDAC device driver during registration.
> >>
> >> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> >> ---
> >>  .../ABI/testing/sysfs-edac-memory-repair      | 244 +++++++++
> >>  Documentation/edac/features.rst               |   3 +
> >>  Documentation/edac/index.rst                  |   1 +
> >>  Documentation/edac/memory_repair.rst          | 101 ++++
> >>  drivers/edac/Makefile                         |   2 +-
> >>  drivers/edac/edac_device.c                    |  33 ++
> >>  drivers/edac/mem_repair.c                     | 492 ++++++++++++++++++
> >>  include/linux/edac.h                          | 139 +++++
> >>  8 files changed, 1014 insertions(+), 1 deletion(-)
> >>  create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
> >>  create mode 100644 Documentation/edac/memory_repair.rst
> >>  create mode 100755 drivers/edac/mem_repair.c
> >>
> >> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair  
> >b/Documentation/ABI/testing/sysfs-edac-memory-repair  
> >> new file mode 100644
> >> index 000000000000..e9268f3780ed
> >> --- /dev/null
> >> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
> >> @@ -0,0 +1,244 @@
> >> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX
> >> +Date:		Jan 2025
> >> +KernelVersion:	6.14
> >> +Contact:	linux-edac@vger.kernel.org
> >> +Description:
> >> +		The sysfs EDAC bus devices /<dev-name>/mem_repairX  
> >subdirectory  
> >> +		pertains to the memory media repair features control, such as
> >> +		PPR (Post Package Repair), memory sparing etc, where<dev-
> >name>
> >> +		directory corresponds to a device registered with the EDAC
> >> +		device driver for the memory repair features.
> >> +
> >> +		Post Package Repair is a maintenance operation requests the  
> >memory  
> >> +		device to perform a repair operation on its media, in detail is a
> >> +		memory self-healing feature that fixes a failing memory  
> >location by  
> >> +		replacing it with a spare row in a DRAM device. For example, a
> >> +		CXL memory device with DRAM components that support PPR  
> >features may  
> >> +		implement PPR maintenance operations. DRAM components  
> >may support  
> >> +		two types of PPR functions: hard PPR, for a permanent row  
> >repair, and  
> >> +		soft PPR, for a temporary row repair. soft PPR is much faster  
> >than  
> >> +		hard PPR, but the repair is lost with a power cycle.
> >> +
> >> +		Memory sparing is a repair function that replaces a portion
> >> +		of memory with a portion of functional memory at that same
> >> +		sparing granularity. Memory sparing has  
> >cacheline/row/bank/rank  
> >> +		sparing granularities. For example, in memory-sparing mode,
> >> +		one memory rank serves as a spare for other ranks on the same
> >> +		channel in case they fail. The spare rank is held in reserve and
> >> +		not used as active memory until a failure is indicated, with
> >> +		reserved capacity subtracted from the total available memory
> >> +		in the system.The DIMM installation order for memory sparing
> >> +		varies based on the number of processors and memory modules
> >> +		installed in the server. After an error threshold is surpassed
> >> +		in a system protected by memory sparing, the content of a  
> >failing  
> >> +		rank of DIMMs is copied to the spare rank. The failing rank is
> >> +		then taken offline and the spare rank placed online for use as
> >> +		active memory in place of the failed rank.
> >> +
> >> +		The sysfs attributes nodes for a repair feature are only
> >> +		present if the parent driver has implemented the corresponding
> >> +		attr callback function and provided the necessary operations
> >> +		to the EDAC device driver during registration.
> >> +
> >> +		In some states of system configuration (e.g. before address
> >> +		decoders have been configured), memory devices (e.g. CXL)
> >> +		may not have an active mapping in the main host address
> >> +		physical address map. As such, the memory to repair must be
> >> +		identified by a device specific physical addressing scheme
> >> +		using a device physical address(DPA). The DPA and other control
> >> +		attributes to use will be presented in related error records.
> >> +
> >> +What:		/sys/bus/edac/devices/<dev-
> >name>/mem_repairX/repair_function
> >> +Date:		Jan 2025
> >> +KernelVersion:	6.14
> >> +Contact:	linux-edac@vger.kernel.org
> >> +Description:
> >> +		(RO) Memory repair function type. For eg. post package repair,
> >> +		memory sparing etc.
> >> +		EDAC_SOFT_PPR - Soft post package repair
> >> +		EDAC_HARD_PPR - Hard post package repair
> >> +		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
> >> +		EDAC_ROW_MEM_SPARING - Row memory sparing
> >> +		EDAC_BANK_MEM_SPARING - Bank memory sparing
> >> +		EDAC_RANK_MEM_SPARING - Rank memory sparing
> >> +		All other values are reserved.  
> >
> >Too big strings. Why are them in upper cases? IMO:
> >
> >	soft-ppr, hard-ppr, ... would be enough.
> >  
> Here the repair type (a single value, such as 0, 1, or 2, not a decoded string like "EDAC_SOFT_PPR")
> of the memory repair instance is returned; the values are defined as enums (EDAC_SOFT_PPR, EDAC_HARD_PPR, ... etc)
> for the memory repair interface in include/linux/edac.h.
> 
> enum edac_mem_repair_function {
> 	EDAC_SOFT_PPR,
> 	EDAC_HARD_PPR,
> 	EDAC_CACHELINE_MEM_SPARING,
> 	EDAC_ROW_MEM_SPARING,
> 	EDAC_BANK_MEM_SPARING,
> 	EDAC_RANK_MEM_SPARING,
> };
>   
> I documented return value in terms of the above enums.

The ABI documentation describes exactly what numeric/string values will be there.
So, if you place:

	EDAC_SOFT_PPR

It means a string with EDAC_SOFT_PPR, not a numeric zero value.

Also, as I explained at:
	 https://lore.kernel.org/linux-edac/1bf421f9d1924d68860d08c70829a705@huawei.com/T/#m1e60da13198b47701a4c2f740d4b78701f912d2d

it doesn't make sense to report soft/hard PPR, as the persist mode
is designed to be on a different sysfs devnode (/persist_mode on your
proposal).

So, here you need to fold EDAC_SOFT_PPR and EDAC_HARD_PPR into a single
value ("ppr").
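The folding can be sketched as a simple mapping (userspace C for illustration; the string names follow the proposal above and are otherwise hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The enum proposed in the patch. */
enum edac_mem_repair_function {
	EDAC_SOFT_PPR, EDAC_HARD_PPR, EDAC_CACHELINE_MEM_SPARING,
	EDAC_ROW_MEM_SPARING, EDAC_BANK_MEM_SPARING, EDAC_RANK_MEM_SPARING,
};

struct folded {
	const char *function;  /* value for /repair_function       */
	const char *persist;   /* value for /persist_mode, or NULL */
};

/* Fold soft/hard PPR into one "ppr" function plus a persist mode string. */
static struct folded fold_repair(enum edac_mem_repair_function r)
{
	switch (r) {
	case EDAC_SOFT_PPR: return (struct folded){ "ppr", "repair-soft" };
	case EDAC_HARD_PPR: return (struct folded){ "ppr", "repair-hard" };
	case EDAC_CACHELINE_MEM_SPARING:
		return (struct folded){ "cacheline-sparing", NULL };
	case EDAC_ROW_MEM_SPARING:
		return (struct folded){ "row-sparing", NULL };
	case EDAC_BANK_MEM_SPARING:
		return (struct folded){ "bank-sparing", NULL };
	default:
		return (struct folded){ "rank-sparing", NULL };
	}
}
```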

-

Btw, very few sysfs nodes use numbers for things that can be mapped with 
enums:

	$ git grep -l "\- 0" Documentation/ABI|wc -l
	20
	(several of those are actually false-positives)

and this is done mostly when it reports what the hardware actually
outputs when reading some register.

Thanks,
Mauro


* RE: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-14 13:47   ` Mauro Carvalho Chehab
@ 2025-01-14 14:30     ` Shiju Jose
  2025-01-15 12:03       ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 87+ messages in thread
From: Shiju Jose @ 2025-01-14 14:30 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, bp@alien8.de, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

>-----Original Message-----
>From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
>Sent: 14 January 2025 13:47
>To: Shiju Jose <shiju.jose@huawei.com>
>Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
>
>Em Mon, 6 Jan 2025 12:10:00 +0000
><shiju.jose@huawei.com> escreveu:
>
>> +What:		/sys/bus/edac/devices/<dev-
>name>/mem_repairX/repair_function
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RO) Memory repair function type. For eg. post package repair,
>> +		memory sparing etc.
>> +		EDAC_SOFT_PPR - Soft post package repair
>> +		EDAC_HARD_PPR - Hard post package repair
>> +		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
>> +		EDAC_ROW_MEM_SPARING - Row memory sparing
>> +		EDAC_BANK_MEM_SPARING - Bank memory sparing
>> +		EDAC_RANK_MEM_SPARING - Rank memory sparing
>> +		All other values are reserved.
>> +
>> +What:		/sys/bus/edac/devices/<dev-
>name>/mem_repairX/persist_mode
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RW) Read/Write the current persist repair mode set for a
>> +		repair function. Persist repair modes supported in the
>> +		device, based on the memory repair function is temporary
>> +		or permanent and is lost with a power cycle.
>> +		EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary
>repair).
>> +		EDAC_MEM_REPAIR_HARD - Hard memory repair function
>(permanent repair).
>> +		All other values are reserved.
>> +
>
>After re-reading some things, I suspect that the above can be simplified a little
>bit by folding soft/hard PPR into a single element at /repair_function, and letting
>it clearer that persist_mode is valid only for PPR (I think this is the case, right?),
>e.g. something like:
persist_mode is valid for memory sparing features (at least in CXL) as well.
In the case of CXL memory sparing, the host has the option to request either soft or hard sparing
via a flag when issuing a memory sparing operation.

>
>	What:		/sys/bus/edac/devices/<dev-
>name>/mem_repairX/repair_function
>	...
>	Description:
>			(RO) Memory repair function type. For e.g. post
>package repair,
>			memory sparing etc. Valid values are:
>
>			- ppr - post package repair.
>			  Please define its mode via
>			  /sys/bus/edac/devices/<dev-
>name>/mem_repairX/persist_mode
>			- cacheline-sparing - Cacheline memory sparing
>			- row-sparing - Row memory sparing
>			- bank-sparing - Bank memory sparing
>			- rank-sparing - Rank memory sparing
>			- All other values are reserved.
>
>and define persist_mode in a different way:
Note: To return decoded strings instead of raw values, I need to add some extra callback functions
in edac/mem_repair.c for these attributes, which will reduce the current level of optimization done to
minimize the code size.
>
>	What:		/sys/bus/edac/devices/<dev-
>name>/mem_repairX/ppr_persist_mode
Same as above. persist_mode is needed for the memory sparing feature too.
>	...
>	Description:
>		(RW) Read/Write the current persist repair (PPR) mode set for a
>		post package repair function. Persist repair modes supported
>		in the device, based on the memory repair function is
>temporary
>		or permanent and is lost with a power cycle. Valid values are:
>
>		- repair-soft - Soft PPR function (temporary repair).
>		- repair-hard - Hard memory repair function (permanent
>repair).
>		- All other values are reserved.
>
>Thanks,
>Mauro

Thanks,
Shiju



* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-14 13:05             ` Jonathan Cameron
@ 2025-01-14 14:39               ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-14 14:39 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Borislav Petkov, Shiju Jose, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	tony.luck@intel.com, rafael@kernel.org, lenb@kernel.org,
	mchehab@kernel.org, dan.j.williams@intel.com, dave@stgolabs.net,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

Em Tue, 14 Jan 2025 13:05:37 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:

> On Tue, 14 Jan 2025 13:38:31 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > Em Thu, 9 Jan 2025 14:24:33 +0000
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:
> >   
> > > On Thu, 9 Jan 2025 13:32:22 +0100
> > > Borislav Petkov <bp@alien8.de> wrote:
> > > 
> > > Hi Boris,
> > >     
> > > > On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote:      
> > > > > The min_ and max_ attributes of the control attributes are added  for your
> > > > > feedback on V15 to expose supported ranges of these control attributes to the user, 
> > > > > in the following links.          
> > > > 
> > > > Sure, but you can make that differently:
> > > > 
> > > > cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
> > > > [x:y]
> > > > 
> > > > which is the allowed range.      
> > > 
> > > To my thinking that would fail the test of being an intuitive interface.
> > > To issue a repair command requires that multiple attributes be configured
> > > before triggering the actual repair.
> > > 
> > > Think of it as setting the coordinates of the repair in a high dimensional
> > > space.
> > > 
> > > In the extreme case of fine grained repair (Cacheline), to identify the
> > > relevant subunit of memory (obtained from the error record that we are
> > > basing the decision to repair on) we need to specify all of:
> > > 
> > > Channel, sub-channel, rank,  bank group, row, column and nibble mask.
> > > For coarser granularity repair only specify a subset of these applies and
> > > only the relevant controls are exposed to userspace.
> > > 
> > > They are broken out as specific attributes to enable each to be set before
> > > triggering the action with a write to the repair attribute.
> > > 
> > > There are several possible alternatives:
> > > 
> > > Option 1
> > > 
> > > "A:B:C:D:E:F:G:H:I:J" opaque single write to trigger the repair where
> > > each number is providing one of those coordinates and where a readback
> > > let's us known what each number is.
> > > 
> > > That single attribute interface is very hard to extend in an intuitive way.
> > > 
> > > History tell us more levels will be introduced in the middle, not just
> > > at the finest granularity, making such an interface hard to extend in
> > > a backwards compatible way.
> > > 
> > > Another alternative of a key value list would make for a nasty sysfs
> > > interface.
> > > 
> > > Option 2 
> > > There are sysfs interfaces that use a selection type presentation.
> > > 
> > > Write: "C", Read: "A, B, [C], D" but that only works well for discrete sets
> > > of options and is a pain to parse if read back is necessary.    
> > 
> > Writing it as:
> > 
> > 	a b [c] d
> > 
> > or even:
> > 	a, b, [c], d
> > 
> > doesn't make it hard to parse in userspace. Adding a comma makes
> > the kernel code a little bigger, as it needs an extra check in the loop
> > to see if the line is empty or not:
> > 
> > 	if (*tmp != '\0')
> > 		tmp += sprintf(tmp, ", ");
> > 
> > Btwm we have an implementation like that on kernelspace/userspace for
> > the RC API:
> > 
> > - Kernelspace:
> >   https://github.com/torvalds/linux/blob/master/drivers/media/rc/rc-main.c#L1125
> >   6 lines of code + a const table with names/values, if we use the same example
> >   for EDAC:
> > 
> > 	const struct { u32 type; const char *name; } names[] = {
> > 		{ EDAC_FOO, "foo" }, { EDAC_BAR, "bar" },
> > 	};
> > 
> > 	for (i = 0; i < ARRAY_SIZE(names); i++) {
> > 		if (enabled & names[i].type)
> > 			tmp += sprintf(tmp, "[%s] ", names[i].name);
> > 		else if (allowed & names[i].type)
> > 			tmp += sprintf(tmp, "%s ", names[i].name);
> > 	}
> > 
> > 
> > - Userspace:
> >   https://git.linuxtv.org/v4l-utils.git/tree/utils/keytable/keytable.c#n197
> >   5 lines of code + a const table, if we use the same example
> >   for ras-daemon:
> > 
> > 		const char *name[] = { 
> > 			[EDAC_FOO] = "[foo]",
> > 			[EDAC_BAR] = "[bar]",
> > 		};
> > 
> > 		for (p = strtok(arg, " ,"); p; p = strtok(NULL, " ,"))
> > 			for (i = 0; i < ARRAY_SIZE(name); i++)
> > 				if (!strcasecmp(p, name[i]))
> > 					return i;
> > 		return -1;
> > 
> > 	(strtok handles both space and commas at the above example)
> > 
> > IMO, this is a lot better, as the alternative would be to have separate
> > sysfs nodes to describe what values are valid for a given edac devnode.
> > 
> > See, userspace needs to know what values are valid for a given
> > device and support for it may vary depending on the Kernel and
> > device version. So, we need to have the information about what values
> > are valid stored on some sysfs devnode, to allow backward compatibility.  
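A compilable userspace version of the pattern sketched in the snippets above (names hypothetical): format the "a [b] c" presentation and parse the bracketed, currently enabled entry back out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Emit every allowed value, with the currently enabled one in brackets. */
static int format_selection(char *buf, size_t len, const char *const *names,
			    size_t n, size_t enabled)
{
	size_t pos = 0;

	for (size_t i = 0; i < n && pos < len; i++)
		pos += snprintf(buf + pos, len - pos, "%s%s%s%s",
				i == enabled ? "[" : "", names[i],
				i == enabled ? "]" : "",
				i + 1 < n ? " " : "");
	return (int)pos;
}

/* Find the bracketed entry in an "a [b] c" string; -1 if none is enabled. */
static int parse_selection(const char *s, const char *const *names, size_t n)
{
	char tmp[256], *p;

	snprintf(tmp, sizeof(tmp), "%s", s);
	for (p = strtok(tmp, " ,"); p; p = strtok(NULL, " ,"))
		if (p[0] == '[')
			for (size_t i = 0; i < n; i++)
				if (!strncmp(p + 1, names[i], strlen(names[i])) &&
				    p[1 + strlen(names[i])] == ']')
					return (int)i;
	return -1;
}
```

The same pair of helpers works whether the values are separated by spaces or commas, since strtok accepts both delimiters.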
> 
> These aren't selectors from a discrete list so the question is more
> whether a syntax of
> <min> value <max> 
> is intuitive or not.  I'm not aware of precedence for this one.

From my side, I prefer having 3 separate sysfs nodes, as this is a
very common practice. Doing it in a different way sounds like an API
violation, but if someone insists on dropping min/max, this can be
argued at https://lore.kernel.org/linux-api/.

On a very quick search:

	$ ./scripts/get_abi.pl search "\bmin.*max"

I can't see any place using min and max in the same devnode.

	$ ./scripts/get_abi.pl search "\b(min|max)"|grep /sys/ |wc -l
	234

So, it sounds to me that merging those into a single devnode is an
API violation.
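Under that convention, the userspace check is trivial (sketch; the absence of a separate min or max file would mean the kernel provided no range information):

```c
#include <assert.h>

/*
 * Validate a value to be written against optional limits read from separate
 * min/max sysfs nodes; have_min/have_max are 0 when the file is absent.
 */
static int value_in_range(long v, int have_min, long min, int have_max, long max)
{
	if (have_min && v < min)
		return 0;
	if (have_max && v > max)
		return 0;
	return 1;
}
```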

> 
> There was another branch of the thread where Boris mentioned this as an
> option. It isn't bad to deal with and an easy change to the code,
> but I have an open question on what choice we make for representing
> unknown min / max.  For separate files the absence of the file
> indicates we don't have any information.
> 
> 
> >   
> > > 
> > > So in conclusion, I think the proposed multiple sysfs attribute style
> > > with them reading back the most recent value written is the least bad
> > > solution to a complex control interface.
> > >     
> > > > 
> > > > echo ... 
> > > > 
> > > > then writes in the bank.
> > > >       
> > > > > ... so we would propose we do not add max_ and min_ for now and see how the
> > > > > use cases evolve.        
> > > > 
> > > > Yes, you should apply that same methodology to the rest of the new features
> > > > you're adding: only add functionality for the stuff that is actually being
> > > > used now. You can always extend it later.
> > > > 
> > > > Changing an already user-visible API is a whole different story and a lot lot
> > > > harder, even impossible.
> > > > 
> > > > So I'd suggest you prune the EDAC patches from all the hypothetical usage and
> > > > then send only what remains so that I can try to queue them.      
> > > 
> > > Sure. In this case the addition of min/max was perhaps a wrong response to
> > > your request for a way to those ranges rather than just rejecting a write
> > > of something out of range as earlier version did.
> > > 
> > > We can revisit in future if range discovery becomes necessary.  Personally
> > > I don't think it is given we are only taking these actions in response error
> > > records that give us precisely what to write and hence are always in range.    
> > 
> > For RO devnodes, there's no need for ranges, but those are likely needed for
> > RW, as otherwise userspace may try to write invalid requests and/or have
> > backward-compatibility issues.  
> 
> Given these parameters are only meaningfully written with values coming
> ultimately from error records, userspace should never consider writing
> something that is out of range except during testing.
> 
> I don't mind presenting the range where known (in CXL case it is not
> discoverable for most of them) but I wouldn't expect tooling to ever
> read it as known correct values to write come from the error records.
> Checking those values against provided limits seems an unnecessary step
> given an invalid parameter that slips through will be rejected by the
> hardware anyway.

I'm fine starting without min/max if there's no current usecase, provided
that:

1. when needed, we add min/max as separate devnodes;
2. there won't be any backward-compatibility issues when min/max gets added.

Regards,
Mauro

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-13 11:40                         ` Jonathan Cameron
@ 2025-01-14 19:35                           ` Dan Williams
  2025-01-15 10:07                             ` Jonathan Cameron
  2025-01-15 11:35                             ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 87+ messages in thread
From: Dan Williams @ 2025-01-14 19:35 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams
  Cc: Borislav Petkov, Shiju Jose, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	tony.luck@intel.com, rafael@kernel.org, lenb@kernel.org,
	mchehab@kernel.org, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

Jonathan Cameron wrote:
> On Fri, 10 Jan 2025 14:49:03 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
[..]
> > This is where you lose me. The error record is a point in time snapshot
> > of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security
> > model for memory operations is based on coordinating with the kernel's
> > understanding of how that SPA is *currently* being used.
> 
> Whilst it is being used I agree.  Key is to only do disruptive / data
> changing actions when it is not being used.

Sure.

> > The kernel can not just take userspace's word for it that potentially
> > data changing, or temporary loss-of-use operations are safe to execute
> > just because once upon a time userspace saw an error record that
> > implicated a given SPA in the past, especially over reboot. 
> 
> There are two cases (discoverable from hardware)
> 
> 1) Non disruptive.  No security concern as the device guarantees to
> >    not interrupt traffic and the memory contents are copied to the new
>    location. Basically software never knows it happened.
> 2) Disruptive.  We only allow this if the memory is offline. In the CXL case
>    the CXL specific code must check no memory on the device is online so
>    we aren't disrupting anything.  The other implementation we have code
>    for (will post after this lands) has finer granularity constraints and only
>    the page needs to be offline.
>    As it is offline the content is not preserved anyway. We may need to add extra
>    constraints along with future support for temporal persistence / sharing but
>    we can do that as part of adding that support in general.
>    (Personally I think in those cases memory repair is a job for the out of
>     band management anyway).
> 
> In neither case am I seeing a security concern.  Am I missing something?

s/security/system-integrity/

1/ Hardware engineers may have a different definition of "non-disruptive"
than software. See the history around hibernate_quiet_exec() to work
around the disruption of latency spikes. If this is poorly specified
across vendors we are going to wish that we did not build a "take
userspace's word for it" interface.

2/ Yes, if the device is not actively decoding any in use memory feel
free to run destructive operations on the device. However, is sysfs the
right interface for "submit multi-parameter atomic operation with
transient result"? I lean heavily into sysfs, but ioctl and netlink have
a role to play in scenarios like this. Otherwise userspace can inject
error records back into the kernel with the expectation that the kernel
can only accept the DIMM address and not any of the translation data
which might be stale.

[..]
> > Again, the repair control assumes that the kernel can just trust
> > userspace to get it right. When the kernel knows the SPA implications it
> > can add safety like "you are going to issue sparing on deviceA that will
> > temporarily take deviceA offline. CXL subsystem tells me deviceA is
> > interleaved with deviceB in SPA so the whole SPA range needs to be
> > offline before this operation proceeds". That is not something that
> > userspace can reliably coordinate.
> 
> Absolutely the kernel has to enforce this. Same way we protect against
> poison injection in some cases.  Right now the enforcement is slightly
> wrong (Shiju is looking at it again) as we were enforcing at wrong
> granularity (specific dpa, not device). Identifying that hole is a good
> outcome of this discussion making us take another look.
> 
> Enforcing this is one of the key jobs of the CXL specific driver.
> We considered doing it in the core, but the granularity differences
> between our initial few examples meant we decided on specific driver
> implementations of the checks for now.

Which specific driver? Is this not just a callback provided via the EDAC
registration interface to say "sparing allowed"?

Yes, this needs to avoid the midlayer mistake, but I expect more CXL
memory exporting devices can live with the CXL core's determination that
HDM decode is live or not.

> > > > 3/ What if the device does not use DDR terminology / topology terms for
> > > > repair?  
> > > 
> > > Then we provide the additional interfaces assuming they correspond to well
> > > known terms.  If someone is using a magic key then we can get grumpy
> > > with them, but that can also be supported.
> > > 
> > > Mostly I'd expect a new technology to overlap a lot of the existing
> > > interface and maybe add one or two more; which layer in the stack for
> > > HBM for instance.  
> > 
> > The concern is the assertion that sysfs needs to care about all these
> > parameters vs an ABI that says "repair errorX". If persistence and
> > validity of error records is the concern lets build an ABI for that and
> > not depend upon trust in userspace to properly coordinate memory
> > integrity concerns.
> 
> It doesn't have to.  It just has to ensure that the memory device is in the correct
> state.  So check, not coordinate. At a larger scale, coordination is already doable
> (subject to races that we must avoid by holding locks), tear down the regions
> so there are no mappings on the device you want to repair.  Don't bring them
> up again until after you are done.
> 
> The main use case is probably do it before you bring the mappings up, but
> same result.

Agree.

> 
> > 
> > > 
> > > The main alternative is where the device takes an HPA / SPA / DPA. We have one
> > > driver that does that queued up behind this series that uses HPA. PPR uses
> > > DPA.  In that case userspace first tries to see if it can repair by HPA then
> > > DPA and if not moves on to see if it can use the fuller description.
> > > We will see devices supporting HPA / DPA (which to use depends on when you
> > > are doing the operation and what has been configured) but mostly I'd expect
> > > either HPA/DPA or fine grained on a given repair instance.
> > > 
> > > HPA only works if the address decoders are always configured (so not on CXL)
> > > What is actually happening in that case is typically that a firmware is
> > > involved that can look up address decoders etc, and map the control HPA
> > > to Bank / row etc to issue the actual low level commands.  This keeps
> > > the memory mapping completely secret rather than exposing it in error
> > > records.
> > >   
> > > > 
> > > > I expect the flow rasdaemon would want is that the current PFA (leaky
> > > > bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> > > > has performed exceeds some threshold and it wants to attempt to repair
> > > > memory.  
> > > 
> > > Sparing may happen prior to point where we'd have done a soft offline
> > > if non disruptive (whether it is can be read from another bit of the
> > > ABI).  Memory repair might be much less disruptive than soft-offline!
> > > I rather hope memory manufacturers build that, but I'm aware of at least
> > > one case where they didn't and the memory must be offline.  
> > 
> > That's a good point, spare before offline makes sense.
> 
> If transparent and resources not constrained.
> Very much not if we have to tear down the memory first.
> 
> > 
> > [..]
> > > However, there are other usecases where this isn't needed which is why
> > > that isn't a precursor for this series.
> > > 
> > > Initial enablement targets two situations:
> > > 1) Repair can be done in non disruptive way - no need to soft offline at all.  
> > 
> > Modulo needing to quiesce access over the sparing event?
> 
> Absolutely.  This is only doable in devices that don't need to quiesce.
> 
> > 
> > > 2) Repair can be done at boot before memory is onlined or on admin
> > >    action to take the whole region offline, then repair specific chunks of
> > >    memory before bringing it back online.  
> > 
> > Which is userspace racing the kernel to online memory?
> 
> If you are doing this scheme you don't automatically online memory. So
> both are in userspace control and can be easily sequenced.
> If you aren't auto onlining then buy devices with hard PPR and do it by offlining
> manually, repairing and rebooting. Or buy devices that don't need to quiesce
> and cross your fingers the dodgy ram doesn't throw an error before you get
> that far.  Little choice if you decide to online right at the start as normal
> memory.
> 
> > 
> > > > So, yes, +1 to simpler for now where software effectively just needs to
> > > > deal with a handful of "region repair" buttons and the semantics of
> > > > those are coarse and sub-optimal. Wait for a future where a tool author
> > > > says, "we have had good success getting bulk offlined pages back into
> > > > service, but now we need this specific finer grained kernel interface to
> > > > avoid wasting spare banks prematurely".  
> > > 
> > > Depends on where you think that interface is.  I can absolutely see that
> > > as a control to RAS Daemon.  Option 2 above, region is offline, repair
> > > all dodgy looking fine grained buckets.
> > > 
> > > Note though that a suboptimal repair may mean permanent use of very rare
> > > resources.  So there needs to be a control at the finest granularity as well.
> > > Which order those get added to userspace tools doesn't matter to me.
> > > 
> > > If you mean that interface in kernel it brings some non trivial requirements.
> > > The kernel would need all of:
> > > 1) Tracking interface for all error records so the reverse map from region
> > >    to specific bank / row etc is available for a subset of entries.  The
> > >    kernel would need to know which of those are important (soft offline
> > >    might help in that use case, otherwise that means decision algorithms
> > >    are in kernel or we have fine grained queue for region repair in parallel
> > >    with soft-offline).
> > > 2) A way to inject the reverse map information from a userspace store
> > >   (to deal with reboot etc).  
> > 
> > Not a way to inject the reverse map information, a way to inject the
> > error records and assert that memory topology changes have not
> > invalidated those records.
> 
> There is no way to tell that the topology hasn't changed.
> For the reasons above, I don't think we care. Instead of trying to stop
> userspace repairing the wrong memory, make sure it is safe for it to do that.
> (The kernel is rarely in the business of preventing the slightly stupid)

If the policy is "error records with SPA from the current boot can be
trusted" and "userspace requests outside of current boot error records
must only be submitted to known offline" then I think we are aligned.

> > > That sounds a lot harder to deal with than relying on the userspace program
> > > that already does the tracking across boots.  
> > 
> > I am stuck behind the barrier of userspace must not assume it knows
> > better than the kernel about the SPA impact of a DIMM sparing
> > event. The kernel needs evidence either live records from within the
> > same kernel boot or validated records from a previous boot.
> 
> I think this is the wrong approach.  The operation must be 'safe'.
> With that in place we absolutely can let userspace assume it knows better than
> the kernel. 

Violent agreement? Operation must be safe, yes, next what are the criteria
for kernel management of safety. Offline-only repair is great place to
be.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-14 19:35                           ` Dan Williams
@ 2025-01-15 10:07                             ` Jonathan Cameron
  2025-01-15 11:35                             ` Mauro Carvalho Chehab
  1 sibling, 0 replies; 87+ messages in thread
From: Jonathan Cameron @ 2025-01-15 10:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Borislav Petkov, Shiju Jose, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	tony.luck@intel.com, rafael@kernel.org, lenb@kernel.org,
	mchehab@kernel.org, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Tue, 14 Jan 2025 11:35:21 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > On Fri, 10 Jan 2025 14:49:03 -0800
> > Dan Williams <dan.j.williams@intel.com> wrote:  
> [..]
> > > This is where you lose me. The error record is a point in time snapshot
> > > of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security
> > > model for memory operations is based on coordinating with the kernel's
> > > understanding of how that SPA is *currently* being used.  
> > 
> > Whilst it is being used I agree.  Key is to only do disruptive / data
> > changing actions when it is not being used.  
> 
> Sure.
> 
> > > The kernel can not just take userspace's word for it that potentially
> > > data changing, or temporary loss-of-use operations are safe to execute
> > > just because once upon a time userspace saw an error record that
> > > implicated a given SPA in the past, especially over reboot.   
> > 
> > There are two cases (discoverable from hardware)
> > 
> > 1) Non disruptive.  No security concern as the device guarantees to
> >    not interrupt traffic and the memory contents are copied to the new
> >    location. Basically software never knows it happened.
> > 2) Disruptive.  We only allow this if the memory is offline. In the CXL case
> >    the CXL specific code must check no memory on the device is online so
> >    we aren't disrupting anything.  The other implementation we have code
> >    for (will post after this lands) has finer granularity constraints and only
> >    the page needs to be offline.
> >    As it is offline the content is not preserved anyway. We may need to add extra
> >    constraints along with future support for temporal persistence / sharing but
> >    we can do that as part of adding that support in general.
> >    (Personally I think in those cases memory repair is a job for the out of
> >     band management anyway).
> > 
> > In neither case am I seeing a security concern.  Am I missing something?  
> 
> s/security/system-integrity/
> 
> 1/ Hardware engineers may have a different definition of "non-disruptive"
> than software. See the history around hibernate_quiet_exec() to work
> around the disruption of latency spikes. If this is poorly specified
> across vendors we are going to wish that we did not build a "take
> userspace's word for it" interface.

Sure, but given the spec is fairly specific "CXL.mem requests are correctly
preserved and data is retained" they will have to stay with latency
requirements of the CXL system.  If a device breaks these admittedly soft
rules then we can add allow / deny lists or let userspace tooling handle
that (which I'd prefer).  Note that this stuff happens on some devices
under the hood anyway, so I'd not expect the command driven case to be
much worse (but I accept it might be).

I'd also expect some very grumpy large purchasers of that device to get
the firmware changed to not advertise that it is fine to do it live if
the performance drop is large.

Note this happens today under the hood on at least some (I think most?)
servers.  Have you ever seen a latency bug report from memory sparing?
How often?  The memory and controllers on CXL devices are going to be very
similar in technology to host memory controllers so I can't see why they'd
be much worse (subject of course to cheap and nasty maybe turning up - I doubt
that will be common given the cost of the RAM behind these things).

This applies just as much however we implement this interface. So a concern
but not one specific to the 'how' part.

> 
> 2/ Yes, if the device is not actively decoding any in use memory feel
> free to run destructive operations on the device. However, is sysfs the
> right interface for "submit multi-parameter atomic operation with
> transient result"? I lean heavily into sysfs, but ioctl and netlink have
> a role to play in scenarios like this. Otherwise userspace can inject
> error records back into the kernel with the expectation that the kernel
> can only accept the DIMM address and not any of the translation data
> which might be stale.

I'd argue yes to the sysfs question.  This is common enough for other
subsystems (e.g. CXL - I still do all tests with bash scripts!) and this
really isn't a particularly complex interface. I'd expect very
lightweight tooling to be able to use it (though solutions to work out
whether to do it are way more complex). It is a simple more or less self
describing interface that directly exposes what controls are available.

If we end up reinjecting error records (which I'm still pushing back
strongly on because I am yet to see any reason to do so) then maybe
sysfs is not the right interface.  Pushing it to netlink or ioctl
doesn't change the complexity of the interface, so I'm not sure there
is any strong reason to do so.  Files are cheap and easy to use from
all manner of simple scripting.  I think from later discussion below
that we don't need to reinject which is nice!

> 
> [..]
> > > Again, the repair control assumes that the kernel can just trust
> > > userspace to get it right. When the kernel knows the SPA implications it
> > > can add safety like "you are going to issue sparing on deviceA that will
> > > temporarily take deviceA offline. CXL subsystem tells me deviceA is
> > > interleaved with deviceB in SPA so the whole SPA range needs to be
> > > offline before this operation proceeds". That is not something that
> > > userspace can reliably coordinate.  
> > 
> > Absolutely the kernel has to enforce this. Same way we protect against
> > poison injection in some cases.  Right now the enforcement is slightly
> > wrong (Shiju is looking at it again) as we were enforcing at wrong
> > granularity (specific dpa, not device). Identifying that hole is a good
> > outcome of this discussion making us take another look.
> > 
> > Enforcing this is one of the key jobs of the CXL specific driver.
> > We considered doing it in the core, but the granularity differences
> > between our initial few examples meant we decided on specific driver
> > implementations of the checks for now.  
> 
> Which specific driver? Is this not just a callback provided via the EDAC
> registration interface to say "sparing allowed"?

That would be a reasonable interface to add. There was push back in earlier
reviews on an explicit query so we backed away from this sort of thing.
The question IIRC was what was the point in querying given we could just
try it and rapidly see an error if it can't. There is a callback to say
it is disruptive but it is an exercise for userspace to figure out what
it needs to do to allow the request to succeed.

The granularity varies across the two sparing implementations I have access
to, so this interface would just be a 'can I do it now?'  Whether that
is helpful given userspace will have to be involved in getting that answer
to be 'yes' is not entirely clear to me.

So interesting thought, but I'm not yet convinced this isn't a userspace
problem + the suck it and see call to find out the answer.

> 
> Yes, this needs to avoid the midlayer mistake, but I expect more CXL
> memory exporting devices can live with the CXL core's determination that
> HDM decode is live or not.

I think this is valid for the ones that advertise that they need the
traffic to stop. From later in discussion I think we are aligned on that.

> 
> > > > > 3/ What if the device does not use DDR terminology / topology terms for
> > > > > repair?    
> > > > 
> > > > Then we provide the additional interfaces assuming they correspond to well
> > > > known terms.  If someone is using a magic key then we can get grumpy
> > > > with them, but that can also be supported.
> > > > 
> > > > Mostly I'd expect a new technology to overlap a lot of the existing
> > > > interface and maybe add one or two more; which layer in the stack for
> > > > HBM for instance.    
> > > 
> > > The concern is the assertion that sysfs needs to care about all these
> > > parameters vs an ABI that says "repair errorX". If persistence and
> > > validity of error records is the concern lets build an ABI for that and
> > > not depend upon trust in userspace to properly coordinate memory
> > > integrity concerns.  
> > 
> > It doesn't have to.  It just has to ensure that the memory device is in the correct
> > state.  So check, not coordinate. At a larger scale, coordination is already doable
> > (subject to races that we must avoid by holding locks), tear down the regions
> > so there are no mappings on the device you want to repair.  Don't bring them
> > up again until after you are done.
> > 
> > The main use case is probably do it before you bring the mappings up, but
> > same result.  
> 
> Agree.
> 
> >   
> > >   
> > > > 
> > > > The main alternative is where the device takes an HPA / SPA / DPA. We have one
> > > > driver that does that queued up behind this series that uses HPA. PPR uses
> > > > DPA.  In that case userspace first tries to see if it can repair by HPA then
> > > > DPA and if not moves on to see if it can use the fuller description.
> > > > We will see devices supporting HPA / DPA (which to use depends on when you
> > > > are doing the operation and what has been configured) but mostly I'd expect
> > > > either HPA/DPA or fine grained on a given repair instance.
> > > > 
> > > > HPA only works if the address decoders are always configured (so not on CXL)
> > > > What is actually happening in that case is typically that a firmware is
> > > > involved that can look up address decoders etc, and map the control HPA
> > > > to Bank / row etc to issue the actual low level commands.  This keeps
> > > > the memory mapping completely secret rather than exposing it in error
> > > > records.
> > > >     
> > > > > 
> > > > > I expect the flow rasdaemon would want is that the current PFA (leaky
> > > > > bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> > > > > has performed exceeds some threshold and it wants to attempt to repair
> > > > > memory.    
> > > > 
> > > > Sparing may happen prior to point where we'd have done a soft offline
> > > > if non disruptive (whether it is can be read from another bit of the
> > > > ABI).  Memory repair might be much less disruptive than soft-offline!
> > > > I rather hope memory manufacturers build that, but I'm aware of at least
> > > > one case where they didn't and the memory must be offline.    
> > > 
> > > That's a good point, spare before offline makes sense.  
> > 
> > If transparent and resources not constrained.
> > Very much not if we have to tear down the memory first.
> >   
> > > 
> > > [..]  
> > > > However, there are other usecases where this isn't needed which is why
> > > > that isn't a precursor for this series.
> > > > 
> > > > Initial enablement targets two situations:
> > > > 1) Repair can be done in non disruptive way - no need to soft offline at all.    
> > > 
> > > Modulo needing to quiesce access over the sparing event?  
> > 
> > Absolutely.  This is only doable in devices that don't need to quiesce.
> >   
> > >   
> > > > 2) Repair can be done at boot before memory is onlined or on admin
> > > >    action to take the whole region offline, then repair specific chunks of
> > > >    memory before bringing it back online.    
> > > 
> > > Which is userspace racing the kernel to online memory?  
> > 
> > If you are doing this scheme you don't automatically online memory. So
> > both are in userspace control and can be easily sequenced.
> > If you aren't auto onlining then buy devices with hard PPR and do it by offlining
> > manually, repairing and rebooting. Or buy devices that don't need to quiesce
> > and cross your fingers the dodgy ram doesn't throw an error before you get
> > that far.  Little choice if you decide to online right at the start as normal
> > memory.
> >   
> > >   
> > > > > So, yes, +1 to simpler for now where software effectively just needs to
> > > > > deal with a handful of "region repair" buttons and the semantics of
> > > > > those are coarse and sub-optimal. Wait for a future where a tool author
> > > > > says, "we have had good success getting bulk offlined pages back into
> > > > > service, but now we need this specific finer grained kernel interface to
> > > > > avoid wasting spare banks prematurely".    
> > > > 
> > > > Depends on where you think that interface is.  I can absolutely see that
> > > > as a control to RAS Daemon.  Option 2 above, region is offline, repair
> > > > all dodgy looking fine grained buckets.
> > > > 
> > > > Note though that a suboptimal repair may mean permanent use of very rare
> > > > resources.  So there needs to be a control at the finest granularity as well.
> > > > Which order those get added to userspace tools doesn't matter to me.
> > > > 
> > > > If you mean that interface in kernel it brings some non trivial requirements.
> > > > The kernel would need all of:
> > > > 1) Tracking interface for all error records so the reverse map from region
> > > >    to specific bank / row etc is available for a subset of entries.  The
> > > >    kernel would need to know which of those are important (soft offline
> > > >    might help in that use case, otherwise that means decision algorithms
> > > >    are in kernel or we have fine grained queue for region repair in parallel
> > > >    with soft-offline).
> > > > 2) A way to inject the reverse map information from a userspace store
> > > >   (to deal with reboot etc).    
> > > 
> > > Not a way to inject the reverse map information, a way to inject the
> > > error records and assert that memory topology changes have not
> > > invalidated those records.  
> > 
> > There is no way to tell that the topology hasn't changed.
> > For the reasons above, I don't think we care. Instead of trying to stop
> > userspace repairing the wrong memory, make sure it is safe for it to do that.
> > (The kernel is rarely in the business of preventing the slightly stupid)  
> 
> If the policy is "error records with SPA from the current boot can be
> trusted" and "userspace requests outside of current boot error records
> must only be submitted to known offline" then I think we are aligned.

Ok, I'm not sure the requirement is clear, but given that is probably not
too bad to do and we can rip it out later if it turns out to be overkill.
I'll discuss this with Shiju and others later today.

For now I think this belongs in the specific drivers, not the core
as they will be tracking different things and getting the data
from different paths.

How we make this work with other error records is going to be more painful but
should be easy enough for CXL non firmware first reporting. Other sources
of records can come later.

Eventually we might end up with a top level PA check in the EDAC core code,
and additional optional checks in the drivers, or factor out the shared bit
as library code. Not sure yet.

> 
> > > > That sounds a lot harder to deal with than relying on the userspace program
> > > > that already does the tracking across boots.    
> > > 
> > > I am stuck behind the barrier of userspace must not assume it knows
> > > better than the kernel about the SPA impact of a DIMM sparing
> > > event. The kernel needs evidence either live records from within the
> > > same kernel boot or validated records from a previous boot.  
> > 
> > I think this is the wrong approach.  The operation must be 'safe'.
> > With that in place we absolutely can let userspace assume it knows better than
> > the kernel.   
> 
> Violent agreement? Operation must be safe, yes, next what are the criteria
> for kernel management of safety. Offline-only repair is a great place to
> be.

Offline-only makes life easy but removes a significant use case - the above
extra checks bring that back, so they may give a way forward.  The combination
of 'only previous records' and online isn't an important one, as you can
act on the records before the memory is online.

So 'maybe' there is a path forward that adds extra guarantees.  I'm not
convinced we need them, but it seems not to restrict real use cases, so not
too bad.

Jonathan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-14 19:35                           ` Dan Williams
  2025-01-15 10:07                             ` Jonathan Cameron
@ 2025-01-15 11:35                             ` Mauro Carvalho Chehab
  1 sibling, 0 replies; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-15 11:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jonathan Cameron, Borislav Petkov, Shiju Jose,
	linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Tue, 14 Jan 2025 11:35:21 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> > There is no way to tell that the topology hasn't changed.
> > For the reasons above, I don't think we care. Instead of trying to stop
> > userspace repairing the wrong memory, make sure it is safe for it to do that.
> > (The kernel is rarely in the business of preventing the slightly stupid)  
> 
> If the policy is "error records with SPA from the current boot can be
> trusted" and "userspace requests outside of current boot error records
> must only be submitted to known offline" then I think we are aligned.

Surely userspace cannot infer whether past errors on an SPA are for the same
DPA block, but it may still decide between soft/hard PPR based on different
criteria adopted by the machine admins - or use memory sparing instead.

So, yeah, sanity checks at the kernel level to identify a "trust" level
based on having DPA data or not make sense, but the final decision about
the action should be taken in userspace.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-14 14:30     ` Shiju Jose
@ 2025-01-15 12:03       ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 87+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-15 12:03 UTC (permalink / raw)
  To: Shiju Jose
  Cc: linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, bp@alien8.de, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com, david@redhat.com,
	Vilas.Sridharan@amd.com, leo.duran@amd.com, Yazen.Ghannam@amd.com,
	rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Tue, 14 Jan 2025 14:30:53 +0000
Shiju Jose <shiju.jose@huawei.com> wrote:

> >-----Original Message-----
> >From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> >Sent: 14 January 2025 13:47
> >To: Shiju Jose <shiju.jose@huawei.com>
> >Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org; linux-
> >acpi@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org;
> >bp@alien8.de; tony.luck@intel.com; rafael@kernel.org; lenb@kernel.org;
> >mchehab@kernel.org; dan.j.williams@intel.com; dave@stgolabs.net; Jonathan
> >Cameron <jonathan.cameron@huawei.com>; dave.jiang@intel.com;
> >alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
> >david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com;
> >Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com;
> >Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
> >naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com;
> >somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com;
> >duenwen@google.com; gthelen@google.com;
> >wschwartz@amperecomputing.com; dferguson@amperecomputing.com;
> >wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei
> ><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto
> >Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com;
> >wanghuiqiang <wanghuiqiang@huawei.com>; Linuxarm
> ><linuxarm@huawei.com>
> >Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
> >
> >On Mon, 6 Jan 2025 12:10:00 +0000
> ><shiju.jose@huawei.com> wrote:
> >  
> >> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
> >> +Date:		Jan 2025
> >> +KernelVersion:	6.14
> >> +Contact:	linux-edac@vger.kernel.org
> >> +Description:
> >> +		(RO) Memory repair function type, e.g. post package repair,
> >> +		memory sparing etc.
> >> +		EDAC_SOFT_PPR - Soft post package repair
> >> +		EDAC_HARD_PPR - Hard post package repair
> >> +		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
> >> +		EDAC_ROW_MEM_SPARING - Row memory sparing
> >> +		EDAC_BANK_MEM_SPARING - Bank memory sparing
> >> +		EDAC_RANK_MEM_SPARING - Rank memory sparing
> >> +		All other values are reserved.
> >> +
> >> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
> >> +Date:		Jan 2025
> >> +KernelVersion:	6.14
> >> +Contact:	linux-edac@vger.kernel.org
> >> +Description:
> >> +		(RW) Read/Write the current persist repair mode set for a
> >> +		repair function. Depending on the memory repair function,
> >> +		the persist repair mode supported in the device is either
> >> +		temporary (lost with a power cycle) or permanent.
> >> +		EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
> >> +		EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
> >> +		All other values are reserved.
> >> +  
> >
> >After re-reading some things, I suspect that the above can be simplified a little
> >bit by folding soft/hard PPR into a single element at /repair_function, and making
> >it clearer that persist_mode is valid only for PPR (I think this is the case, right?),
> >e.g. something like:  
> persist_mode is valid for memory sparing features (at least in CXL) as well.
> In the case of CXL memory sparing, the host has the option to request either soft
> or hard sparing via a flag when issuing a memory sparing operation.

Ok.

> 
> >
> >	What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
> >	...
> >	Description:
> >			(RO) Memory repair function type, e.g. post package
> >			repair, memory sparing etc. Valid values are:
> >
> >			- ppr - post package repair.
> >			  Please define its mode via
> >			  /sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
> >			- cacheline-sparing - Cacheline memory sparing
> >			- row-sparing - Row memory sparing
> >			- bank-sparing - Bank memory sparing
> >			- rank-sparing - Rank memory sparing
> >			- All other values are reserved.
> >
> >and define persist_mode in a different way:  
> Note: to return decoded strings instead of raw values, I need to add some extra callback
> functions in edac/memory_repair.c for these attributes, which will reduce the current
> level of optimization done to minimize the code size.

You're already using a callback at EDAC_MEM_REPAIR_ATTR_SHOW macro.
So, no need for any change at the current code, except for the type
used at the EDAC_MEM_REPAIR_ATTR_SHOW() call.

Something similar to this (not tested) would work:

    int get_repair_function(struct device *dev, void *drv_data, const char **val)
    {
	unsigned int type;

	// Some logic to get repair type from *drv_data, storing into "unsigned int type"

	static const char * const repair_type[] = {
		[EDAC_SOFT_PPR] = "ppr",
		[EDAC_HARD_PPR] = "ppr",
		[EDAC_CACHELINE_MEM_SPARING] = "cacheline-sparing",
		...
	};

	if (type < ARRAY_SIZE(repair_type) && repair_type[type]) {
		*val = repair_type[type];
		return 0;
	}

	return -EINVAL;
    }

    EDAC_MEM_REPAIR_ATTR_SHOW(repair_function, get_repair_function, const char *, "%s\n");

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-13 11:07                     ` Jonathan Cameron
@ 2025-01-21 16:16                       ` Borislav Petkov
  2025-01-21 18:16                         ` Jonathan Cameron
  0 siblings, 1 reply; 87+ messages in thread
From: Borislav Petkov @ 2025-01-21 16:16 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

On Mon, Jan 13, 2025 at 11:07:40AM +0000, Jonathan Cameron wrote:
> We can do that if you prefer.  I'm not that fussed how this is handled
> because, for tooling at least, I don't see why we'd ever read it.
> It's for human parsing only and the above is fine.

Is there even a concrete use case for humans currently? Because if not, we
might as well not do it at all and keep it simple.

All I see is an avalanche of sysfs nodes and I'm questioning the usefulness of
the interface and what's the 30K ft big picture for all this.

If this all is just wishful thinking on the part of how this is going to be
used, then I agree with Dan: less is more. But I need to read the rest of that
thread when there's time.

...
> Repair can be a feature of the DIMMs themselves or it can be a feature
> of the memory controller. It is basically replacing them with spare
> memory from somewhere else (usually elsewhere on same DIMMs that have
> a bit of spare capacity for this).  Bit like a hot spare in a RAID setup.

Ooh, this is what you call repair. That's using a spare rank or so, under
which I know it as one example.

What I thought you mean with repair is what you mean with "correct". Ok,
I see.

> In some other systems the OS gets the errors and is responsible for making
> the decision.

This decision has been kept away from the OS in my world so far. So yes, the
FW doing all the RAS recovery work is more like it. And the FW is the better
agent in some sense because it has a lot more intimate knowledge of the
platform. However...

> Sticking to the corrected error case (uncorrected handling
> is going to require a lot more work given we've lost data, Dan asked about that
> in the other branch of the thread), the OS as a whole (kernel + userspace)
> gets the error records and makes the policy decision to repair based on
> assessment of risk vs resource availability to make a repair.
> 
> Two reasons for this
> 1) Hardware isn't necessarily capable of repairing autonomously as
>    other actions may be needed (memory traffic to some granularity of
>    memory may need to be stopped to avoid timeouts). Note there are many
>    graduations of this from A) can do it live with no visible effect, through
>    B) offline a page, to C) offlining the whole device.
> 2) Policy can be a lot more sophisticated than a BMC can do.

... yes, that's why you can't rely only on the FW to do recovery but involve
the OS too. Basically what I've been saying all those years. Oh well...

> In some cases perhaps, but another very strong driver is that policy is involved.
> 
> > We can either try to put a complex design in firmware and poke it with N opaque
> parameters from a userspace tool or via some out of band method or we can put
> the algorithm in userspace where it can be designed to incorporate lessons learnt
> over time.  We will start simple and see what is appropriate as this starts
> to get used in large fleets.  This stuff is a reasonable target for AI type
> algorithms etc that we aren't going to put in the kernel.
> 
> Doing this at all is a reliability optimization, normally it isn't required for
> correct operation.

I'm not saying you should put an AI engine into the kernel - all I'm saying
is, the stuff which the kernel can decide itself without user input doesn't
need user input. Only a toggle: the kernel should do this correction and/or
repair automatically or not.

What is clear here is that you can't design an interface properly right now
for algorithms which you don't have yet. And there's experience missing from
running this in large fleets.

But the interface you're adding now will remain forever cast in stone. Just
for us to realize one day that we're not really using it but it is sitting out
there dead in the water and we can't retract it. Or we're not using it as
originally designed but differently and we need this and that hack to make it
work for the current sensible use case.

So the way it looks to me right now is, you want this to be in debugfs. You
want to go nuts there, collect experience, algorithms, lessons learned etc and
*then*, the parts which are really useful and sensible should be moved to
sysfs and cast in stone. But not preemptively like that.

> Offline has no permanent cost and no limit on number of times you can
> do it. Repair is definitely a limited resource and may permanently use
> up that resource (discoverable as a policy wants to know that too!)
> In some cases once you run out of repair resources you have to send an
> engineer to replace the memory before you can do it again.

Yes, and until you can do that and because cloud doesn't want to *ever*
reboot, you must do diminished but still present machine capabilities by
offlining pages and cordoning off faulty hw, etc, etc.

> Ok. I guess it is an option (I wasn't aware of that work).
> 
> I was thinking that was far more complex to deal with than just doing it in
> userspace tooling. From a quick look that solution seems to rely on ACPI ERST
> infrastructure to provide that persistence that we won't generally have but
> I suppose we can read it from the filesystem or other persistent stores.
> We'd need to be a lot more general about that as can't make system assumptions
> that can be made in AMD specific code.
> 
> So could be done, I don't think it is a good idea in this case, but that
> example does suggest it is possible.

You can look at this as specialized solutions. Could they be more general?
Ofc. But we don't have a general RAS architecture which is vendor-agnostic.

> In the approach we are targeting, there is no round trip situation.  We let the kernel
> deal with any synchronous error just fine and run its existing logic
> to offline problematic memory.  That needs to be timely and to carry on operating
> exactly as it always has.
> 
> In parallel with that we gather the error reports that we will already be
> gathering and run analysis on those.  From that we decide if a memory is likely to fail
> again and perform a sparing operation if appropriate.
> Effectively this is 'free'. All the information is already there in userspace
> and already understood by tools like rasdaemon, we are not expanding that
> reporting interface at all.

That is fair. I think you can do that even now if the errors logged have
enough hw information to classify them and use them for predictive analysis.

> Ok.  It seems you correlate number of files with complexity.

No, wrong. I'm looking at the interface and am wondering how this is going to
be used and whether it is worth it to have it cast in stone forever.

> I correlated difficulty of understanding those files with complexity.
> Every one of the files is clearly defined and aligned with the long history
> of how to describe DRAM (see how long CPER records have used these
> fields for example - they go back to the beginning).

Ok, then pls point me to the actual use cases of how those files are going to
be used, or how they are used already.

> I'm all in favor of building an interface up by providing minimum first
> and then adding to it, but here what is proposed is the minimum for basic
> functionality and the alternative of doing the whole thing in kernel both
> puts complexity in the wrong place and restricts us in what is possible.

There's another point to consider: if this is the correct and proper solution
for *your* fleet, that doesn't necessarily mean it is the correct and
generic solution for *everybody* using the kernel. So you can imagine that I'd
like to have a generic solution which can maximally include everyone instead
of *some* special case only.

> To some degree but I think there is a major mismatch in what we think
> this is for.
> 
> What I've asked Shiju to look at is splitting the repair infrastructure
> into two cases so that maybe we can make partial progress:
> 
> 1) Systems that support repair by Physical Address
>  - Covers Post Package Repair for CXL
> 
> 2) Systems that support repair by description of the underlying hardware
> - Covers Memory Sparing interfaces for CXL. 
> 
> We need both longer term anyway, but maybe 1 is less controversial simply
> on the basis that it has fewer control parameters.
> 
> This still fundamentally puts the policy in userspace where I
> believe it belongs.

Ok, this is more concrete. Let's start with those. Can I have some more
details on how this works pls and who does what? Is it generic enough?

If not, can it live in debugfs for now? See above what I mean about this.

Big picture: what is the kernel's role here? To be a parrot to carry data
back'n'forth or can it simply do clear-cut decisions itself without the need
for userspace involvement?

So far I get the idea that this is something for your RAS needs. This should
have general usability for the rest of the kernel users - otherwise it should
remain a vendor-specific solution until it is needed by others and can be
generalized.

Also, can already existing solutions in the kernel be generalized so that you
can use them too and others can benefit from your improvements?

I hope this makes more sense.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-21 16:16                       ` Borislav Petkov
@ 2025-01-21 18:16                         ` Jonathan Cameron
  2025-01-22 19:09                           ` Borislav Petkov
  0 siblings, 1 reply; 87+ messages in thread
From: Jonathan Cameron @ 2025-01-21 18:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm, Vandana Salve

On Tue, 21 Jan 2025 17:16:53 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Mon, Jan 13, 2025 at 11:07:40AM +0000, Jonathan Cameron wrote:
> > We can do that if you prefer.  I'm not that fussed how this is handled
> > because, for tooling at least, I don't see why we'd ever read it.
> > It's for human parsing only and the above is fine.  
> 
Note we are dropping the min / max stuff in most cases as it was only
added in what seems to have been a misguided attempt to resolve an earlier
review comment. Dropping it reduces some of the complexity, and it wasn't useful anyway.

We are also splitting the patch set up differently so maybe we can
move the discussion on to the 'extended' case for repair without also
blocking the simple memory-address-based one.

> Is there even a concrete use case for humans currently? Because if not, we
> might as well not do it at all and keep it simple.

Clearly we need to provide more evidence of use cases: 'Show us your code'
seems to apply here.  We'll do that over the next few weeks.

The concrete use case is repair of CXL memory devices using sparing,
based on simple algorithms applied to the data RAS Daemon already has.
The interface, for the reasons discussed in the long thread with Dan,
is the minimum required to provide the information needed to allow
for two use cases.  We enumerated them explicitly in that discussion
because they possibly affected 'safety'.

1) Power up, pre memory online, (typically non persistent) repair of
   known bad memory.  There are two interface options for this: inject
   the prior mapping from device physical address space (host address
   is not necessarily relevant here as no address decoders have been
   programmed yet in CXL - that happens as part of the flow to bring
   the memory up), or use the information that userspace already has
   (bank, rank etc) to select what memory is to be replaced with
   spare capacity.
   
   Given the injection interface and the repair interface have to
   convey the same data, the interface complexity is identical and
   we might as well have a single step 'repair' rather than
     1. Inject prior records then
     2. Pass a physical address that is matched to one of those records.
   
   There are no security related concerns here as we always treat this
   as new memory and zero it etc as part of onlining.

2) Online case.  Here the restriction Dan proposed was that we 'check'
   that we have seen an error record on this boot that matches the full
   description.  That is matching both the physical address and the
   topology (as that mapping can change from boot to boot, but not whilst
   the memory is in use). This doesn't prevent any use case we have
   come up with yet because, if we are making a post initial onlining
   decision to repair we can assume there is a new error record that
   provided new information on which we are acting.  Hence the kernel
   had the information to check.

   Whilst I wasn't convinced that we had a definite security
   problem without this protection, it requires minimal changes and doesn't
   block the flows we care about so we are fine with adding this check.
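
Distilled into code (purely illustrative; the predicate names are mine,
not from the series):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative distillation of the two cases above: repair of offline
 * memory is always acceptable (it is treated as new and zeroed when
 * onlined), while repair of online memory additionally requires a
 * matching error record from the current boot.
 */
static bool repair_allowed(bool memory_online, bool record_matched_this_boot)
{
	if (!memory_online)
		return true;
	return record_matched_this_boot;
}
```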

> 
> All I see is an avalanche of sysfs nodes and I'm questioning the usefulness of
> the interface and what's the 30K ft big picture for all this.

Ok. We'll put together an example script / RASdaemon code to show how
it is used. I think you may be surprised at how simple this is and hopefully
that will show that the interface is appropriate.

> 
> If this all is just wishful thinking on the part of how this is going to be
> used, then I agree with Dan: less is more. But I need to read the rest of that
> thread when there's time.
> 

I'll let Dan speak for himself, but my understanding is that what
Dan is focused on is safety and other than tidying up a little isn't
proposing a significant interface change.

> ...
> > Repair can be a feature of the DIMMs themselves or it can be a feature
> > of the memory controller. It is basically replacing them with spare
> > memory from somewhere else (usually elsewhere on same DIMMs that have
> > a bit of spare capacity for this).  Bit like a hot spare in a RAID setup.  
> 
> Ooh, this is what you call repair. That's using a spare rank or so, under
> which I know it as one example.
> 
> What I thought you mean with repair is what you mean with "correct". Ok,
> I see.
> 
> > In some other systems the OS gets the errors and is responsible for making
> > the decision.  
> 
> This decision has been kept away from the OS in my world so far. So yes, the
> FW doing all the RAS recovery work is more like it. And the FW is the better
> agent in some sense because it has a lot more intimate knowledge of the
> platform. However...
> 
> > Sticking to the corrected error case (uncorrected handling
> > is going to require a lot more work given we've lost data, Dan asked about that
> > in the other branch of the thread), the OS as a whole (kernel + userspace)
> > gets the error records and makes the policy decision to repair based on
> > assessment of risk vs resource availability to make a repair.
> > 
> > Two reasons for this
> > 1) Hardware isn't necessarily capable of repairing autonomously as
> >    other actions may be needed (memory traffic to some granularity of
> >    memory may need to be stopped to avoid timeouts). Note there are many
> >    graduations of this from A) can do it live with no visible effect, through
> >    B) offline a page, to C) offlining the whole device.
> > 2) Policy can be a lot more sophisticated than a BMC can do.  
> 
> ... yes, that's why you can't rely only on the FW to do recovery but involve
> the OS too. Basically what I've been saying all those years. Oh well...

This we agree on.

> 
> > In some cases perhaps, but another very strong driver is that policy is involved.
> > 
> > We can either try to put a complex design in firmware and poke it with N opaque
> > parameters from a userspace tool or via some out of band method or we can put
> > the algorithm in userspace where it can be designed to incorporate lessons learnt
> > over time.  We will start simple and see what is appropriate as this starts
> > to get used in large fleets.  This stuff is a reasonable target for AI type
> > algorithms etc that we aren't going to put in the kernel.
> > 
> > Doing this at all is a reliability optimization, normally it isn't required for
> > correct operation.  
> 
> I'm not saying you should put an AI engine into the kernel - all I'm saying
> is, the stuff which the kernel can decide itself without user input doesn't
> need user input. Only toggle: the kernel should do this correction and/or
> repair automatically or not.

This we disagree on. For this persistent case in particular, these are limited
resources. Once you have used them all you can't do it again.  Using them
carefully is key. An exception is mentioned below as a possible extension but
it relies on a specific subset of allowed device functionality and only
covers some use cases (so it's an extra, not a replacement for what this
set does).

> 
> What is clear here is that you can't design an interface properly right now
> for algorithms which you don't have yet. And there's experience missing from
> running this in large fleets.

With the decision algorithms in userspace, we can design the userspace to kernel
interface because we don't care about the algorithm choice - only what it needs
to control, which is well defined. Algorithms will start simple and then
we'll iterate but it won't need changes in this interface because none of it
is connected to how we use the data.

> 
> But the interface you're adding now will remain forever cast in stone. Just
> for us to realize one day that we're not really using it but it is sitting out
> there dead in the water and we can't retract it. Or we're not using it as
> originally designed but differently and we need this and that hack to make it
> work for the current sensible use case.
> 
> So the way it looks to me right now is, you want this to be in debugfs. You
> want to go nuts there, collect experience, algorithms, lessons learned etc and
> *then*, the parts which are really useful and sensible should be moved to
> sysfs and cast in stone. But not preemptively like that.

In general an ABI that is used is cast in stone. To my understanding there
is nothing special about debugfs.  If we introduce a regression in tooling
that uses that interface, are we actually any better off than with sysfs?
https://lwn.net/Articles/309298/ was a good article on this a while back.

Maybe there has been a change of opinion on this that I missed.

> 
> > Offline has no permanent cost and no limit on number of times you can
> > do it. Repair is definitely a limited resource and may permanently use
> > up that resource (discoverable as a policy wants to know that too!)
> > In some cases once you run out of repair resources you have to send an
> > engineer to replace the memory before you can do it again.  
> 
> Yes, and until you can do that and because cloud doesn't want to *ever*
> reboot, you must do diminished but still present machine capabilities by
> offlining pages and cordoning off faulty hw, etc, etc.

Absolutely, though the performance impact of punching holes in memory over
time is getting pushback from some cloud folk because they can't get their
1GiB pages to put under a VM.  Mind you, that's not particularly relevant
to this thread.

> 
> > Ok. I guess it is an option (I wasn't aware of that work).
> > 
> > I was thinking that was far more complex to deal with than just doing it in
> > userspace tooling. From a quick look that solution seems to rely on ACPI ERST
> > infrastructure to provide that persistence that we won't generally have but
> > I suppose we can read it from the filesystem or other persistent stores.
> > We'd need to be a lot more general about that as can't make system assumptions
> > that can be made in AMD specific code.
> > 
> > So could be done, I don't think it is a good idea in this case, but that
> > example does suggest it is possible.  
> 
> You can look at this as specialized solutions. Could they be more general?
> Ofc. But we don't have a general RAS architecture which is vendor-agnostic.

It could perhaps be made more general, but so far I'm not seeing why we would
do this for this particular feature.  It does seem like an interesting avenue
to investigate more generally.

The use cases discussed in the thread with Dan do not require injection of
prior records.  Mauro also called out his view that complex policy should not
be in the kernel and, as you have gathered, I fully agree with him on
that!

> 
> > In the approach we are targeting, there is no round trip situation.  We let the kernel
> > deal with any synchronous error just fine and run its existing logic
> > to offline problematic memory.  That needs to be timely and to carry on operating
> > exactly as it always has.
> > 
> > In parallel with that we gather the error reports that we will already be
> > gathering and run analysis on those.  From that we decide if a memory is likely to fail
> > again and perform a sparing operation if appropriate.
> > Effectively this is 'free'. All the information is already there in userspace
> > and already understood by tools like rasdaemon, we are not expanding that
> > reporting interface at all.  
> 
> That is fair. I think you can do that even now if the errors logged have
> enough hw information to classify them and use them for predictive analysis.

Definitely.  In general (outside of CXL in particular) we think there are
some gaps that we'll look to address in future, but that's simple stuff
for another patch series.
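
To make the 'all the information is already in userspace' point concrete,
here is a minimal sketch (purely hypothetical - not from this series, and
not how rasdaemon is actually structured) of the kind of policy a userspace
tool can apply to already-logged corrected errors:

```c
/*
 * Hypothetical sketch of a userspace policy over already-logged corrected
 * errors: flag a DIMM for a sparing request once it exceeds a threshold of
 * errors inside a look-back window.  Names and thresholds are illustrative.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CE_THRESHOLD	16		/* corrected errors before we act */
#define CE_WINDOW_SECS	(24L * 60 * 60)	/* look-back window */

struct ce_record {
	long timestamp;			/* seconds */
	int dimm;			/* logical DIMM id from the error log */
};

/* True if @dimm logged more than CE_THRESHOLD errors in the window ending at @now */
static bool dimm_needs_repair(const struct ce_record *log, size_t n,
			      int dimm, long now)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (log[i].dimm == dimm &&
		    now - log[i].timestamp <= CE_WINDOW_SECS)
			count++;

	return count > CE_THRESHOLD;
}
```

The real decision logic would weigh error type and locality as well; the
point is only that it runs entirely on data the kernel already exports.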

> 
> > Ok.  It seems you correlate the number of files with complexity.  
> 
> No, wrong. I'm looking at the interface and am wondering how is this going to
> be used and whether it is worth it to have it cast in stone forever.

Ok.  I'm not concerned by this, because of the alignment with old specifications
going back a long way.  I'm not sure of the history of the CXL definition, but
it is near identical to the UEFI CPER records that have been in use for a long
time.  That convinces me that this form of device description is pretty general
and stable.  I'm sure we'll get small new things over time (sub-channel came
with DDR5, for example) but those are minor additions.
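
For reference, the description in question boils down to a handful of DRAM
topology fields.  A rough sketch (field names loosely follow the CPER Memory
Error Section; this is an illustration, not a kernel structure):

```c
/*
 * Illustrative sketch of the DRAM topology fields shared by UEFI CPER
 * memory error records and the sparing control attributes discussed here.
 * Field widths and names are approximate.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct dram_location {
	uint8_t  channel;
	uint8_t  sub_channel;	/* new with DDR5 */
	uint8_t  rank;
	uint8_t  bank_group;
	uint8_t  bank;
	uint32_t row;
	uint32_t column;
};

/* Do two records land in the same repairable unit (here: a bank)? */
static bool same_bank(const struct dram_location *a,
		      const struct dram_location *b)
{
	return a->channel == b->channel &&
	       a->sub_channel == b->sub_channel &&
	       a->rank == b->rank &&
	       a->bank_group == b->bank_group &&
	       a->bank == b->bank;
}
```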

> 
> > I correlated the difficulty of understanding those files with complexity.
> > Every one of the files is clearly defined and aligned with a long history
> > of how to describe DRAM (see how long CPER records have used these
> > fields, for example - they go back to the beginning).  
> 
> Ok, then pls point me to the actual use cases how those files are going to be
> used or they are used already.

For this we'll do as we did for scrub control and send a patch set adding
tooling to rasdaemon and/or, if more appropriate, a script alongside it.  My
fault: I wrongly thought this one was more obvious and that we could leave it
until this landed. Seems not!

> 
> > I'm all in favor of building an interface up by providing minimum first
> > and then adding to it, but here what is proposed is the minimum for basic
> > functionality and the alternative of doing the whole thing in kernel both
> > puts complexity in the wrong place and restricts us in what is possible.  
> 
> There's another point to consider: if this is the correct and proper solution
> for *your* fleet, that doesn't necessarily mean it is the correct and
> generic solution for *everybody* using the kernel. So you can imagine that I'd
> like to have a generic solution which can maximally include everyone instead
> of *some* special case only.

This I agree on. However, if CXL takes off (and there seems to be agreement
that it will, to some degree at least) then this interface is fully general
for any spec-compliant device. It would be nice to have visibility of more
OS managed repair interfaces, but for now I can only see one other, and that
is more similar to PPR in CXL, which is device/host physical address based.
So we have three examples on which to build generalizations, but only one fits
the model we are discussing here (which is the second part of repair
support in the v19 patch set).

> 
> > To some degree but I think there is a major mismatch in what we think
> > this is for.
> > 
> > What I've asked Shiju to look at is splitting the repair infrastructure
> > into two cases so that maybe we can make partial progress:
> > 
> > 1) Systems that support repair by Physical Address
> >  - Covers Post Package Repair for CXL
> > 
> > 2) Systems that support repair by description of the underlying hardware
> > - Covers Memory Sparing interfaces for CXL. 
> > 
> > We need both longer term anyway, but maybe 1 is less controversial simply
> > on basis it has fewer control parameters
> > 
> > This still fundamentally puts the policy in userspace where I
> > believe it belongs.  
> 
> Ok, this is more concrete. Let's start with those. Can I have some more
> details on how this works pls and who does what? Is it generic enough?

Sure, we can definitely do that.  We have this split in v19 (just undergoing
some final docs tidy-up etc.; it should be posted soon).

> 
> If not, can it live in debugfs for now? See above what I mean about this.
> 
> Big picture: what is the kernel's role here? To be a parrot to carry data
> back'n'forth or can it simply do clear-cut decisions itself without the need
> for userspace involvement?

With the additional 'safety' checks Dan has asked for, the kernel (beyond
providing a standardized interface / place to look for the controls etc.) is
now responsible for ensuring that a request to repair memory that is online
matches up with an error record that we have received. The first instance of
this will be based on CXL native error handling.  For now we've made this
device specific, because exactly what needs checking depends on the type of
repair implementation.
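
To sketch what that check amounts to (hypothetical structures and names,
not the actual v19 code):

```c
/*
 * Minimal sketch of the 'safety check': a repair request for memory that
 * is online is only honoured if it matches a previously logged error
 * record.  Offline memory can be repaired without such a match.
 */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct repair_req { uint64_t dpa; };	/* device physical address */
struct err_record { uint64_t dpa; };

static int validate_repair(const struct repair_req *req,
			   const struct err_record *log, size_t n,
			   bool memory_online)
{
	/* Offline memory: no live data at risk, allow the repair */
	if (!memory_online)
		return 0;

	/* Online memory: require a matching prior error record */
	for (size_t i = 0; i < n; i++)
		if (log[i].dpa == req->dpa)
			return 0;

	return -EINVAL;
}
```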

My intent was that the kernel never makes decisions for this feature.

Potentially in future we could relax that to allow it to do so for a few
use cases - the non-persistent ones, where it could be considered a way
to avoid offlining a page.  I see that as a much more complex case than
the userspace-managed handling though, so one for future work.
It only applies to some subset of devices, and there are not enough in
the market yet for us to know if that is going to be commonly supported.
They will support repair, but whether they allow online repair rather
than only offline repair is yet to be seen. That question corresponds to a
single attribute in sysfs and a couple of checks in the driver code,
but changes whether the second use case above is possible.

Early devices and the ones in a few years' time may make different
decisions on this. All options are covered by this driver (autonomous
repair is covered for free, as there is nothing to do!)

Online sparing to avoid offlining is only a cute idea at the moment.

> 
> So far I get the idea that this is something for your RAS needs. This should
> have general usability for the rest of the kernel users - otherwise it should
> remain a vendor-specific solution until it is needed by others and can be
> generalized.

CXL is not vendor specific. Our other driver, the one I keep referring
to as 'coming soon', is though.  I'll see if I can get a few memory
device manufacturers to specifically stick their hands up and say that
they care about this. As an example, we presented on this topic with
Micron at the LPC CXL uconf (+CC Vandana).  I don't have access
to Micron parts, so this isn't just Huawei using Micron; we simply had two
proposals on the same topic, so we combined the sessions.  We have a CXL
open source sync call in an hour, so I'll ask there.

> 
> Also, can already existing solutions in the kernel be generalized so that you
> can use them too and others can benefit from your improvements?

Maybe for the follow-on topic of non-persistent repair as a path to
avoid offlining memory detected as bad. Maybe that counts
as generalization (rather than extension).  But that doesn't cover
our use case of re-establishing the offline state at boot, or the
persistent use cases.  So it's a value-add feature for a follow-up
effort, not a baseline one, which is the intent of this patch set.

> 
> I hope this makes more sense.

Thanks for taking the time to continue the discussion; I think we
are converging somewhat, even if there is further to go.

Jonathan
> 
> Thx.
> 


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 05/19] ACPI:RAS2: Add ACPI RAS2 driver
  2025-01-06 12:10 ` [PATCH v18 05/19] ACPI:RAS2: Add ACPI RAS2 driver shiju.jose
@ 2025-01-21 23:01   ` Daniel Ferguson
  2025-01-22 15:38     ` Shiju Jose
  2025-01-30 19:19   ` Daniel Ferguson
  1 sibling, 1 reply; 87+ messages in thread
From: Daniel Ferguson @ 2025-01-21 23:01 UTC (permalink / raw)
  To: shiju.jose, linux-edac, linux-cxl, linux-acpi, linux-mm,
	linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm



On 1/6/2025 4:10 AM, shiju.jose@huawei.com wrote:
> +static int ras2_report_cap_error(u32 cap_status)
> +{
> +	switch (cap_status) {
> +	case ACPI_RAS2_NOT_VALID:
> +	case ACPI_RAS2_NOT_SUPPORTED:
> +		return -EPERM;
> +	case ACPI_RAS2_BUSY:
> +		return -EBUSY;
> +	case ACPI_RAS2_FAILED:
> +	case ACPI_RAS2_ABORTED:
> +	case ACPI_RAS2_INVALID_DATA:
> +		return -EINVAL;
> +	default: /* 0 or other, Success */
> +		return 0;
> +	}
> +}
> +
> +static int ras2_check_pcc_chan(struct ras2_pcc_subspace *pcc_subspace)
> +{
> +	struct acpi_ras2_shared_memory __iomem *generic_comm_base = pcc_subspace->pcc_comm_addr;
> +	ktime_t next_deadline = ktime_add(ktime_get(), pcc_subspace->deadline);
> +	u32 cap_status;
> +	u16 status;
> +	u32 ret;
> +
> +	while (!ktime_after(ktime_get(), next_deadline)) {
> +		/*
> +		 * As per ACPI spec, the PCC space will be initialized by
> +		 * platform and should have set the command completion bit when
> +		 * PCC can be used by OSPM
> +		 */
> +		status = readw_relaxed(&generic_comm_base->status);
> +		if (status & RAS2_PCC_CMD_ERROR) {
> +			cap_status = readw_relaxed(&generic_comm_base->set_capabilities_status);
> +			ret = ras2_report_cap_error(cap_status);

There is some new information:

The scrub parameter block is intended to get its own Status field, and the
set_capabilities_status field is being deprecated. This also causes a
revision bump in the spec.

See [1]

Assuming this change is ratified (not guaranteed, still pending):
this change implies that we cannot centrally decode errors, as is done
here and now. Instead, error decoding must be done after each
feature-specific routine calls ras2_send_pcc_cmd. It is likely that
each new feature, moving forward, will have its own status field.

Please see my follow up comment on [PATCH v18 06/19]

---
[1] https://github.com/tianocore/edk2/issues/10540
---

> +
> +			status &= ~RAS2_PCC_CMD_ERROR;
> +			writew_relaxed(status, &generic_comm_base->status);
> +			return ret;
> +		}
> +		if (status & RAS2_PCC_CMD_COMPLETE)
> +			return 0;
> +		/*
> +		 * Reducing the bus traffic in case this loop takes longer than
> +		 * a few retries.
> +		 */
> +		msleep(10);
> +	}
> +
> +	return -EIO;
> +}
> +
> +/**
> + * ras2_send_pcc_cmd() - Send RAS2 command via PCC channel
> + * @ras2_ctx:	pointer to the RAS2 context structure
> + * @cmd:	command to send
> + *
> + * Returns: 0 on success, an error otherwise
> + */
> +int ras2_send_pcc_cmd(struct ras2_mem_ctx *ras2_ctx, u16 cmd)
> +{
> +	struct ras2_pcc_subspace *pcc_subspace = ras2_ctx->pcc_subspace;
> +	struct acpi_ras2_shared_memory *generic_comm_base = pcc_subspace->pcc_comm_addr;
> +	static ktime_t last_cmd_cmpl_time, last_mpar_reset;
> +	struct mbox_chan *pcc_channel;
> +	unsigned int time_delta;
> +	static int mpar_count;
> +	int ret;
> +
> +	guard(mutex)(&ras2_pcc_subspace_lock);
> +	ret = ras2_check_pcc_chan(pcc_subspace);
> +	if (ret < 0)
> +		return ret;
> +	pcc_channel = pcc_subspace->pcc_chan->mchan;
> +
> +	/*
> +	 * Handle the Minimum Request Turnaround Time(MRTT)
> +	 * "The minimum amount of time that OSPM must wait after the completion
> +	 * of a command before issuing the next command, in microseconds"
> +	 */
> +	if (pcc_subspace->pcc_mrtt) {
> +		time_delta = ktime_us_delta(ktime_get(), last_cmd_cmpl_time);
> +		if (pcc_subspace->pcc_mrtt > time_delta)
> +			udelay(pcc_subspace->pcc_mrtt - time_delta);
> +	}
> +
> +	/*
> +	 * Handle the non-zero Maximum Periodic Access Rate(MPAR)
> +	 * "The maximum number of periodic requests that the subspace channel can
> +	 * support, reported in commands per minute. 0 indicates no limitation."
> +	 *
> +	 * This parameter should be ideally zero or large enough so that it can
> +	 * handle maximum number of requests that all the cores in the system can
> +	 * collectively generate. If it is not, we will follow the spec and just
> +	 * not send the request to the platform after hitting the MPAR limit in
> +	 * any 60s window
> +	 */
> +	if (pcc_subspace->pcc_mpar) {
> +		if (mpar_count == 0) {
> +			time_delta = ktime_ms_delta(ktime_get(), last_mpar_reset);
> +			if (time_delta < 60 * MSEC_PER_SEC) {
> +				dev_dbg(ras2_ctx->dev,
> +					"PCC cmd not sent due to MPAR limit");
> +				return -EIO;
> +			}
> +			last_mpar_reset = ktime_get();
> +			mpar_count = pcc_subspace->pcc_mpar;
> +		}
> +		mpar_count--;
> +	}
> +
> +	/* Write to the shared comm region. */
> +	writew_relaxed(cmd, &generic_comm_base->command);
> +
> +	/* Flip CMD COMPLETE bit */
> +	writew_relaxed(0, &generic_comm_base->status);
> +
> +	/* Ring doorbell */
> +	ret = mbox_send_message(pcc_channel, &cmd);
> +	if (ret < 0) {
> +		dev_err(ras2_ctx->dev,
> +			"Err sending PCC mbox message. cmd:%d, ret:%d\n",
> +			cmd, ret);
> +		return ret;
> +	}
> +
> +	/*
> +	 * If Minimum Request Turnaround Time is non-zero, we need
> +	 * to record the completion time of both READ and WRITE
> +	 * command for proper handling of MRTT, so we need to check
> +	 * for pcc_mrtt in addition to CMD_READ
> +	 */
> +	if (cmd == RAS2_PCC_CMD_EXEC || pcc_subspace->pcc_mrtt) {
> +		ret = ras2_check_pcc_chan(pcc_subspace);
> +		if (pcc_subspace->pcc_mrtt)
> +			last_cmd_cmpl_time = ktime_get();
> +	}
> +
> +	if (pcc_channel->mbox->txdone_irq)
> +		mbox_chan_txdone(pcc_channel, ret);
> +	else
> +		mbox_client_txdone(pcc_channel, ret);
> +
> +	return ret >= 0 ? 0 : ret;
> +}
> +EXPORT_SYMBOL_GPL(ras2_send_pcc_cmd);
> +
> +static int ras2_register_pcc_channel(struct ras2_mem_ctx *ras2_ctx, int pcc_subspace_id)
> +{
> +	struct ras2_pcc_subspace *pcc_subspace;
> +	struct pcc_mbox_chan *pcc_chan;
> +	struct mbox_client *mbox_cl;
> +
> +	if (pcc_subspace_id < 0)
> +		return -EINVAL;
> +
> +	mutex_lock(&ras2_pcc_subspace_lock);
> +	list_for_each_entry(pcc_subspace, &ras2_pcc_subspaces, elem) {
> +		if (pcc_subspace->pcc_subspace_id == pcc_subspace_id) {
> +			ras2_ctx->pcc_subspace = pcc_subspace;
> +			pcc_subspace->ref_count++;
> +			mutex_unlock(&ras2_pcc_subspace_lock);
> +			return 0;
> +		}
> +	}
> +	mutex_unlock(&ras2_pcc_subspace_lock);
> +
> +	pcc_subspace = kcalloc(1, sizeof(*pcc_subspace), GFP_KERNEL);
> +	if (!pcc_subspace)
> +		return -ENOMEM;
> +	mbox_cl = &pcc_subspace->mbox_client;
> +	mbox_cl->knows_txdone = true;
> +
> +	pcc_chan = pcc_mbox_request_channel(mbox_cl, pcc_subspace_id);
> +	if (IS_ERR(pcc_chan)) {
> +		kfree(pcc_subspace);
> +		return PTR_ERR(pcc_chan);
> +	}
> +	*pcc_subspace = (struct ras2_pcc_subspace) {
> +		.pcc_subspace_id = pcc_subspace_id,
> +		.pcc_chan = pcc_chan,
> +		.pcc_comm_addr = acpi_os_ioremap(pcc_chan->shmem_base_addr,
> +						 pcc_chan->shmem_size),
> +		.deadline = ns_to_ktime(RAS2_NUM_RETRIES *
> +					pcc_chan->latency *
> +					NSEC_PER_USEC),
> +		.pcc_mrtt = pcc_chan->min_turnaround_time,
> +		.pcc_mpar = pcc_chan->max_access_rate,
> +		.mbox_client = {
> +			.knows_txdone = true,
> +		},
> +		.pcc_channel_acquired = true,
> +	};
> +	mutex_lock(&ras2_pcc_subspace_lock);
> +	list_add(&pcc_subspace->elem, &ras2_pcc_subspaces);
> +	pcc_subspace->ref_count++;
> +	mutex_unlock(&ras2_pcc_subspace_lock);
> +	ras2_ctx->pcc_subspace = pcc_subspace;
> +	ras2_ctx->pcc_comm_addr = pcc_subspace->pcc_comm_addr;
> +	ras2_ctx->dev = pcc_chan->mchan->mbox->dev;
> +
> +	return 0;
> +}
> +
> +static DEFINE_IDA(ras2_ida);
> +static void ras2_remove_pcc(struct ras2_mem_ctx *ras2_ctx)
> +{
> +	struct ras2_pcc_subspace *pcc_subspace = ras2_ctx->pcc_subspace;
> +
> +	guard(mutex)(&ras2_pcc_subspace_lock);
> +	if (pcc_subspace->ref_count > 0)
> +		pcc_subspace->ref_count--;
> +	if (!pcc_subspace->ref_count) {
> +		list_del(&pcc_subspace->elem);
> +		pcc_mbox_free_channel(pcc_subspace->pcc_chan);
> +		kfree(pcc_subspace);
> +	}
> +}
> +
> +static void ras2_release(struct device *device)
> +{
> +	struct auxiliary_device *auxdev = container_of(device, struct auxiliary_device, dev);
> +	struct ras2_mem_ctx *ras2_ctx = container_of(auxdev, struct ras2_mem_ctx, adev);
> +
> +	ida_free(&ras2_ida, auxdev->id);
> +	ras2_remove_pcc(ras2_ctx);
> +	kfree(ras2_ctx);
> +}
> +
> +static struct ras2_mem_ctx *ras2_add_aux_device(char *name, int channel)
> +{
> +	struct ras2_mem_ctx *ras2_ctx;
> +	int id, ret;
> +
> +	ras2_ctx = kzalloc(sizeof(*ras2_ctx), GFP_KERNEL);
> +	if (!ras2_ctx)
> +		return ERR_PTR(-ENOMEM);
> +
> +	mutex_init(&ras2_ctx->lock);
> +
> +	ret = ras2_register_pcc_channel(ras2_ctx, channel);
> +	if (ret < 0) {
> +		pr_debug("failed to register pcc channel ret=%d\n", ret);
> +		goto ctx_free;
> +	}
> +
> +	id = ida_alloc(&ras2_ida, GFP_KERNEL);
> +	if (id < 0) {
> +		ret = id;
> +		goto pcc_free;
> +	}
> +	ras2_ctx->id = id;
> +	ras2_ctx->adev.id = id;
> +	ras2_ctx->adev.name = RAS2_MEM_DEV_ID_NAME;
> +	ras2_ctx->adev.dev.release = ras2_release;
> +	ras2_ctx->adev.dev.parent = ras2_ctx->dev;
> +
> +	ret = auxiliary_device_init(&ras2_ctx->adev);
> +	if (ret)
> +		goto ida_free;
> +
> +	ret = auxiliary_device_add(&ras2_ctx->adev);
> +	if (ret) {
> +		auxiliary_device_uninit(&ras2_ctx->adev);
> +		return ERR_PTR(ret);
> +	}
> +
> +	return ras2_ctx;
> +
> +ida_free:
> +	ida_free(&ras2_ida, id);
> +pcc_free:
> +	ras2_remove_pcc(ras2_ctx);
> +ctx_free:
> +	kfree(ras2_ctx);
> +	return ERR_PTR(ret);
> +}
> +
> +static int __init ras2_acpi_init(void)
> +{
> +	struct acpi_table_header *pAcpiTable = NULL;
> +	struct acpi_ras2_pcc_desc *pcc_desc_list;
> +	struct acpi_table_ras2 *pRas2Table;
> +	struct ras2_mem_ctx *ras2_ctx;
> +	int pcc_subspace_id;
> +	acpi_size ras2_size;
> +	acpi_status status;
> +	u8 count = 0, i;
> +	int ret = 0;
> +
> +	status = acpi_get_table("RAS2", 0, &pAcpiTable);
> +	if (ACPI_FAILURE(status) || !pAcpiTable) {
> +		pr_err("ACPI RAS2 driver failed to initialize, get table failed\n");
> +		return -EINVAL;
> +	}
> +
> +	ras2_size = pAcpiTable->length;
> +	if (ras2_size < sizeof(struct acpi_table_ras2)) {
> +		pr_err("ACPI RAS2 table present but broken (too short #1)\n");
> +		ret = -EINVAL;
> +		goto free_ras2_table;
> +	}
> +
> +	pRas2Table = (struct acpi_table_ras2 *)pAcpiTable;
> +	if (pRas2Table->num_pcc_descs <= 0) {
> +		pr_err("ACPI RAS2 table does not contain PCC descriptors\n");
> +		ret = -EINVAL;
> +		goto free_ras2_table;
> +	}
> +
> +	pcc_desc_list = (struct acpi_ras2_pcc_desc *)(pRas2Table + 1);
> +	/* Double scan for the case of only one actual controller */
> +	pcc_subspace_id = -1;
> +	count = 0;
> +	for (i = 0; i < pRas2Table->num_pcc_descs; i++, pcc_desc_list++) {
> +		if (pcc_desc_list->feature_type != RAS2_FEATURE_TYPE_MEMORY)
> +			continue;
> +		if (pcc_subspace_id == -1) {
> +			pcc_subspace_id = pcc_desc_list->channel_id;
> +			count++;
> +		}
> +		if (pcc_desc_list->channel_id != pcc_subspace_id)
> +			count++;
> +	}
> +	/*
> +	 * Workaround for the client platform with multiple scrub devices
> +	 * but uses single PCC subspace for communication.
> +	 */
> +	if (count == 1) {
> +		/* Add auxiliary device and bind ACPI RAS2 memory driver */
> +		ras2_ctx = ras2_add_aux_device(RAS2_MEM_DEV_ID_NAME, pcc_subspace_id);
> +		if (IS_ERR(ras2_ctx)) {
> +			ret = PTR_ERR(ras2_ctx);
> +			goto free_ras2_table;
> +		}
> +		acpi_put_table(pAcpiTable);
> +		return 0;
> +	}
> +
> +	pcc_desc_list = (struct acpi_ras2_pcc_desc *)(pRas2Table + 1);
> +	count = 0;
> +	for (i = 0; i < pRas2Table->num_pcc_descs; i++, pcc_desc_list++) {
> +		if (pcc_desc_list->feature_type != RAS2_FEATURE_TYPE_MEMORY)
> +			continue;
> +		pcc_subspace_id = pcc_desc_list->channel_id;
> +		/* Add auxiliary device and bind ACPI RAS2 memory driver */
> +		ras2_ctx = ras2_add_aux_device(RAS2_MEM_DEV_ID_NAME, pcc_subspace_id);
> +		if (IS_ERR(ras2_ctx)) {
> +			ret = PTR_ERR(ras2_ctx);
> +			goto free_ras2_table;
> +		}
> +	}
> +
> +free_ras2_table:
> +	acpi_put_table(pAcpiTable);
> +	return ret;
> +}
> +late_initcall(ras2_acpi_init)
> diff --git a/include/acpi/ras2_acpi.h b/include/acpi/ras2_acpi.h
> new file mode 100644
> index 000000000000..7b32407ae2af
> --- /dev/null
> +++ b/include/acpi/ras2_acpi.h
> @@ -0,0 +1,45 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * RAS2 ACPI driver header file
> + *
> + * (C) Copyright 2014, 2015 Hewlett-Packard Enterprises
> + *
> + * Copyright (c) 2024 HiSilicon Limited
> + */
> +
> +#ifndef _RAS2_ACPI_H
> +#define _RAS2_ACPI_H
> +
> +#include <linux/acpi.h>
> +#include <linux/auxiliary_bus.h>
> +#include <linux/mailbox_client.h>
> +#include <linux/mutex.h>
> +#include <linux/types.h>
> +
> +#define RAS2_PCC_CMD_COMPLETE	BIT(0)
> +#define RAS2_PCC_CMD_ERROR	BIT(2)
> +
> +/* RAS2 specific PCC commands */
> +#define RAS2_PCC_CMD_EXEC 0x01
> +
> +#define RAS2_AUX_DEV_NAME "ras2"
> +#define RAS2_MEM_DEV_ID_NAME "acpi_ras2_mem"
> +
> +/* Data structure RAS2 table */
> +struct ras2_mem_ctx {
> +	struct auxiliary_device adev;
> +	/* Lock to provide mutually exclusive access to PCC channel */
> +	struct mutex lock;
> +	int id;
> +	u8 instance;
> +	bool bg;
> +	u64 base, size;
> +	u8 scrub_cycle_hrs, min_scrub_cycle, max_scrub_cycle;
> +	struct device *dev;
> +	struct device *scrub_dev;
> +	void *pcc_subspace;
> +	struct acpi_ras2_shared_memory __iomem *pcc_comm_addr;
> +};


Could we break the ras2_mem_ctx structure up a little so that when
we add a new feature, it is easier to add a new context?

In the following example, we show what it *might* look like if we add
another feature (Address Translation). The ask here is to split the
ras2_mem_ctx structure into two structures: ras2_mem_scrub_ctx and
ras2_mem_ctx.

struct ras2_mem_address_translation_ctx {
	struct device *at_dev;
	...
};

struct ras2_mem_scrub_ctx {
	struct device *scrub_dev;
	bool bg;
	u64 base, size;
	u8 scrub_cycle_hrs, min_scrub_cycle, max_scrub_cycle;
};


/* Data structure RAS2 table */
struct ras2_mem_ctx {
	struct auxiliary_device adev;
	/* Lock to provide mutually exclusive access to PCC channel */
	struct mutex lock;
	int id;
	u8 instance;
	struct device *dev;
	void *pcc_subspace;
	struct acpi_ras2_shared_memory __iomem *pcc_comm_addr;

	struct ras2_mem_scrub_ctx scrub;
	struct ras2_mem_address_translation_ctx at;
};


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 06/19] ras: mem: Add memory ACPI RAS2 driver
  2025-01-06 12:10 ` [PATCH v18 06/19] ras: mem: Add memory " shiju.jose
@ 2025-01-21 23:01   ` Daniel Ferguson
  2025-01-30 19:19   ` Daniel Ferguson
  1 sibling, 0 replies; 87+ messages in thread
From: Daniel Ferguson @ 2025-01-21 23:01 UTC (permalink / raw)
  To: shiju.jose, linux-edac, linux-cxl, linux-acpi, linux-mm,
	linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm



On 1/6/2025 4:10 AM, shiju.jose@huawei.com wrote:
> +#define pr_fmt(fmt)	"MEMORY ACPI RAS2: " fmt
> +
> +#include <linux/bitfield.h>
> +#include <linux/edac.h>
> +#include <linux/platform_device.h>
> +#include <acpi/ras2_acpi.h>
> +
> +#define RAS2_DEV_NUM_RAS_FEATURES	1
> +
> +#define RAS2_SUPPORT_HW_PARTOL_SCRUB	BIT(0)
> +#define RAS2_TYPE_PATROL_SCRUB	0x0000
> +
> +#define RAS2_GET_PATROL_PARAMETERS	0x01
> +#define	RAS2_START_PATROL_SCRUBBER	0x02
> +#define	RAS2_STOP_PATROL_SCRUBBER	0x03
> +
> +#define RAS2_PATROL_SCRUB_SCHRS_IN_MASK	GENMASK(15, 8)
> +#define RAS2_PATROL_SCRUB_EN_BACKGROUND	BIT(0)
> +#define RAS2_PATROL_SCRUB_SCHRS_OUT_MASK	GENMASK(7, 0)
> +#define RAS2_PATROL_SCRUB_MIN_SCHRS_OUT_MASK	GENMASK(15, 8)
> +#define RAS2_PATROL_SCRUB_MAX_SCHRS_OUT_MASK	GENMASK(23, 16)
> +#define RAS2_PATROL_SCRUB_FLAG_SCRUBBER_RUNNING	BIT(0)
> +
> +#define RAS2_SCRUB_NAME_LEN      128
> +#define RAS2_HOUR_IN_SECS    3600
> +
> +struct acpi_ras2_ps_shared_mem {
> +	struct acpi_ras2_shared_memory common;
> +	struct acpi_ras2_patrol_scrub_parameter params;
> +};
> +

If the ACPI change here [1] comes to fruition, then checking for errors
will (or at least may) have to be done by each individual feature. To show
how that might look, I've included a possible implementation below.

static int ras2_scrub_map_status_to_error(u32 cap_status)
{
	switch (cap_status) {
	case ACPI_RAS2_NOT_VALID:
	case ACPI_RAS2_NOT_SUPPORTED:
		return -EPERM;
	case ACPI_RAS2_BUSY:
		return -EBUSY;
	case ACPI_RAS2_FAILED:
	case ACPI_RAS2_ABORTED:
	case ACPI_RAS2_INVALID_DATA:
		return -EINVAL;
	default: /* 0 or other, Success */
		return 0;
	}
}

[1] https://github.com/tianocore/edk2/issues/10540

> +static int ras2_is_patrol_scrub_support(struct ras2_mem_ctx *ras2_ctx)
> +{
> +	struct acpi_ras2_shared_memory __iomem *common = (void *)
> +						ras2_ctx->pcc_comm_addr;
> +
> +	guard(mutex)(&ras2_ctx->lock);
> +	common->set_capabilities[0] = 0;
> +
> +	return common->features[0] & RAS2_SUPPORT_HW_PARTOL_SCRUB;
> +}
> +
> +static int ras2_update_patrol_scrub_params_cache(struct ras2_mem_ctx *ras2_ctx)
> +{
> +	struct acpi_ras2_ps_shared_mem __iomem *ps_sm = (void *)
> +						ras2_ctx->pcc_comm_addr;
> +	int ret;
> +
> +	ps_sm->common.set_capabilities[0] = RAS2_SUPPORT_HW_PARTOL_SCRUB;
> +	ps_sm->params.patrol_scrub_command = RAS2_GET_PATROL_PARAMETERS;
> +
> +	ret = ras2_send_pcc_cmd(ras2_ctx, RAS2_PCC_CMD_EXEC);
> +	if (ret) {
> +		dev_err(ras2_ctx->dev, "failed to read parameters\n");
> +		return ret;
> +	}


ret = ras2_scrub_map_status_to_error(ps_sm->params.status);
if (ret != 0)
	return ret;

> +
> +	ras2_ctx->min_scrub_cycle = FIELD_GET(RAS2_PATROL_SCRUB_MIN_SCHRS_OUT_MASK,
> +					      ps_sm->params.scrub_params_out);
> +	ras2_ctx->max_scrub_cycle = FIELD_GET(RAS2_PATROL_SCRUB_MAX_SCHRS_OUT_MASK,
> +					      ps_sm->params.scrub_params_out);
> +	if (!ras2_ctx->bg) {
> +		ras2_ctx->base = ps_sm->params.actual_address_range[0];
> +		ras2_ctx->size = ps_sm->params.actual_address_range[1];
> +	}
> +	ras2_ctx->scrub_cycle_hrs = FIELD_GET(RAS2_PATROL_SCRUB_SCHRS_OUT_MASK,
> +					      ps_sm->params.scrub_params_out);
> +
> +	return 0;
> +}
> +
> +/* Context - lock must be held */
> +static int ras2_get_patrol_scrub_running(struct ras2_mem_ctx *ras2_ctx,
> +					 bool *running)
> +{
> +	struct acpi_ras2_ps_shared_mem __iomem *ps_sm = (void *)
> +						ras2_ctx->pcc_comm_addr;
> +	int ret;
> +
> +	ps_sm->common.set_capabilities[0] = RAS2_SUPPORT_HW_PARTOL_SCRUB;
> +	ps_sm->params.patrol_scrub_command = RAS2_GET_PATROL_PARAMETERS;
> +
> +	ret = ras2_send_pcc_cmd(ras2_ctx, RAS2_PCC_CMD_EXEC);
> +	if (ret) {
> +		dev_err(ras2_ctx->dev, "failed to read parameters\n");
> +		return ret;
> +	}

ret = ras2_scrub_map_status_to_error(ps_sm->params.status);
if (ret != 0)
	return ret;

> +
> +	*running = ps_sm->params.flags & RAS2_PATROL_SCRUB_FLAG_SCRUBBER_RUNNING;
> +
> +	return 0;
> +}
> +
> +static int ras2_hw_scrub_read_min_scrub_cycle(struct device *dev, void *drv_data,
> +					      u32 *min)
> +{
> +	struct ras2_mem_ctx *ras2_ctx = drv_data;
> +
> +	*min = ras2_ctx->min_scrub_cycle * RAS2_HOUR_IN_SECS;
> +
> +	return 0;
> +}
> +
> +static int ras2_hw_scrub_read_max_scrub_cycle(struct device *dev, void *drv_data,
> +					      u32 *max)
> +{
> +	struct ras2_mem_ctx *ras2_ctx = drv_data;
> +
> +	*max = ras2_ctx->max_scrub_cycle * RAS2_HOUR_IN_SECS;
> +
> +	return 0;
> +}
> +
> +static int ras2_hw_scrub_cycle_read(struct device *dev, void *drv_data,
> +				    u32 *scrub_cycle_secs)
> +{
> +	struct ras2_mem_ctx *ras2_ctx = drv_data;
> +
> +	*scrub_cycle_secs = ras2_ctx->scrub_cycle_hrs * RAS2_HOUR_IN_SECS;
> +
> +	return 0;
> +}
> +
> +static int ras2_hw_scrub_cycle_write(struct device *dev, void *drv_data,
> +				     u32 scrub_cycle_secs)
> +{
> +	u8 scrub_cycle_hrs = scrub_cycle_secs / RAS2_HOUR_IN_SECS;
> +	struct ras2_mem_ctx *ras2_ctx = drv_data;
> +	bool running;
> +	int ret;
> +
> +	guard(mutex)(&ras2_ctx->lock);
> +	ret = ras2_get_patrol_scrub_running(ras2_ctx, &running);
> +	if (ret)
> +		return ret;
> +
> +	if (running)
> +		return -EBUSY;
> +
> +	if (scrub_cycle_hrs < ras2_ctx->min_scrub_cycle ||
> +	    scrub_cycle_hrs > ras2_ctx->max_scrub_cycle)
> +		return -EINVAL;
> +
> +	ras2_ctx->scrub_cycle_hrs = scrub_cycle_hrs;
> +
> +	return 0;
> +}
> +
> +static int ras2_hw_scrub_read_addr(struct device *dev, void *drv_data, u64 *base)
> +{
> +	struct ras2_mem_ctx *ras2_ctx = drv_data;
> +	int ret;
> +
> +	/*
> +	 * When BG scrubbing is enabled the actual address range is not valid.
> +	 * Return -EBUSY now unless find out a method to retrieve actual full PA range.
> +	 */
> +	if (ras2_ctx->bg)
> +		return -EBUSY;
> +
> +	/*
> +	 * When demand scrubbing is finished firmware must reset actual
> +	 * address range to 0. Otherwise userspace assumes demand scrubbing
> +	 * is in progress.
> +	 */
> +	ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
> +	if (ret)
> +		return ret;
> +	*base = ras2_ctx->base;
> +
> +	return 0;
> +}
> +
> +static int ras2_hw_scrub_read_size(struct device *dev, void *drv_data, u64 *size)
> +{
> +	struct ras2_mem_ctx *ras2_ctx = drv_data;
> +	int ret;
> +
> +	if (ras2_ctx->bg)
> +		return -EBUSY;
> +
> +	ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
> +	if (ret)
> +		return ret;
> +	*size = ras2_ctx->size;
> +
> +	return 0;
> +}
> +
> +static int ras2_hw_scrub_write_addr(struct device *dev, void *drv_data, u64 base)
> +{
> +	struct ras2_mem_ctx *ras2_ctx = drv_data;
> +	struct acpi_ras2_ps_shared_mem __iomem *ps_sm = (void *)
> +						ras2_ctx->pcc_comm_addr;
> +	bool running;
> +	int ret;
> +
> +	guard(mutex)(&ras2_ctx->lock);
> +	ps_sm->common.set_capabilities[0] = RAS2_SUPPORT_HW_PARTOL_SCRUB;
> +	if (ras2_ctx->bg)
> +		return -EBUSY;
> +
> +	if (!base || !ras2_ctx->size) {
> +		dev_warn(ras2_ctx->dev,
> +			 "%s: Invalid address range, base=0x%llx "
> +			 "size=0x%llx\n", __func__,
> +			 base, ras2_ctx->size);
> +		return -ERANGE;
> +	}
> +
> +	ret = ras2_get_patrol_scrub_running(ras2_ctx, &running);
> +	if (ret)
> +		return ret;
> +
> +	if (running)
> +		return -EBUSY;
> +
> +	ps_sm->params.scrub_params_in &= ~RAS2_PATROL_SCRUB_SCHRS_IN_MASK;
> +	ps_sm->params.scrub_params_in |= FIELD_PREP(RAS2_PATROL_SCRUB_SCHRS_IN_MASK,
> +						    ras2_ctx->scrub_cycle_hrs);
> +	ps_sm->params.requested_address_range[0] = base;
> +	ps_sm->params.requested_address_range[1] = ras2_ctx->size;
> +	ps_sm->params.scrub_params_in &= ~RAS2_PATROL_SCRUB_EN_BACKGROUND;
> +	ps_sm->params.patrol_scrub_command = RAS2_START_PATROL_SCRUBBER;
> +
> +	ret = ras2_send_pcc_cmd(ras2_ctx, RAS2_PCC_CMD_EXEC);
> +	if (ret) {
> +		dev_err(ras2_ctx->dev, "Failed to start demand scrubbing\n");
> +		return ret;
> +	}
ret = ras2_scrub_map_status_to_error(ps_sm->params.status);
if (ret != 0)
	return ret;

> +
> +	return ras2_update_patrol_scrub_params_cache(ras2_ctx);
> +}
> +
> +static int ras2_hw_scrub_write_size(struct device *dev, void *drv_data, u64 size)
> +{
> +	struct ras2_mem_ctx *ras2_ctx = drv_data;
> +	bool running;
> +	int ret;
> +
> +	guard(mutex)(&ras2_ctx->lock);
> +	ret = ras2_get_patrol_scrub_running(ras2_ctx, &running);
> +	if (ret)
> +		return ret;
> +
> +	if (running)
> +		return -EBUSY;
> +
> +	if (!size) {
> +		dev_warn(dev, "%s: Invalid address range size=0x%llx\n",
> +			 __func__, size);
> +		return -EINVAL;
> +	}
> +
> +	ras2_ctx->size = size;
> +
> +	return 0;
> +}
> +
> +static int ras2_hw_scrub_set_enabled_bg(struct device *dev, void *drv_data, bool enable)
> +{
> +	struct ras2_mem_ctx *ras2_ctx = drv_data;
> +	struct acpi_ras2_ps_shared_mem __iomem *ps_sm = (void *)
> +						ras2_ctx->pcc_comm_addr;
> +	bool running;
> +	int ret;
> +
> +	guard(mutex)(&ras2_ctx->lock);
> +	ps_sm->common.set_capabilities[0] = RAS2_SUPPORT_HW_PARTOL_SCRUB;
> +	ret = ras2_get_patrol_scrub_running(ras2_ctx, &running);
> +	if (ret)
> +		return ret;
> +	if (enable) {
> +		if (ras2_ctx->bg || running)
> +			return -EBUSY;
> +		ps_sm->params.requested_address_range[0] = 0;
> +		ps_sm->params.requested_address_range[1] = 0;
> +		ps_sm->params.scrub_params_in &= ~RAS2_PATROL_SCRUB_SCHRS_IN_MASK;
> +		ps_sm->params.scrub_params_in |= FIELD_PREP(RAS2_PATROL_SCRUB_SCHRS_IN_MASK,
> +							    ras2_ctx->scrub_cycle_hrs);
> +		ps_sm->params.patrol_scrub_command = RAS2_START_PATROL_SCRUBBER;
> +	} else {
> +		if (!ras2_ctx->bg)
> +			return -EPERM;
> +		if (!ras2_ctx->bg && running)
> +			return -EBUSY;
> +		ps_sm->params.patrol_scrub_command = RAS2_STOP_PATROL_SCRUBBER;
> +	}
> +	ps_sm->params.scrub_params_in &= ~RAS2_PATROL_SCRUB_EN_BACKGROUND;
> +	ps_sm->params.scrub_params_in |= FIELD_PREP(RAS2_PATROL_SCRUB_EN_BACKGROUND,
> +						    enable);
> +	ret = ras2_send_pcc_cmd(ras2_ctx, RAS2_PCC_CMD_EXEC);
> +	if (ret) {
> +		dev_err(ras2_ctx->dev, "Failed to %s background scrubbing\n",
> +			enable ? "enable" : "disable");
> +		return ret;
> +	}
> +	ret = ras2_scrub_map_status_to_error(ps_sm->params.status);
> +	if (ret)
> +		return ret;
> +
> +	if (enable) {
> +		ras2_ctx->bg = true;
> +		/* Update the cache to account for rounding of supplied parameters and similar */
> +		ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
> +	} else {
> +		ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
> +		ras2_ctx->bg = false;
> +	}
> +
> +	return ret;
> +}



^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v18 05/19] ACPI:RAS2: Add ACPI RAS2 driver
  2025-01-21 23:01   ` Daniel Ferguson
@ 2025-01-22 15:38     ` Shiju Jose
  0 siblings, 0 replies; 87+ messages in thread
From: Shiju Jose @ 2025-01-22 15:38 UTC (permalink / raw)
  To: Daniel Ferguson, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
  Cc: bp@alien8.de, tony.luck@intel.com, rafael@kernel.org,
	lenb@kernel.org, mchehab@kernel.org, dan.j.williams@intel.com,
	dave@stgolabs.net, Jonathan Cameron, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm


>-----Original Message-----
>From: Daniel Ferguson <danielf@os.amperecomputing.com>
>Sent: 21 January 2025 23:01
>Subject: Re: [PATCH v18 05/19] ACPI:RAS2: Add ACPI RAS2 driver
>
>
>
>On 1/6/2025 4:10 AM, shiju.jose@huawei.com wrote:
>> +static int ras2_report_cap_error(u32 cap_status)
>> +{
>> +	switch (cap_status) {
>> +	case ACPI_RAS2_NOT_VALID:
>> +	case ACPI_RAS2_NOT_SUPPORTED:
>> +		return -EPERM;
>> +	case ACPI_RAS2_BUSY:
>> +		return -EBUSY;
>> +	case ACPI_RAS2_FAILED:
>> +	case ACPI_RAS2_ABORTED:
>> +	case ACPI_RAS2_INVALID_DATA:
>> +		return -EINVAL;
>> +	default: /* 0 or other, Success */
>> +		return 0;
>> +	}
>> +}
>> +
>> +static int ras2_check_pcc_chan(struct ras2_pcc_subspace *pcc_subspace)
>> +{
>> +	struct acpi_ras2_shared_memory __iomem *generic_comm_base = pcc_subspace->pcc_comm_addr;
>> +	ktime_t next_deadline = ktime_add(ktime_get(), pcc_subspace->deadline);
>> +	u32 cap_status;
>> +	u16 status;
>> +	u32 ret;
>> +
>> +	while (!ktime_after(ktime_get(), next_deadline)) {
>> +		/*
>> +		 * As per ACPI spec, the PCC space will be initialized by
>> +		 * platform and should have set the command completion bit when
>> +		 * PCC can be used by OSPM
>> +		 */
>> +		status = readw_relaxed(&generic_comm_base->status);
>> +		if (status & RAS2_PCC_CMD_ERROR) {
>> +			cap_status = readw_relaxed(&generic_comm_base->set_capabilities_status);
>> +			ret = ras2_report_cap_error(cap_status);
>
>There is some new information:
>
>The Scrub parameter block intends to get its own Status field, and the
>set_capabilities_status field is being deprecated. This also causes a revision
>bump in the spec.
>
>See [1]
>
>Assuming this change is ratified (not guaranteed, still pending):
>This change implies that we cannot centrally decode errors, as is done here
>and now. Instead, error decoding must be done after a feature-specific
>routine calls ras2_send_pcc_cmd; going forward, each new feature will likely
>have its own status field.
>
>Please see my follow up comment on [PATCH v18 06/19]
>
>---
>[1] https://github.com/tianocore/edk2/issues/10540

Hi Daniel,

Thanks for the information and suggested modifications.

We will track the change and, assuming it lands as an ECN, we will try to
add support in a backwards-compatible fashion.
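To make the direction concrete, a per-feature status decode along the lines Daniel suggests might look like the sketch below. The `RAS2_STATUS_*` values mirror the ACPI_RAS2_* capability-status codes quoted earlier in this thread; the exact field name and encoding in the scrub parameter block are assumptions until the ECN is ratified, so treat this as illustration only.

```c
#include <errno.h>

/*
 * Hypothetical per-feature status decode, mirroring the existing
 * ras2_report_cap_error() but reading the scrub parameter block's own
 * status field once the pending spec change lands. Names and values
 * here are assumptions based on the ACPI_RAS2_* encoding, not
 * ratified spec.
 */
enum {
	RAS2_STATUS_SUCCESS		= 0,
	RAS2_STATUS_NOT_VALID		= 1,
	RAS2_STATUS_NOT_SUPPORTED	= 2,
	RAS2_STATUS_BUSY		= 3,
	RAS2_STATUS_FAILED		= 4,
	RAS2_STATUS_ABORTED		= 5,
	RAS2_STATUS_INVALID_DATA	= 6,
};

static int ras2_scrub_map_status_to_error(unsigned int status)
{
	switch (status) {
	case RAS2_STATUS_NOT_VALID:
	case RAS2_STATUS_NOT_SUPPORTED:
		return -EPERM;
	case RAS2_STATUS_BUSY:
		return -EBUSY;
	case RAS2_STATUS_FAILED:
	case RAS2_STATUS_ABORTED:
	case RAS2_STATUS_INVALID_DATA:
		return -EINVAL;
	default:	/* 0 or other, success */
		return 0;
	}
}
```

Each feature's command path would then call its own mapper right after ras2_send_pcc_cmd(), rather than relying on the central set_capabilities_status decode.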

>---
>
>> +
[...]
>
Thanks,
Shiju


* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-21 18:16                         ` Jonathan Cameron
@ 2025-01-22 19:09                           ` Borislav Petkov
  2025-02-06 13:39                             ` Jonathan Cameron
  0 siblings, 1 reply; 87+ messages in thread
From: Borislav Petkov @ 2025-01-22 19:09 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm, Vandana Salve

On Tue, Jan 21, 2025 at 06:16:32PM +0000, Jonathan Cameron wrote:
> Clearly we need to provide more evidence of use cases: 'Show us your code'
> seems to apply here.  We'll do that over the next few weeks.

Thanks.

> based on simple algorithms applied to the data RAS Daemon already has.
> The interface for the reasons discussed in the long thread with Dan
> is the minimum required to provide the information needed to allow
> for two use cases.  We enumerated them explicitly in the discussion with
> Dan because they possibly affected 'safety'.
> 
> 1) Power up, pre memory online, (typically non persistent) repair of
>    known bad memory.

Lemme make sure I understand this: during boot you simply know from somewhere
that a certain rank (let's use rank for simplicity's sake) is faulty. Before
you online the memory, you simply replace that rank in the logic so that the
system uses the spare rank while the faulty rank is disabled.

>    There are two interface options for this, inject the prior mapping from
>    device physical address space (host address is not necessarily relevant
>    here as no address decoders have been programmed yet in CXL - that
>    happens as part of the flow to bring the memory up), or use the
>    information that userspace already has (bank, rank etc) to select what
>    memory is to be replaced with spare capacity.

Ok, so this is all CXL-specific because this use case relies on userspace
being present. Which means you cannot really use this for DIMMs used during
boot. So if DIMMs are involved, they would need to be online-able later, once
userspace is there.

>    Given the injection interface and the repair interface have to
>    convey the same data, the interface complexity is identical and
>    we might as well have a single step 'repair' rather than
>      1. Inject prior records then

What exactly is this injecting? The faulty rank? Which then would cause the
respective driver to go and do that repairing.

Which then means that you can online that device after rasdaemon has loaded
and has the required info to online it.

Which then means, rasdaemon needs to be part of the device onlining process.

I'm simply conjecturing here - I guess I'll see your detailed use case later.

>      2. Pass a physical address that is matched to one of those records.

I don't know what that one does.

>    There are no security related concerns here as we always treat this
>    as new memory and zero it etc as part of onlining.

Right, goes without saying.

> 2) Online case.  Here the restriction Dan proposed was that we 'check'
>    that we have seen an error record on this boot that matches the full
>    description.  That is matching both the physical address and the
>    topology (as that mapping can change from boot to boot, but not whilst
>    the memory is in use). This doesn't prevent any use case we have
>    come up with yet because, if we are making a post initial onlining
>    decision to repair we can assume there is a new error record that
>    provided new information on which we are acting.  Hence the kernel
>    had the information to check.
> 
>    Whilst I wasn't convinced that we had a definite security
>    problem without this protection, it requires minimal changes and doesn't
>    block the flows we care about so we are fine with adding this check.

I need more detail on that 2nd case - lemme read that other subthread.

> Ok. We'll put together an example script / RASdaemon code to show how
> it is used. I think you may be surprised at how simple this is and hopefully
> that will show that the interface is appropriate.

That sounds good, thanks.
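For a flavor of how little sits on top of the sysfs ABI, the attribute-writing core that such a script or RASdaemon add-on reduces to is sketched below. The helper and the example paths (taken from the scrub.rst examples quoted in this thread) are illustrative only, not proposed RASdaemon code.

```c
#include <stdio.h>

/*
 * Sketch of the kind of helper a tool would use to drive the EDAC
 * scrub attributes shown in the quoted documentation, e.g.:
 *
 *   write_attr("/sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration",
 *              "54000");
 *   write_attr("/sys/bus/edac/devices/cxl_mem0/scrub0/enable_background",
 *              "1");
 *
 * The sysfs paths come from the documentation in this thread; the
 * helper itself is an illustrative assumption.
 */
static int write_attr(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	int ret = 0;

	if (!f)
		return -1;
	if (fputs(val, f) < 0)
		ret = -1;
	if (fclose(f) != 0)	/* sysfs stores often report errors at close */
		ret = -1;
	return ret;
}
```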

> This we disagree on. For this persistent case in particular these are limited
> resources. Once you have used them all you can't do it again.  Using them
> carefully is key. An exception is mentioned below as a possible extension but
> it relies on a specific subset of allowed device functionality and only
> covers some use cases (so it's an extra, not a replacement for what this
> set does).

By "this persistent case" you mean collecting logs per error address,
collating them and massaging them or hunting them through a neural network to
recognize potential patterns and then act upon them?

In any case, I don't mean that - I mean something simple like: "after X errors
on address Y, offline page Z." Like we do with .../ras/cec.c. Ofc you can't
put really complex handling in the kernel and why would you - it must be *the*
best thing after sliced bread to impose that on everyone.

All I'm saying is, simple logic like that can be in the kernel if it is useful
in the general case. You don't *have* to carry all logic in some userspace
daemon - the kernel can be smart too :-)
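As a sketch of that "simple logic in the kernel" idea, the policy reduces to a per-page error counter with a threshold, in the spirit of ras/cec.c. The flat table, names, and threshold below are illustrative assumptions, not the cec.c implementation (which additionally decays counts over time and handles table overflow):

```c
#include <stdbool.h>
#include <stdint.h>

#define POLICY_MAX_PAGES	8
#define POLICY_THRESHOLD	3	/* "after X errors on address Y" */

struct page_err {
	uint64_t pfn;
	unsigned int count;
};

static struct page_err table[POLICY_MAX_PAGES];

/*
 * Record one corrected error against a page frame; return true when
 * the page has crossed the threshold and should be offlined
 * ("offline page Z"). New pages are silently dropped if the table is
 * full; a real implementation would evict the coldest entry.
 */
static bool policy_record_error(uint64_t pfn)
{
	int i, free_slot = -1;

	for (i = 0; i < POLICY_MAX_PAGES; i++) {
		if (table[i].count && table[i].pfn == pfn)
			return ++table[i].count >= POLICY_THRESHOLD;
		if (!table[i].count && free_slot < 0)
			free_slot = i;
	}
	if (free_slot >= 0) {
		table[free_slot].pfn = pfn;
		table[free_slot].count = 1;
	}
	return false;
}
```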

> With the decision algorithms in userspace, we can design the userspace to kernel
> interface because we don't care about the algorithm choice - only what it needs
> to control which is well defined. Algorithms will start simple and then
> we'll iterate but it won't need changes in this interface because none of it
> is connected to how we use the data.

Are you saying that this interface you have right now is the necessary and
sufficient set of sysfs nodes which will be enough for most algorithms in
userspace?

And you won't have to change it because you realize down the road that it is
not enough?

> In general an ABI that is used is cast in stone. To my understanding there
> is nothing special about debugfs.  If we introduce a regression in tooling
> that uses that interface are we actually any better off than sysfs?
> https://lwn.net/Articles/309298/ was a good article on this a while back.
> 
> Maybe there has been a change of opinion on this that I missed.

I don't think so and I can see that article's point. So let's cut to the
chase: what are we going to do when the sysfs or debugfs nodes you've added
become insufficient and you or someone else needs to change them in the
future, for their specific use case?

The last paragraph of that article basically sums it up pretty nicely.

> Absolutely, though the performance impact of punching holes in memory over
> time is getting some cloud folk pushing back because they can't get their
> 1GiB pages to put under a VM.  Mind you that's not particularly relevant
> to this thread.

What is relevant to this thread is the fact that you can't simply reboot as
a RAS recovery action. Not in all cases.

> For this we'll do as we did for scrub control and send a patch set adding tooling
> to RASdaemon and/or, if more appropriate, a script alongside it.  My fault,
> I wrongly assumed this one was more obvious and we could leave that until
> this landed. Seems not!

Sorry, I can't always guess the use case by looking solely at the sysfs nodes.

> This I agree on. However, if CXL takes off (and there seems to be agreement
> it will to some degree at least) then this interface is fully general for any spec
> compliant device.

Ok, sounds good.

> Sure. We can definitely do that.  We have this split in v19 (just undergoing
> some final docs tidy up etc, should be posted soon).

Thx.

You don't have to rush it - we have merge window anyway.

> Early devices and the ones in a few years time may make different
> decisions on this. All options are covered by this driver (autonomous
> repair is covered for free as nothing to do!)

Don't forget devices which deviate from the spec because they were implemented
wrong. It happens and we have to support them because no one else cares but
people have already paid for them and want to use them.

> CXL is not vendor specific. Our other driver that I keep referring
> to as 'coming soon' is though.  I'll see if I can get a few memory
> device manufacturers to specifically stick their hands up that they
> care about this. As an example we presented on this topic with
> Micron at the LPC CXL uconf (+CC Vandana).  I don't have access
> to Micron parts so this isn't just Huawei using Micron, we simply had two
> proposals on the same topic so combined the sessions.  We have a CXL
> open source sync call in an hour so I'll ask there.

Having hw vendors agree on a single driver and Linux implementing it would be
ofc optimal.

> Maybe for the follow on topic of non persistent repair as a path to
> avoid offlining memory detected as bad. Maybe that counts
> as generalization (rather than extension).  But that's not covering
> our use case of re-establishing the offline at boot, or the persistent
> use cases.  So it's a value-add feature for a follow-up effort,
> not a baseline one which is the intent of this patch set.

Ok, I think this whole pile should simply be in two parts: generic, CXL-spec
implementing, vendor-agnostic pieces and vendor-specific drivers which use
that.

It'll be lovely if vendors could agree on this interface you're proposing but
I won't hold my breath...

> Thanks for taking time to continue the discussion and I think we
> are converging somewhat even if there is further to go.

Yap, I think so. A lot of things got cleared up for me too, so thanks too.
I'm sure you know what the important things are that we need to pay attention
to when it comes to designing this with a broader audience in mind.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol scrub control feature
  2025-01-06 12:10 ` [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol scrub control feature shiju.jose
@ 2025-01-24 20:38   ` Dan Williams
  2025-01-27 10:06     ` Jonathan Cameron
  2025-01-27 12:53     ` Shiju Jose
  0 siblings, 2 replies; 87+ messages in thread
From: Dan Williams @ 2025-01-24 20:38 UTC (permalink / raw)
  To: shiju.jose, linux-edac, linux-cxl, linux-acpi, linux-mm,
	linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm, shiju.jose

shiju.jose@ wrote:
> From: Shiju Jose <shiju.jose@huawei.com>
> 
> CXL spec 3.1 section 8.2.9.9.11.1 describes the device patrol scrub control
> feature. The device patrol scrub proactively locates and makes corrections
> to errors on a regular cycle.
> 
> Allow specifying the number of hours within which the patrol scrub must be
> completed, subject to minimum and maximum limits reported by the device.
> Also allow disabling scrub, allowing a trade-off of error rates against
> performance.
> 
> Add support for patrol scrub control on CXL memory devices.
> Register with the EDAC device driver, which retrieves the scrub attribute
> descriptors from EDAC scrub and exposes the sysfs scrub control attributes
> to userspace. For example, scrub control for the CXL memory device
> "cxl_mem0" is exposed in /sys/bus/edac/devices/cxl_mem0/scrubX/.
> 
> Additionally, add support for region-based CXL memory patrol scrub control.
> CXL memory regions may be interleaved across one or more CXL memory
> devices. For example, region-based scrub control for "cxl_region1" is
> exposed in /sys/bus/edac/devices/cxl_region1/scrubX/.
> 
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> ---
>  Documentation/edac/scrub.rst  |  66 ++++++
>  drivers/cxl/Kconfig           |  17 ++
>  drivers/cxl/core/Makefile     |   1 +
>  drivers/cxl/core/memfeature.c | 392 ++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/region.c     |   6 +
>  drivers/cxl/cxlmem.h          |   7 +
>  drivers/cxl/mem.c             |   5 +
>  include/cxl/features.h        |  16 ++
>  8 files changed, 510 insertions(+)
>  create mode 100644 drivers/cxl/core/memfeature.c
> diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
> index f86645c7f0af..80e986c57885 100644
> --- a/Documentation/edac/scrub.rst
> +++ b/Documentation/edac/scrub.rst
> @@ -325,3 +325,69 @@ root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_d
>  10800
>  
>  root@localhost:~# echo 0 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/enable_background
> +
> +2. CXL memory device patrol scrubber
> +
> +2.1 Device based scrubbing
> +
> +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/min_cycle_duration
> +
> +3600
> +
> +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/max_cycle_duration
> +
> +918000
> +
> +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
> +
> +43200
> +
> +root@localhost:~# echo 54000 > /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
> +
> +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
> +
> +54000
> +
> +root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
> +
> +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
> +
> +1
> +
> +root@localhost:~# echo 0 > /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
> +
> +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
> +
> +0
> +
> +2.2. Region based scrubbing
> +
> +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/min_cycle_duration
> +
> +3600
> +
> +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/max_cycle_duration
> +
> +918000
> +
> +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
> +
> +43200
> +
> +root@localhost:~# echo 54000 > /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
> +
> +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
> +
> +54000
> +
> +root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
> +
> +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
> +
> +1
> +
> +root@localhost:~# echo 0 > /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
> +
> +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/enable_background

What is this content-free blob of cat and echo statements? Please write actual
documentation with theory of operation, clarification of assumptions,
rationale for defaults, guidance on changing defaults... 

> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 0bc6a2cb8474..6078f02e883b 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -154,4 +154,21 @@ config CXL_FEATURES
>  
>  	  If unsure say 'y'.
>  
> +config CXL_RAS_FEATURES
> +	tristate "CXL: Memory RAS features"
> +	depends on CXL_PCI

What is the build dependency on CXL_PCI? This enabling does not call back into
symbols provided by cxl_pci.ko does it?

> +	depends on CXL_MEM

Similar comment, and this also goes away if all of this just moves into
the new cxl_features driver.

> +	depends on EDAC
> +	help
> +	  The CXL memory RAS feature control is optional and allows host to
> +	  control the RAS features configurations of CXL Type 3 devices.
> +
> +	  It registers with the EDAC device subsystem to expose control
> +	  attributes of CXL memory device's RAS features to the user.
> +	  It provides interface functions to support configuring the CXL
> +	  memory device's RAS features.
> +	  Say 'y/m/n' to enable/disable control of the CXL.mem device's RAS features.
> +	  See section 8.2.9.9.11 of CXL 3.1 specification for the detailed
> +	  information of CXL memory device features.

Usually the "say X" statement provides a rationale, like:

"Say y/m if you have an expert need to change default memory scrub rates established
by the platform/device, otherwise say n"

> +
>  endif
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 73b6348afd67..54baca513ecb 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -17,3 +17,4 @@ cxl_core-y += cdat.o
>  cxl_core-y += features.o
>  cxl_core-$(CONFIG_TRACING) += trace.o
>  cxl_core-$(CONFIG_CXL_REGION) += region.o
> +cxl_core-$(CONFIG_CXL_RAS_FEATURES) += memfeature.o
> diff --git a/drivers/cxl/core/memfeature.c b/drivers/cxl/core/memfeature.c
> new file mode 100644
> index 000000000000..77d1bf6ce45f
> --- /dev/null
> +++ b/drivers/cxl/core/memfeature.c
> @@ -0,0 +1,392 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * CXL memory RAS feature driver.
> + *
> + * Copyright (c) 2024 HiSilicon Limited.
> + *
> + *  - Supports functions to configure RAS features of the
> + *    CXL memory devices.
> + *  - Registers with the EDAC device subsystem driver to expose
> + *    the features sysfs attributes to the user for configuring
> + *    CXL memory RAS feature.
> + */
> +
> +#include <linux/cleanup.h>
> +#include <linux/edac.h>
> +#include <linux/limits.h>
> +#include <cxl/features.h>
> +#include <cxl.h>
> +#include <cxlmem.h>
> +
> +#define CXL_DEV_NUM_RAS_FEATURES	1
> +#define CXL_DEV_HOUR_IN_SECS	3600
> +
> +#define CXL_DEV_NAME_LEN	128
> +
> +/* CXL memory patrol scrub control functions */
> +struct cxl_patrol_scrub_context {
> +	u8 instance;
> +	u16 get_feat_size;
> +	u16 set_feat_size;
> +	u8 get_version;
> +	u8 set_version;
> +	u16 effects;
> +	struct cxl_memdev *cxlmd;
> +	struct cxl_region *cxlr;
> +};
> +
> +/**
> + * struct cxl_memdev_ps_params - CXL memory patrol scrub parameter data structure.
> + * @enable:     [IN & OUT] enable(1)/disable(0) patrol scrub.
> + * @scrub_cycle_changeable: [OUT] scrub cycle attribute of patrol scrub is changeable.
> + * @scrub_cycle_hrs:    [IN] Requested patrol scrub cycle in hours.
> + *                      [OUT] Current patrol scrub cycle in hours.
> + * @min_scrub_cycle_hrs:[OUT] minimum patrol scrub cycle in hours supported.
> + */
> +struct cxl_memdev_ps_params {
> +	bool enable;
> +	bool scrub_cycle_changeable;
> +	u8 scrub_cycle_hrs;
> +	u8 min_scrub_cycle_hrs;
> +};
> +
> +enum cxl_scrub_param {
> +	CXL_PS_PARAM_ENABLE,
> +	CXL_PS_PARAM_SCRUB_CYCLE,
> +};
> +
> +#define CXL_MEMDEV_PS_SCRUB_CYCLE_CHANGE_CAP_MASK	BIT(0)
> +#define	CXL_MEMDEV_PS_SCRUB_CYCLE_REALTIME_REPORT_CAP_MASK	BIT(1)
> +#define	CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK	GENMASK(7, 0)
> +#define	CXL_MEMDEV_PS_MIN_SCRUB_CYCLE_MASK	GENMASK(15, 8)
> +#define	CXL_MEMDEV_PS_FLAG_ENABLED_MASK	BIT(0)
> +
> +struct cxl_memdev_ps_rd_attrs {
> +	u8 scrub_cycle_cap;
> +	__le16 scrub_cycle_hrs;
> +	u8 scrub_flags;
> +}  __packed;
> +
> +struct cxl_memdev_ps_wr_attrs {
> +	u8 scrub_cycle_hrs;
> +	u8 scrub_flags;
> +}  __packed;

If these are packed to match specification layout, include a
specification reference comment.

> +
> +static int cxl_mem_ps_get_attrs(struct cxl_mailbox *cxl_mbox,
> +				struct cxl_memdev_ps_params *params)
> +{
> +	size_t rd_data_size = sizeof(struct cxl_memdev_ps_rd_attrs);
> +	u16 scrub_cycle_hrs;
> +	size_t data_size;
> +	u16 return_code;
> +	struct cxl_memdev_ps_rd_attrs *rd_attrs __free(kfree) =
> +						kmalloc(rd_data_size, GFP_KERNEL);

I would feel better with kzalloc() if short reads are possible.

How big can rd_data_size get? I.e. should this be kvzalloc()?

> +	if (!rd_attrs)
> +		return -ENOMEM;
> +
> +	data_size = cxl_get_feature(cxl_mbox->features, CXL_FEAT_PATROL_SCRUB_UUID,
> +				    CXL_GET_FEAT_SEL_CURRENT_VALUE,
> +				    rd_attrs, rd_data_size, 0, &return_code);
> +	if (!data_size || return_code != CXL_MBOX_CMD_RC_SUCCESS)
> +		return -EIO;
> +
> +	params->scrub_cycle_changeable = FIELD_GET(CXL_MEMDEV_PS_SCRUB_CYCLE_CHANGE_CAP_MASK,
> +						   rd_attrs->scrub_cycle_cap);
> +	params->enable = FIELD_GET(CXL_MEMDEV_PS_FLAG_ENABLED_MASK,
> +				   rd_attrs->scrub_flags);
> +	scrub_cycle_hrs = le16_to_cpu(rd_attrs->scrub_cycle_hrs);
> +	params->scrub_cycle_hrs = FIELD_GET(CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK,
> +					    scrub_cycle_hrs);
> +	params->min_scrub_cycle_hrs = FIELD_GET(CXL_MEMDEV_PS_MIN_SCRUB_CYCLE_MASK,
> +						scrub_cycle_hrs);
> +
> +	return 0;
> +}
> +
> +static int cxl_ps_get_attrs(struct cxl_patrol_scrub_context *cxl_ps_ctx,
> +			    struct cxl_memdev_ps_params *params)
> +{
> +	struct cxl_memdev *cxlmd;
> +	u16 min_scrub_cycle = 0;
> +	int i, ret;
> +
> +	if (cxl_ps_ctx->cxlr) {
> +		struct cxl_region *cxlr = cxl_ps_ctx->cxlr;
> +		struct cxl_region_params *p = &cxlr->params;
> +
> +		for (i = p->interleave_ways - 1; i >= 0; i--) {
> +			struct cxl_endpoint_decoder *cxled = p->targets[i];

It looks like this is called directly as a callback from EDAC. Where is
the locking that keeps cxl_ps_ctx->cxlr valid, or p->targets content
stable?

> +
> +			cxlmd = cxled_to_memdev(cxled);
> +			ret = cxl_mem_ps_get_attrs(&cxlmd->cxlds->cxl_mbox, params);
> +			if (ret)
> +				return ret;
> +
> +			if (params->min_scrub_cycle_hrs > min_scrub_cycle)
> +				min_scrub_cycle = params->min_scrub_cycle_hrs;
> +		}
> +		params->min_scrub_cycle_hrs = min_scrub_cycle;
> +		return 0;
> +	}
> +	cxlmd = cxl_ps_ctx->cxlmd;
> +
> +	return cxl_mem_ps_get_attrs(&cxlmd->cxlds->cxl_mbox, params);
> +}
> +
> +static int cxl_mem_ps_set_attrs(struct device *dev,
> +				struct cxl_patrol_scrub_context *cxl_ps_ctx,
> +				struct cxl_mailbox *cxl_mbox,
> +				struct cxl_memdev_ps_params *params,
> +				enum cxl_scrub_param param_type)
> +{
> +	struct cxl_memdev_ps_wr_attrs wr_attrs;
> +	struct cxl_memdev_ps_params rd_params;
> +	u16 return_code;
> +	int ret;
> +
> +	ret = cxl_mem_ps_get_attrs(cxl_mbox, &rd_params);
> +	if (ret) {
> +		dev_err(dev, "Get cxlmemdev patrol scrub params failed ret=%d\n",
> +			ret);
> +		return ret;
> +	}
> +
> +	switch (param_type) {
> +	case CXL_PS_PARAM_ENABLE:
> +		wr_attrs.scrub_flags = FIELD_PREP(CXL_MEMDEV_PS_FLAG_ENABLED_MASK,
> +						  params->enable);
> +		wr_attrs.scrub_cycle_hrs = FIELD_PREP(CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK,
> +						      rd_params.scrub_cycle_hrs);
> +		break;
> +	case CXL_PS_PARAM_SCRUB_CYCLE:
> +		if (params->scrub_cycle_hrs < rd_params.min_scrub_cycle_hrs) {
> +			dev_err(dev, "Invalid CXL patrol scrub cycle(%d) to set\n",
> +				params->scrub_cycle_hrs);
> +			dev_err(dev, "Minimum supported CXL patrol scrub cycle in hour %d\n",
> +				rd_params.min_scrub_cycle_hrs);
> +			return -EINVAL;
> +		}
> +		wr_attrs.scrub_cycle_hrs = FIELD_PREP(CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK,
> +						      params->scrub_cycle_hrs);
> +		wr_attrs.scrub_flags = FIELD_PREP(CXL_MEMDEV_PS_FLAG_ENABLED_MASK,
> +						  rd_params.enable);
> +		break;
> +	}
> +
> +	ret = cxl_set_feature(cxl_mbox->features, CXL_FEAT_PATROL_SCRUB_UUID,
> +			      cxl_ps_ctx->set_version,
> +			      &wr_attrs, sizeof(wr_attrs),
> +			      CXL_SET_FEAT_FLAG_DATA_SAVED_ACROSS_RESET,
> +			      0, &return_code);
> +	if (ret || return_code != CXL_MBOX_CMD_RC_SUCCESS) {
> +		dev_err(dev, "CXL patrol scrub set feature failed ret=%d return_code=%u\n",
> +			ret, return_code);

What can the admin do with this log spam? I would reconsider making all
of these dev_dbg() and improving the sysfs documentation on what error
codes mean.

[..]
> +
> +int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr)

Please separate this into a memdev helper and a region helper. It is
silly to have two arguments to a function where one is expected to be
NULL at all times, and then have an if else statement inside that to
effectively turn it back into 2 code paths.

If there is code to be shared amongst those, make *that* the shared
helper.

> +{
> +	struct edac_dev_feature ras_features[CXL_DEV_NUM_RAS_FEATURES];
> +	char cxl_dev_name[CXL_DEV_NAME_LEN];
> +	int num_ras_features = 0;
> +	u8 scrub_inst = 0;
> +	int rc;
> +
> +	rc = cxl_memdev_scrub_init(cxlmd, cxlr, &ras_features[num_ras_features],
> +				   scrub_inst);
> +	if (rc < 0)
> +		return rc;
> +
> +	scrub_inst++;
> +	num_ras_features++;
> +
> +	if (cxlr)
> +		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
> +			 "cxl_region%d", cxlr->id);

Why not pass dev_name(&cxlr->dev) directly?

> +	else
> +		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
> +			 "%s_%s", "cxl", dev_name(&cxlmd->dev));

Can a "cxl" directory be created so that the raw name can be used?

> +
> +	return edac_dev_register(&cxlmd->dev, cxl_dev_name, NULL,
> +				 num_ras_features, ras_features);

I'm so confused... a few lines down in this patch we have:

    rc = cxl_mem_ras_features_init(NULL, cxlr);

...so how can this call to edac_dev_register() unconditionally
de-reference @cxlmd?

Are there any tests for this? cxl-test is purpose-built for this kind
of basic coverage tests.

> +EXPORT_SYMBOL_NS_GPL(cxl_mem_ras_features_init, "CXL");
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index b98b1ccffd1c..c2be70cd87f8 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3449,6 +3449,12 @@ static int cxl_region_probe(struct device *dev)
>  					p->res->start, p->res->end, cxlr,
>  					is_system_ram) > 0)
>  			return 0;
> +
> +		rc = cxl_mem_ras_features_init(NULL, cxlr);
> +		if (rc)
> +			dev_warn(&cxlr->dev, "CXL RAS features init for region_id=%d failed\n",
> +				 cxlr->id);

There is more to RAS than EDAC memory scrub so this message is
misleading. It is also unnecessary because the driver continues to load
and the admin, if they care, will notice that the EDAC attributes are
missing.

> +
>  		return devm_cxl_add_dax_region(cxlr);
>  	default:
>  		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 55c55685cb39..2b02e47cd7e7 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -800,6 +800,13 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
>  int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
>  int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
>  
> +#if IS_ENABLED(CONFIG_CXL_RAS_FEATURES)
> +int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr);
> +#else
> +static inline int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr)
> +{ return 0; }
> +#endif
> +
>  #ifdef CONFIG_CXL_SUSPEND
>  void cxl_mem_active_inc(void);
>  void cxl_mem_active_dec(void);
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 2f03a4d5606e..d236b4b8a93c 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -116,6 +116,10 @@ static int cxl_mem_probe(struct device *dev)
>  	if (!cxlds->media_ready)
>  		return -EBUSY;
>  
> +	rc = cxl_mem_ras_features_init(cxlmd, NULL);
> +	if (rc)
> +		dev_warn(&cxlmd->dev, "CXL RAS features init failed\n");
> +
>  	/*
>  	 * Someone is trying to reattach this device after it lost its port
>  	 * connection (an endpoint port previously registered by this memdev was
> @@ -259,3 +263,4 @@ MODULE_ALIAS_CXL(CXL_DEVICE_MEMORY_EXPANDER);
>   * endpoint registration.
>   */
>  MODULE_SOFTDEP("pre: cxl_port");
> +MODULE_SOFTDEP("pre: cxl_features");

Why?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol scrub control feature
  2025-01-24 20:38   ` Dan Williams
@ 2025-01-27 10:06     ` Jonathan Cameron
  2025-01-27 12:53     ` Shiju Jose
  1 sibling, 0 replies; 87+ messages in thread
From: Jonathan Cameron @ 2025-01-27 10:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: shiju.jose, linux-edac, linux-cxl, linux-acpi, linux-mm,
	linux-kernel, bp, tony.luck, rafael, lenb, mchehab, dave,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny, david,
	Vilas.Sridharan, leo.duran, Yazen.Ghannam, rientjes, jiaqiyan,
	Jon.Grimm, dave.hansen, naoya.horiguchi, james.morse, jthoughton,
	somasundaram.a, erdemaktas, pgonda, duenwen, gthelen, wschwartz,
	dferguson, wbs, nifan.cxl, tanxiaofei, prime.zeng, roberto.sassu,
	kangkang.shen, wanghuiqiang, linuxarm

On Fri, 24 Jan 2025 12:38:43 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> shiju.jose@ wrote:
> > From: Shiju Jose <shiju.jose@huawei.com>
> > 
> > CXL spec 3.1 section 8.2.9.9.11.1 describes the device patrol scrub control
> > feature. The device patrol scrub proactively locates and makes corrections
> > to errors in regular cycle.
> > 
> > Allow specifying the number of hours within which the patrol scrub must be
> > completed, subject to minimum and maximum limits reported by the device.
> > Also allow disabling scrub allowing trade-off error rates against
> > performance.
> > 
> > Add support for patrol scrub control on CXL memory devices.
> > Register with the EDAC device driver, which retrieves the scrub attribute
> > descriptors from EDAC scrub and exposes the sysfs scrub control attributes
> > to userspace. For example, scrub control for the CXL memory device
> > "cxl_mem0" is exposed in /sys/bus/edac/devices/cxl_mem0/scrubX/.
> > 
> > Additionally, add support for region-based CXL memory patrol scrub control.
> > CXL memory regions may be interleaved across one or more CXL memory
> > devices. For example, region-based scrub control for "cxl_region1" is
> > exposed in /sys/bus/edac/devices/cxl_region1/scrubX/.
> > 
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> > Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Hi Dan,

A few specific replies in line. I've left the detail stuff to Shiju
to address.  Definitely a few things in there I'd missed!

Thanks,

Jonathan


> > ---
> >  Documentation/edac/scrub.rst  |  66 ++++++
> >  drivers/cxl/Kconfig           |  17 ++
> >  drivers/cxl/core/Makefile     |   1 +
> >  drivers/cxl/core/memfeature.c | 392 ++++++++++++++++++++++++++++++++++
> >  drivers/cxl/core/region.c     |   6 +
> >  drivers/cxl/cxlmem.h          |   7 +
> >  drivers/cxl/mem.c             |   5 +
> >  include/cxl/features.h        |  16 ++
> >  8 files changed, 510 insertions(+)
> >  create mode 100644 drivers/cxl/core/memfeature.c
> > diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
> > index f86645c7f0af..80e986c57885 100644
> > --- a/Documentation/edac/scrub.rst
> > +++ b/Documentation/edac/scrub.rst
> > @@ -325,3 +325,69 @@ root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_d
> >  10800
> >  
> >  root@localhost:~# echo 0 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/enable_background
> > +
> > +2. CXL memory device patrol scrubber
> > +
> > +2.1 Device based scrubbing
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/min_cycle_duration
> > +
> > +3600
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/max_cycle_duration
> > +
> > +918000
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
> > +
> > +43200
> > +
> > +root@localhost:~# echo 54000 > /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
> > +
> > +54000
> > +
> > +root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
> > +
> > +1
> > +
> > +root@localhost:~# echo 0 > /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
> > +
> > +0
> > +
> > +2.2. Region based scrubbing
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/min_cycle_duration
> > +
> > +3600
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/max_cycle_duration
> > +
> > +918000
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
> > +
> > +43200
> > +
> > +root@localhost:~# echo 54000 > /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
> > +
> > +54000
> > +
> > +root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
> > +
> > +1
> > +
> > +root@localhost:~# echo 0 > /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/enable_background  
> 
> What is this content-free blob of cat and echo statements? Please write actual
> documentation with theory of operation, clarification of assumptions,
> rationale for defaults, guidance on changing defaults... 

Note this is a continuation of existing documentation, but sure, some inline comments
talking more about it would be fine.  The rationale and top-level discussion are
meant to be described in patch 2, as they are not CXL specific.

Defaults are a device thing, there are no software driven defaults.

So I'd suggest the above just adds a few comments along the lines of
what each block does.
Something like:

Check current parameters and program the scrubbing for a region to repeat
every X seconds (% of day)
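As a worked example of that comment (hypothetical helper names; the driver's own constant for the same conversion is CXL_DEV_HOUR_IN_SECS), the seconds written via sysfs map to whole device hours and a rough percentage of a day:

```c
#include <assert.h>

/* The sysfs interface takes seconds; the device stores whole hours
 * (CXL_DEV_HOUR_IN_SECS == 3600 in the patch). */
static unsigned int secs_to_scrub_hrs(unsigned int secs)
{
	return secs / 3600;
}

/* Rough "% of day" view of a scrub cycle. */
static unsigned int cycle_pct_of_day(unsigned int secs)
{
	return secs * 100 / 86400;
}
```

So the documented `echo 54000` example programs a 15-hour cycle, i.e. scrubbing repeats over roughly 62% of a day.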


> 
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index 0bc6a2cb8474..6078f02e883b 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -154,4 +154,21 @@ config CXL_FEATURES
> >  
> >  	  If unsure say 'y'.
> >  
> > +config CXL_RAS_FEATURES
> > +	tristate "CXL: Memory RAS features"
> > +	depends on CXL_PCI  
> 
> What is the build dependency on CXL_PCI? This enabling does not call back into
> symbols provided by cxl_pci.ko does it?
> 
> > +	depends on CXL_MEM  
> 
> Similar comment, and this also goes away if all of this just moves into
> the new cxl_features driver.

I'm not sure moving this to be a child of cxl_features makes sense. We can
probably do it, but it makes the spider's web of connections even harder
to relate to the underlying hardware. In my mental model, this stuff
takes services from the 'features' and 'mailbox' parts of the CXL driver.

Take repair.   Less than half of each of those drivers is feature related
(a few 'what can I do' type aspects). The control plane goes via
maintenance commands.

Obviously we can get to those by adding additional infrastructure to the
features driver, but that seems likely to be ugly and where do we stop?
It won't scale to likely future cases where the feature part is a tiny
tweak on some much larger chunk of infrastructure (which is mostly what
spec defined features seem to be for). I don't think we want to support
the complexity of device built-in test in the 3.2 spec necessarily
(haven't really thought about it yet!) but it is an example of what would
be an EDAC feature but has no dependence on features.

We could register the patrol scrub stuff from features, and the rest
separately but that seems even more confusing.

So to me, nesting under features is an ugly solution but I'm not that
attached to current approach.

So in my view this stuff should be dependent on CXL_FEATURES but
not a child of it.


(lots skipped - I'll leave the detailed stuff to Shiju!)
> > +	else
> > +		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
> > +			 "%s_%s", "cxl", dev_name(&cxlmd->dev));  
> 
> Can a "cxl" directory be created so that the raw name can be used?

I'd like feedback from Boris on that.  It is a mess to instantiate
devices in subdirectories under a bus (that's kind of the big issue
with the EDAC usage of the device model already).

I'd say no it can't.

> 


^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol scrub control feature
  2025-01-24 20:38   ` Dan Williams
  2025-01-27 10:06     ` Jonathan Cameron
@ 2025-01-27 12:53     ` Shiju Jose
  2025-01-27 23:17       ` Dan Williams
  1 sibling, 1 reply; 87+ messages in thread
From: Shiju Jose @ 2025-01-27 12:53 UTC (permalink / raw)
  To: Dan Williams, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
  Cc: bp@alien8.de, tony.luck@intel.com, rafael@kernel.org,
	lenb@kernel.org, mchehab@kernel.org, dave@stgolabs.net,
	Jonathan Cameron, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

Hi Dan,

Thanks for the comments.

Please find reply inline.

Thanks,
Shiju
>From: Dan Williams <dan.j.williams@intel.com>
>Sent: 24 January 2025 20:39
>Subject: Re: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol
>scrub control feature
>
>shiju.jose@ wrote:
>> From: Shiju Jose <shiju.jose@huawei.com>
>>
>> CXL spec 3.1 section 8.2.9.9.11.1 describes the device patrol scrub
>> control feature. The device patrol scrub proactively locates and makes
>> corrections to errors in regular cycle.
>>
>> Allow specifying the number of hours within which the patrol scrub
>> must be completed, subject to minimum and maximum limits reported by the
>device.
>> Also allow disabling scrub allowing trade-off error rates against
>> performance.
>>
>> Add support for patrol scrub control on CXL memory devices.
>> Register with the EDAC device driver, which retrieves the scrub
>> attribute descriptors from EDAC scrub and exposes the sysfs scrub
>> control attributes to userspace. For example, scrub control for the
>> CXL memory device "cxl_mem0" is exposed in
>/sys/bus/edac/devices/cxl_mem0/scrubX/.
>>
>> Additionally, add support for region-based CXL memory patrol scrub control.
>> CXL memory regions may be interleaved across one or more CXL memory
>> devices. For example, region-based scrub control for "cxl_region1" is
>> exposed in /sys/bus/edac/devices/cxl_region1/scrubX/.
>>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
>> ---
>>  Documentation/edac/scrub.rst  |  66 ++++++
>>  drivers/cxl/Kconfig           |  17 ++
>>  drivers/cxl/core/Makefile     |   1 +
>>  drivers/cxl/core/memfeature.c | 392
>++++++++++++++++++++++++++++++++++
>>  drivers/cxl/core/region.c     |   6 +
>>  drivers/cxl/cxlmem.h          |   7 +
>>  drivers/cxl/mem.c             |   5 +
>>  include/cxl/features.h        |  16 ++
>>  8 files changed, 510 insertions(+)
>>  create mode 100644 drivers/cxl/core/memfeature.c
>> diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
>> index f86645c7f0af..80e986c57885 100644
>> --- a/Documentation/edac/scrub.rst
>> +++ b/Documentation/edac/scrub.rst
>> @@ -325,3 +325,69 @@ root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_d
>>  10800
>>
>>  root@localhost:~# echo 0 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/enable_background
>> +
>> +2. CXL memory device patrol scrubber
>> +
>> +2.1 Device based scrubbing
>> +
>> +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/min_cycle_duration
>> +
>> +3600
>> +
>> +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/max_cycle_duration
>> +
>> +918000
>> +
>> +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
>> +
>> +43200
>> +
>> +root@localhost:~# echo 54000 > /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
>> +
>> +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
>> +
>> +54000
>> +
>> +root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
>> +
>> +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
>> +
>> +1
>> +
>> +root@localhost:~# echo 0 > /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
>> +
>> +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
>> +
>> +0
>> +
>> +2.2. Region based scrubbing
>> +
>> +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/min_cycle_duration
>> +
>> +3600
>> +
>> +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/max_cycle_duration
>> +
>> +918000
>> +
>> +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
>> +
>> +43200
>> +
>> +root@localhost:~# echo 54000 > /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
>> +
>> +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
>> +
>> +54000
>> +
>> +root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
>> +
>> +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
>> +
>> +1
>> +
>> +root@localhost:~# echo 0 > /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
>> +
>> +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
>
>What is this content-free blob of cat and echo statements? Please write actual
>documentation with theory of operation, clarification of assumptions, rationale
>for defaults, guidance on changing defaults...

Jonathan already replied.

>
>> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
>> index 0bc6a2cb8474..6078f02e883b 100644
>> --- a/drivers/cxl/Kconfig
>> +++ b/drivers/cxl/Kconfig
>> @@ -154,4 +154,21 @@ config CXL_FEATURES
>>
>>  	  If unsure say 'y'.
>>
>> +config CXL_RAS_FEATURES
>> +	tristate "CXL: Memory RAS features"
>> +	depends on CXL_PCI
>
>What is the build dependency on CXL_PCI? This enabling does not call back into
>symbols provided by cxl_pci.ko does it?
Will remove it; it is not required. Initially cxl_mem_ras_features_init() was called from pci.c.

>
>> +	depends on CXL_MEM
>
>Similar comment, and this also goes away if all of this just moves into the new
>cxl_features driver.

Agree with what Jonathan said in his reply. These are RAS-specific features for CXL memory devices
and are thus added in memfeature.c.
>
>> +	depends on EDAC
>> +	help
>> +	  The CXL memory RAS feature control is optional and allows host to
>> +	  control the RAS features configurations of CXL Type 3 devices.
>> +
>> +	  It registers with the EDAC device subsystem to expose control
>> +	  attributes of CXL memory device's RAS features to the user.
>> +	  It provides interface functions to support configuring the CXL
>> +	  memory device's RAS features.
>> +	  Say 'y/m/n' to enable/disable control of the CXL.mem device's RAS features.
>> +	  See section 8.2.9.9.11 of CXL 3.1 specification for the detailed
>> +	  information of CXL memory device features.
>
>Usually the "say X" statement provides a rationale like.
>
>"Say y/m if you have an expert need to change default memory scrub rates
>established by the platform/device, otherwise say n"

Will change.

>
>> +
>>  endif
>> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
>> index 73b6348afd67..54baca513ecb 100644
>> --- a/drivers/cxl/core/Makefile
>> +++ b/drivers/cxl/core/Makefile
>> @@ -17,3 +17,4 @@ cxl_core-y += cdat.o
>>  cxl_core-y += features.o
>>  cxl_core-$(CONFIG_TRACING) += trace.o
>>  cxl_core-$(CONFIG_CXL_REGION) += region.o
>> +cxl_core-$(CONFIG_CXL_RAS_FEATURES) += memfeature.o
>> diff --git a/drivers/cxl/core/memfeature.c b/drivers/cxl/core/memfeature.c
>> new file mode 100644
>> index 000000000000..77d1bf6ce45f
>> --- /dev/null
>> +++ b/drivers/cxl/core/memfeature.c
>> @@ -0,0 +1,392 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * CXL memory RAS feature driver.
>> + *
>> + * Copyright (c) 2024 HiSilicon Limited.
>> + *
>> + *  - Supports functions to configure RAS features of the
>> + *    CXL memory devices.
>> + *  - Registers with the EDAC device subsystem driver to expose
>> + *    the features sysfs attributes to the user for configuring
>> + *    CXL memory RAS feature.
>> + */
>> +
>> +#include <linux/cleanup.h>
>> +#include <linux/edac.h>
>> +#include <linux/limits.h>
>> +#include <cxl/features.h>
>> +#include <cxl.h>
>> +#include <cxlmem.h>
>> +
>> +#define CXL_DEV_NUM_RAS_FEATURES	1
>> +#define CXL_DEV_HOUR_IN_SECS	3600
>> +
>> +#define CXL_DEV_NAME_LEN	128
>> +
>> +/* CXL memory patrol scrub control functions */
>> +struct cxl_patrol_scrub_context {
>> +	u8 instance;
>> +	u16 get_feat_size;
>> +	u16 set_feat_size;
>> +	u8 get_version;
>> +	u8 set_version;
>> +	u16 effects;
>> +	struct cxl_memdev *cxlmd;
>> +	struct cxl_region *cxlr;
>> +};
>> +
>> +/**
>> + * struct cxl_memdev_ps_params - CXL memory patrol scrub parameter data structure.
>> + * @enable:     [IN & OUT] enable(1)/disable(0) patrol scrub.
>> + * @scrub_cycle_changeable: [OUT] scrub cycle attribute of patrol scrub is changeable.
>> + * @scrub_cycle_hrs:    [IN] Requested patrol scrub cycle in hours.
>> + *                      [OUT] Current patrol scrub cycle in hours.
>> + * @min_scrub_cycle_hrs:[OUT] minimum patrol scrub cycle in hours supported.
>> + */
>> +struct cxl_memdev_ps_params {
>> +	bool enable;
>> +	bool scrub_cycle_changeable;
>> +	u8 scrub_cycle_hrs;
>> +	u8 min_scrub_cycle_hrs;
>> +};
>> +
>> +enum cxl_scrub_param {
>> +	CXL_PS_PARAM_ENABLE,
>> +	CXL_PS_PARAM_SCRUB_CYCLE,
>> +};
>> +
>> +#define CXL_MEMDEV_PS_SCRUB_CYCLE_CHANGE_CAP_MASK	BIT(0)
>> +#define	CXL_MEMDEV_PS_SCRUB_CYCLE_REALTIME_REPORT_CAP_MASK	BIT(1)
>> +#define	CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK	GENMASK(7, 0)
>> +#define	CXL_MEMDEV_PS_MIN_SCRUB_CYCLE_MASK	GENMASK(15, 8)
>> +#define	CXL_MEMDEV_PS_FLAG_ENABLED_MASK	BIT(0)
>> +
>> +struct cxl_memdev_ps_rd_attrs {
>> +	u8 scrub_cycle_cap;
>> +	__le16 scrub_cycle_hrs;
>> +	u8 scrub_flags;
>> +}  __packed;
>> +
>> +struct cxl_memdev_ps_wr_attrs {
>> +	u8 scrub_cycle_hrs;
>> +	u8 scrub_flags;
>> +}  __packed;
>
>If these are packed to match specification layout, include a specification
>reference comment.

Will add a specification reference comment. I added the same for the memory repair features
but missed it here.
>
>> +
>> +static int cxl_mem_ps_get_attrs(struct cxl_mailbox *cxl_mbox,
>> +				struct cxl_memdev_ps_params *params)
>> +{
>> +	size_t rd_data_size = sizeof(struct cxl_memdev_ps_rd_attrs);
>> +	u16 scrub_cycle_hrs;
>> +	size_t data_size;
>> +	u16 return_code;
>> +	struct cxl_memdev_ps_rd_attrs *rd_attrs __free(kfree) =
>> +						kmalloc(rd_data_size, GFP_KERNEL);
>
>I would feel better with kzalloc() if short reads are possible.

Will change to kzalloc().

>
>How big can rd_data_size get? I.e. should this be kvzalloc()?

rd_data_size is 4 bytes for the patrol scrub feature. 
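For reference, a user-space mirror of those packed payloads confirms the 4-byte read size (field layout per CXL 3.1 section 8.2.9.9.11.1; names shortened and __le16 modeled as uint16_t for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* User-space mirror of the patrol scrub read payload. */
struct ps_rd_attrs {
	uint8_t scrub_cycle_cap;
	uint16_t scrub_cycle_hrs;	/* little-endian on the wire */
	uint8_t scrub_flags;
} __attribute__((packed));

/* ...and the write payload. */
struct ps_wr_attrs {
	uint8_t scrub_cycle_hrs;
	uint8_t scrub_flags;
} __attribute__((packed));
```

Without the packed attribute the read structure would pad out to 6 bytes and no longer match the mailbox payload.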

>
>> +	if (!rd_attrs)
>> +		return -ENOMEM;
>> +
>> +	data_size = cxl_get_feature(cxl_mbox->features, CXL_FEAT_PATROL_SCRUB_UUID,
>> +				    CXL_GET_FEAT_SEL_CURRENT_VALUE,
>> +				    rd_attrs, rd_data_size, 0, &return_code);
>> +	if (!data_size || return_code != CXL_MBOX_CMD_RC_SUCCESS)
>> +		return -EIO;
>> +
>> +	params->scrub_cycle_changeable = FIELD_GET(CXL_MEMDEV_PS_SCRUB_CYCLE_CHANGE_CAP_MASK,
>> +						   rd_attrs->scrub_cycle_cap);
>> +	params->enable = FIELD_GET(CXL_MEMDEV_PS_FLAG_ENABLED_MASK,
>> +				   rd_attrs->scrub_flags);
>> +	scrub_cycle_hrs = le16_to_cpu(rd_attrs->scrub_cycle_hrs);
>> +	params->scrub_cycle_hrs = FIELD_GET(CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK,
>> +					    scrub_cycle_hrs);
>> +	params->min_scrub_cycle_hrs = FIELD_GET(CXL_MEMDEV_PS_MIN_SCRUB_CYCLE_MASK,
>> +						scrub_cycle_hrs);
>> +
>> +	return 0;
>> +}
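The FIELD_GET() extractions in the quoted function can be mimicked in user space with the same mask values (GENMASK(7, 0) and GENMASK(15, 8) from the patch); helper names here are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Same mask values as the patch's GENMASK() definitions. */
#define PS_CUR_SCRUB_CYCLE_MASK 0x00ffu	/* GENMASK(7, 0) */
#define PS_MIN_SCRUB_CYCLE_MASK 0xff00u	/* GENMASK(15, 8) */

/* Current scrub cycle in hours: low byte of the 16-bit field. */
static unsigned int ps_cur_cycle_hrs(uint16_t reg)
{
	return reg & PS_CUR_SCRUB_CYCLE_MASK;
}

/* Minimum supported scrub cycle in hours: high byte. */
static unsigned int ps_min_cycle_hrs(uint16_t reg)
{
	return (reg & PS_MIN_SCRUB_CYCLE_MASK) >> 8;
}
```

So a register value of 0x0c18 decodes to a current cycle of 24 hours with a device minimum of 12 hours.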
>> +
>> +static int cxl_ps_get_attrs(struct cxl_patrol_scrub_context *cxl_ps_ctx,
>> +			    struct cxl_memdev_ps_params *params)
>> +{
>> +	struct cxl_memdev *cxlmd;
>> +	u16 min_scrub_cycle = 0;
>> +	int i, ret;
>> +
>> +	if (cxl_ps_ctx->cxlr) {
>> +		struct cxl_region *cxlr = cxl_ps_ctx->cxlr;
>> +		struct cxl_region_params *p = &cxlr->params;
>> +
>> +		for (i = p->interleave_ways - 1; i >= 0; i--) {
>> +			struct cxl_endpoint_decoder *cxled = p->targets[i];
>
>It looks like this is called directly as a callback from EDAC. Where is the locking
>that keeps cxl_ps_ctx->cxlr valid, or p->targets content stable?
Jonathan already replied.
>
>> +
>> +			cxlmd = cxled_to_memdev(cxled);
>> +			ret = cxl_mem_ps_get_attrs(&cxlmd->cxlds->cxl_mbox, params);
>> +			if (ret)
>> +				return ret;
>> +
>> +			if (params->min_scrub_cycle_hrs > min_scrub_cycle)
>> +				min_scrub_cycle = params->min_scrub_cycle_hrs;
>> +		}
>> +		params->min_scrub_cycle_hrs = min_scrub_cycle;
>> +		return 0;
>> +	}
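The quoted loop keeps the largest per-device minimum, since a region can only honor a cycle that every member device supports. A standalone sketch of that aggregation (hypothetical name):

```c
#include <assert.h>

/* Effective region minimum = max of the member devices' minimums:
 * any shorter cycle would violate at least one device's limit. */
static unsigned int region_min_cycle_hrs(const unsigned int *dev_min, int n)
{
	unsigned int m = 0;

	for (int i = 0; i < n; i++)
		if (dev_min[i] > m)
			m = dev_min[i];
	return m;
}
```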
>> +	cxlmd = cxl_ps_ctx->cxlmd;
>> +
>> +	return cxl_mem_ps_get_attrs(&cxlmd->cxlds->cxl_mbox, params);
>> +}
>> +
>> +static int cxl_mem_ps_set_attrs(struct device *dev,
>> +				struct cxl_patrol_scrub_context *cxl_ps_ctx,
>> +				struct cxl_mailbox *cxl_mbox,
>> +				struct cxl_memdev_ps_params *params,
>> +				enum cxl_scrub_param param_type)
>> +{
>> +	struct cxl_memdev_ps_wr_attrs wr_attrs;
>> +	struct cxl_memdev_ps_params rd_params;
>> +	u16 return_code;
>> +	int ret;
>> +
>> +	ret = cxl_mem_ps_get_attrs(cxl_mbox, &rd_params);
>> +	if (ret) {
>> +		dev_err(dev, "Get cxlmemdev patrol scrub params failed ret=%d\n",
>> +			ret);
>> +		return ret;
>> +	}
>> +
>> +	switch (param_type) {
>> +	case CXL_PS_PARAM_ENABLE:
>> +		wr_attrs.scrub_flags = FIELD_PREP(CXL_MEMDEV_PS_FLAG_ENABLED_MASK,
>> +						  params->enable);
>> +		wr_attrs.scrub_cycle_hrs = FIELD_PREP(CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK,
>> +						      rd_params.scrub_cycle_hrs);
>> +		break;
>> +	case CXL_PS_PARAM_SCRUB_CYCLE:
>> +		if (params->scrub_cycle_hrs < rd_params.min_scrub_cycle_hrs) {
>> +			dev_err(dev, "Invalid CXL patrol scrub cycle(%d) to set\n",
>> +				params->scrub_cycle_hrs);
>> +			dev_err(dev, "Minimum supported CXL patrol scrub cycle in hour %d\n",
>> +				rd_params.min_scrub_cycle_hrs);
>> +			return -EINVAL;
>> +		}
>> +		wr_attrs.scrub_cycle_hrs = FIELD_PREP(CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK,
>> +						      params->scrub_cycle_hrs);
>> +		wr_attrs.scrub_flags = FIELD_PREP(CXL_MEMDEV_PS_FLAG_ENABLED_MASK,
>> +						  rd_params.enable);
>> +		break;
>> +	}
>> +
>> +	ret = cxl_set_feature(cxl_mbox->features, CXL_FEAT_PATROL_SCRUB_UUID,
>> +			      cxl_ps_ctx->set_version,
>> +			      &wr_attrs, sizeof(wr_attrs),
>> +			      CXL_SET_FEAT_FLAG_DATA_SAVED_ACROSS_RESET,
>> +			      0, &return_code);
>> +	if (ret || return_code != CXL_MBOX_CMD_RC_SUCCESS) {
>> +		dev_err(dev, "CXL patrol scrub set feature failed ret=%d return_code=%u\n",
>> +			ret, return_code);
>
>What can the admin do with this log spam? I would reconsider making all of
>these dev_dbg() and improving the sysfs documentation on what error codes
>mean.
Sure will change.
>
>[..]
>> +
>> +int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr)
>
>Please separate this into a memdev helper and a region helper. It is silly to have
>two arguments to a function where one is expected to be NULL at all times, and
>then have an if else statement inside that to effectively turn it back into 2 code
>paths.
>
>If there is code to be shared amongst those, make *that* the shared helper.
I added a single function, cxl_mem_ras_features_init(), for both memdev and region based
scrubbing because earlier feedback asked to reduce the code size.
>
>> +{
>> +	struct edac_dev_feature ras_features[CXL_DEV_NUM_RAS_FEATURES];
>> +	char cxl_dev_name[CXL_DEV_NAME_LEN];
>> +	int num_ras_features = 0;
>> +	u8 scrub_inst = 0;
>> +	int rc;
>> +
>> +	rc = cxl_memdev_scrub_init(cxlmd, cxlr, &ras_features[num_ras_features],
>> +				   scrub_inst);
>> +	if (rc < 0)
>> +		return rc;
>> +
>> +	scrub_inst++;
>> +	num_ras_features++;
>> +
>> +	if (cxlr)
>> +		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
>> +			 "cxl_region%d", cxlr->id);
>
>Why not pass dev_name(&cxlr->dev) directly?
Jonathan already replied. 
>
>> +	else
>> +		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
>> +			 "%s_%s", "cxl", dev_name(&cxlmd->dev));
>
>Can a "cxl" directory be created so that the raw name can be used?
>
>> +
>> +	return edac_dev_register(&cxlmd->dev, cxl_dev_name, NULL,
>> +				 num_ras_features, ras_features);
>
>I'm so confused... a few lines down in this patch we have:
>
>    rc = cxl_mem_ras_features_init(NULL, cxlr);
>
>...so how can this call to edac_dev_register() unconditionally de-reference
>@cxlmd?
Thanks for spotting this. It is a bug that needs fixing: the cxlmd initialization for
region-based scrubbing was previously done inside cxl_mem_ras_features_init(), and has
now moved into cxl_memdev_scrub_init().
Region-based scrubbing needs better testing because of some difficulty running
this use case in my test setup. I will check with Jonathan how to do it.
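One possible shape for the fix can be sketched in user space (stub types and a hypothetical helper name): select the registration parent from whichever handle is non-NULL instead of dereferencing cxlmd unconditionally:

```c
#include <assert.h>
#include <stddef.h>

/* Stub types; only the parent-device selection is illustrated,
 * not the real edac_dev_register() path. */
struct device { int id; };
struct cxl_memdev { struct device dev; };
struct cxl_region { struct device dev; };

/* Return the device to register EDAC features against, guarding
 * against the NULL cxlmd passed on the region path. */
static struct device *ras_parent_dev(struct cxl_memdev *cxlmd,
				     struct cxl_region *cxlr)
{
	if (cxlr)
		return &cxlr->dev;
	return cxlmd ? &cxlmd->dev : NULL;
}
```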
>
>Are there any tests for this? cxl-test is purpose-built for this kind of basic
>coverage tests.
Will check this.
>
>> +EXPORT_SYMBOL_NS_GPL(cxl_mem_ras_features_init, "CXL");
>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
>> index b98b1ccffd1c..c2be70cd87f8 100644
>> --- a/drivers/cxl/core/region.c
>> +++ b/drivers/cxl/core/region.c
>> @@ -3449,6 +3449,12 @@ static int cxl_region_probe(struct device *dev)
>>  					p->res->start, p->res->end, cxlr,
>>  					is_system_ram) > 0)
>>  			return 0;
>> +
>> +		rc = cxl_mem_ras_features_init(NULL, cxlr);
>> +		if (rc)
>> +			dev_warn(&cxlr->dev, "CXL RAS features init for region_id=%d failed\n",
>> +				 cxlr->id);
>
>There is more to RAS than EDAC memory scrub so this message is misleading. It
>is also unnecessary because the driver continues to load and the admin, if they
>care, will notice that the EDAC attributes are missing.
This message was added for debugging purposes in the CXL driver. I will change it to dev_dbg().
>
>> +
>>  		return devm_cxl_add_dax_region(cxlr);
>>  	default:
>>  		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
>> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
>> index 55c55685cb39..2b02e47cd7e7 100644
>> --- a/drivers/cxl/cxlmem.h
>> +++ b/drivers/cxl/cxlmem.h
>> @@ -800,6 +800,13 @@ int cxl_trigger_poison_list(struct cxl_memdev *cxlmd);
>>  int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
>>  int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
>>
>> +#if IS_ENABLED(CONFIG_CXL_RAS_FEATURES)
>> +int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr);
>> +#else
>> +static inline int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region *cxlr)
>> +{ return 0; }
>> +#endif
>> +
>>  #ifdef CONFIG_CXL_SUSPEND
>>  void cxl_mem_active_inc(void);
>>  void cxl_mem_active_dec(void);
>> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c index
>> 2f03a4d5606e..d236b4b8a93c 100644
>> --- a/drivers/cxl/mem.c
>> +++ b/drivers/cxl/mem.c
>> @@ -116,6 +116,10 @@ static int cxl_mem_probe(struct device *dev)
>>  	if (!cxlds->media_ready)
>>  		return -EBUSY;
>>
>> +	rc = cxl_mem_ras_features_init(cxlmd, NULL);
>> +	if (rc)
>> +		dev_warn(&cxlmd->dev, "CXL RAS features init failed\n");
>> +
>>  	/*
>>  	 * Someone is trying to reattach this device after it lost its port
>>  	 * connection (an endpoint port previously registered by this memdev
>> was @@ -259,3 +263,4 @@
>MODULE_ALIAS_CXL(CXL_DEVICE_MEMORY_EXPANDER);
>>   * endpoint registration.
>>   */
>>  MODULE_SOFTDEP("pre: cxl_port");
>> +MODULE_SOFTDEP("pre: cxl_features");
>
>Why?
This dependency is no longer required. It was added during integration testing,
when cxl_features was found not to be initialized by the time the CXL memdev RAS
features were initialized, which call the features command function,
cxl_get_supported_feature_entry(), for the RAS features. The root cause turned
out to be something else and has been fixed.

Thanks,
Shiju


^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol scrub control feature
  2025-01-27 12:53     ` Shiju Jose
@ 2025-01-27 23:17       ` Dan Williams
  2025-01-29 12:28         ` Shiju Jose
  0 siblings, 1 reply; 87+ messages in thread
From: Dan Williams @ 2025-01-27 23:17 UTC (permalink / raw)
  To: Shiju Jose, Dan Williams, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
  Cc: bp@alien8.de, tony.luck@intel.com, rafael@kernel.org,
	lenb@kernel.org, mchehab@kernel.org, dave@stgolabs.net,
	Jonathan Cameron, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

Shiju Jose wrote:
> Hi Dan,
> 
> Thanks for the comments.
> 
> Please find reply inline.
> 
> Thanks,
> Shiju
> >-----Original Message-----
> >From: Dan Williams <dan.j.williams@intel.com>
>[..]
> >Subject: Re: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol
> >scrub control feature
> >
> >shiju.jose@ wrote:
> >> From: Shiju Jose <shiju.jose@huawei.com>
> >>
> >> CXL spec 3.1 section 8.2.9.9.11.1 describes the device patrol scrub
> >> control feature. The device patrol scrub proactively locates and makes
> >> corrections to errors in regular cycle.
> >>
> >> Allow specifying the number of hours within which the patrol scrub
> >> must be completed, subject to minimum and maximum limits reported by the
> >device.
> >> Also allow disabling scrub allowing trade-off error rates against
> >> performance.
> >>
> >> Add support for patrol scrub control on CXL memory devices.
> >> Register with the EDAC device driver, which retrieves the scrub
> >> attribute descriptors from EDAC scrub and exposes the sysfs scrub
> >> control attributes to userspace. For example, scrub control for the
> >> CXL memory device "cxl_mem0" is exposed in
> >/sys/bus/edac/devices/cxl_mem0/scrubX/.
> >>
> >> Additionally, add support for region-based CXL memory patrol scrub control.
> >> CXL memory regions may be interleaved across one or more CXL memory
> >> devices. For example, region-based scrub control for "cxl_region1" is
> >> exposed in /sys/bus/edac/devices/cxl_region1/scrubX/.
> >>
> >> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> >> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> >> ---
> >>  Documentation/edac/scrub.rst  |  66 ++++++
> >>  drivers/cxl/Kconfig           |  17 ++
> >>  drivers/cxl/core/Makefile     |   1 +
> >>  drivers/cxl/core/memfeature.c | 392
> >++++++++++++++++++++++++++++++++++
> >>  drivers/cxl/core/region.c     |   6 +
> >>  drivers/cxl/cxlmem.h          |   7 +
> >>  drivers/cxl/mem.c             |   5 +
> >>  include/cxl/features.h        |  16 ++
> >>  8 files changed, 510 insertions(+)
> >>  create mode 100644 drivers/cxl/core/memfeature.c diff --git
> >> a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst index
> >> f86645c7f0af..80e986c57885 100644
> >> --- a/Documentation/edac/scrub.rst
> >> +++ b/Documentation/edac/scrub.rst
[..]
> >
> >What is this content-free blob of cat and echo statements? Please write actual
> >documentation with theory of operation, clarification of assumptions, rationale
> >for defaults, guidance on changing defaults...
> 
> Jonathan already replied.

I disagree that any of that is useful to include without rationale, and
if the rationale is already somewhere else then delete the multiple
lines showing how 'cat' and 'echo' work with sysfs.

[..]
> >> +	depends on CXL_MEM
> >
> >Similar comment, and this also goes away if all of this just moves into the new
> >cxl_features driver.
> 
> I agree with what Jonathan said in his reply. These are RAS-specific features for CXL
> memory devices and thus were added in memfeature.c

Apologies for this comment; I had meant to delete it along with some
other commentary along this theme after thinking it through.

I am now advocating that Dave drop his cxl_features driver altogether
and mirror your approach. I.e. EDAC is registered from existing CXL
drivers, and FWCTL can be registered against a cxl_memdev just like the
fw_upload ABI.

There was a concern that CXL needed a separate FWCTL driver in case
distributions wanted to have a policy against FWCTL, but given CXL
already has CONFIG_CXL_MEM_RAW_COMMANDS at compile-time and a wide array
of CXL bus devices, a cxl_features device is an awkward fit.

[..]
> >> +static int cxl_ps_get_attrs(struct cxl_patrol_scrub_context *cxl_ps_ctx,
> >> +			    struct cxl_memdev_ps_params *params) {
> >> +	struct cxl_memdev *cxlmd;
> >> +	u16 min_scrub_cycle = 0;
> >> +	int i, ret;
> >> +
> >> +	if (cxl_ps_ctx->cxlr) {
> >> +		struct cxl_region *cxlr = cxl_ps_ctx->cxlr;
> >> +		struct cxl_region_params *p = &cxlr->params;
> >> +
> >> +		for (i = p->interleave_ways - 1; i >= 0; i--) {
> >> +			struct cxl_endpoint_decoder *cxled = p->targets[i];
> >
> >It looks like this is called directly as a callback from EDAC. Where is the locking
> >that keeps cxl_ps_ctx->cxlr valid, or p->targets content stable?
> Jonathan already replied.

I could not find that comment? I *think* it's ok because when the region
is in the probe state, changes will not be made to this list, but it
would be useful to at least have commentary to that effect. Protect
against someone copying this code in isolation without considering the
context.

[..]
> >> +
> >> +int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct
> >> +cxl_region *cxlr)
> >
> >Please separate this into a memdev helper and a region helper. It is silly to have
> >two arguments to a function where one is expected to be NULL at all times, and
> >then have an if else statement inside that to effectively turn it back into 2 code
> >paths.
> >
> >If there is code to be shared amongst those, make *that* the shared helper.
> I added a single function, cxl_mem_ras_features_init(), for both memdev- and region-based
> scrubbing to reduce code size, as there was feedback asking to reduce code size.

"Succinct" and "concise" do not necessarily mean fewer lines. I would
greatly prefer a few more lines if it means not outsourcing complexity
to the calling context. Readable code means I do not need to wonder
what:

   cxl_mem_ras_features_init(NULL, cxlr)

...means. I can just read devm_cxl_region_edac_register(cxlr), and know
exactly what is happening without needing to lose my train of thought to
go read what semantics cxl_mem_ras_features_init() is implementing.

Note that all the other _init() calls in drivers/cxl/ (outside of
module_init callbacks) are just purely init work, not object
registration. Please keep that local style.

> >> +{
> >> +	struct edac_dev_feature ras_features[CXL_DEV_NUM_RAS_FEATURES];
> >> +	char cxl_dev_name[CXL_DEV_NAME_LEN];
> >> +	int num_ras_features = 0;
> >> +	u8 scrub_inst = 0;
> >> +	int rc;
> >> +
> >> +	rc = cxl_memdev_scrub_init(cxlmd, cxlr,
> >&ras_features[num_ras_features],
> >> +				   scrub_inst);
> >> +	if (rc < 0)
> >> +		return rc;
> >> +
> >> +	scrub_inst++;
> >> +	num_ras_features++;
> >> +
> >> +	if (cxlr)
> >> +		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
> >> +			 "cxl_region%d", cxlr->id);
> >
> >Why not pass dev_name(&cxlr->dev) directly?
> Jonathan already replied. 

That was purely about the cxl_mem observation; cxlr can be passed
directly.

> >
> >> +	else
> >> +		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
> >> +			 "%s_%s", "cxl", dev_name(&cxlmd->dev));
> >
> >Can a "cxl" directory be created so that the raw name can be used?

In fact we already do something similar for CONFIG_HMEM_REPORTING (i.e.
an "access%d" device to create a named directory of attributes) so it
is a question for Boris if he wants to tolerate a parent "cxl" device to
parent all CXL objects in EDAC.

> >
> >> +
> >> +	return edac_dev_register(&cxlmd->dev, cxl_dev_name, NULL,
> >> +				 num_ras_features, ras_features);
> >
> >I'm so confused... a few lines down in this patch we have:
> >
> >    rc = cxl_mem_ras_features_init(NULL, cxlr);
> >
> >...so how can this call to edac_dev_register() unconditionally de-reference
> >@cxlmd?
> Thanks for spotting this. It is a bug, need to fix.


[..]
> >> +EXPORT_SYMBOL_NS_GPL(cxl_mem_ras_features_init, "CXL");
> >> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> >> index b98b1ccffd1c..c2be70cd87f8 100644
> >> --- a/drivers/cxl/core/region.c
> >> +++ b/drivers/cxl/core/region.c
> >> @@ -3449,6 +3449,12 @@ static int cxl_region_probe(struct device *dev)
> >>  					p->res->start, p->res->end, cxlr,
> >>  					is_system_ram) > 0)
> >>  			return 0;
> >> +
> >> +		rc = cxl_mem_ras_features_init(NULL, cxlr);
> >> +		if (rc)
> >> +			dev_warn(&cxlr->dev, "CXL RAS features init for
> >region_id=%d failed\n",
> >> +				 cxlr->id);
> >
> >There is more to RAS than EDAC memory scrub so this message is misleading. It
> >is also unnecessary because the driver continues to load and the admin, if they
> >care, will notice that the EDAC attributes are missing.
> This message was added for the debugging purpose in CXL driver. I will change to  dev_dbg().

...but also stop calling this functionality with the blanket term "RAS".
It is "EDAC scrub and repair extensions to all the other RAS
functionality the CXL subsystem handles directly", name it accordingly.


* RE: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol scrub control feature
  2025-01-27 23:17       ` Dan Williams
@ 2025-01-29 12:28         ` Shiju Jose
  0 siblings, 0 replies; 87+ messages in thread
From: Shiju Jose @ 2025-01-29 12:28 UTC (permalink / raw)
  To: Dan Williams, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
  Cc: bp@alien8.de, tony.luck@intel.com, rafael@kernel.org,
	lenb@kernel.org, mchehab@kernel.org, dave@stgolabs.net,
	Jonathan Cameron, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm

>-----Original Message-----
>From: Dan Williams <dan.j.williams@intel.com>
>[..]
>Subject: RE: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol
>scrub control feature
>
>Shiju Jose wrote:
>> Hi Dan,
>>
>> Thanks for the comments.
>>
>> Please find reply inline.
>>
>> Thanks,
>> Shiju
>> >-----Original Message-----
>> >From: Dan Williams <dan.j.williams@intel.com>
>> >[..]
>> >Subject: Re: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device
>> >patrol scrub control feature
>> >
>> >shiju.jose@ wrote:
>> >> From: Shiju Jose <shiju.jose@huawei.com>
>> >>
>> >> CXL spec 3.1 section 8.2.9.9.11.1 describes the device patrol scrub
>> >> control feature. The device patrol scrub proactively locates and
>> >> makes corrections to errors in regular cycle.
>> >>
>> >> Allow specifying the number of hours within which the patrol scrub
>> >> must be completed, subject to minimum and maximum limits reported
>> >> by the
>> >device.
>> >> Also allow disabling scrub allowing trade-off error rates against
>> >> performance.
>> >>
>> >> Add support for patrol scrub control on CXL memory devices.
>> >> Register with the EDAC device driver, which retrieves the scrub
>> >> attribute descriptors from EDAC scrub and exposes the sysfs scrub
>> >> control attributes to userspace. For example, scrub control for the
>> >> CXL memory device "cxl_mem0" is exposed in
>> >/sys/bus/edac/devices/cxl_mem0/scrubX/.
>> >>
>> >> Additionally, add support for region-based CXL memory patrol scrub
>control.
>> >> CXL memory regions may be interleaved across one or more CXL memory
>> >> devices. For example, region-based scrub control for "cxl_region1"
>> >> is exposed in /sys/bus/edac/devices/cxl_region1/scrubX/.
>> >>
>> >> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>> >> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> >> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> >> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
>> >> ---
>> >>  Documentation/edac/scrub.rst  |  66 ++++++
>> >>  drivers/cxl/Kconfig           |  17 ++
>> >>  drivers/cxl/core/Makefile     |   1 +
>> >>  drivers/cxl/core/memfeature.c | 392
>> >++++++++++++++++++++++++++++++++++
>> >>  drivers/cxl/core/region.c     |   6 +
>> >>  drivers/cxl/cxlmem.h          |   7 +
>> >>  drivers/cxl/mem.c             |   5 +
>> >>  include/cxl/features.h        |  16 ++
>> >>  8 files changed, 510 insertions(+)  create mode 100644
>> >> drivers/cxl/core/memfeature.c diff --git
>> >> a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst index
>> >> f86645c7f0af..80e986c57885 100644
>> >> --- a/Documentation/edac/scrub.rst
>> >> +++ b/Documentation/edac/scrub.rst
>[..]
>> >
>> >What is this content-free blob of cat and echo statements? Please
>> >write actual documentation with theory of operation, clarification of
>> >assumptions, rationale for defaults, guidance on changing defaults...
>>
>> Jonathan already replied.
>
>I disagree that any of that is useful to include without rationale, and if the
>rationale is already somewhere else then delete the multiple lines showing
>how 'cat' and 'echo' work with sysfs.
I will discuss with Jonathan how to modify this.

>
>[..]
>> >> +	depends on CXL_MEM
>> >
>> >Similar comment, and this also goes away if all of this just moves
>> >into the new cxl_features driver.
>>
>> I agree with what Jonathan said in his reply. These are RAS-specific
>> features for CXL memory devices and thus were added in memfeature.c
>
>Apologies for this comment; I had meant to delete it along with some other
>commentary along this theme after thinking it through.
>
>I am now advocating that Dave drop his cxl_features driver altogether and
>mirror your approach. I.e. EDAC is registered from existing CXL drivers, and
>FWCTL can be registered against a cxl_memdev just like the fw_upload ABI.
>
>There was a concern that CXL needed a separate FWCTL driver in case
>distributions wanted to have a policy against FWCTL, but given CXL already has
>CONFIG_CXL_MEM_RAW_COMMANDS at compile-time and a wide array of CXL
>bus devices, a cxl_features device is an awkward fit.
Ok. 
>
>[..]
>> >> +static int cxl_ps_get_attrs(struct cxl_patrol_scrub_context *cxl_ps_ctx,
>> >> +			    struct cxl_memdev_ps_params *params) {
>> >> +	struct cxl_memdev *cxlmd;
>> >> +	u16 min_scrub_cycle = 0;
>> >> +	int i, ret;
>> >> +
>> >> +	if (cxl_ps_ctx->cxlr) {
>> >> +		struct cxl_region *cxlr = cxl_ps_ctx->cxlr;
>> >> +		struct cxl_region_params *p = &cxlr->params;
>> >> +
>> >> +		for (i = p->interleave_ways - 1; i >= 0; i--) {
>> >> +			struct cxl_endpoint_decoder *cxled = p->targets[i];
>> >
>> >It looks like this is called directly as a callback from EDAC. Where
>> >is the locking that keeps cxl_ps_ctx->cxlr valid, or p->targets content stable?
>> Jonathan already replied.
>
>I could not find that comment? I *think* it's ok because when the region is in the
>probe state, changes will not be made to this list, but it would be useful to at
>least have commentary to that effect. Protect against someone copying this
>code in isolation without considering the context.
Sure. Will do.
>
>[..]
>> >> +
>> >> +int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct
>> >> +cxl_region *cxlr)
>> >
>> >Please separate this into a memdev helper and a region helper. It is
>> >silly to have two arguments to a function where one is expected to be
>> >NULL at all times, and then have an if else statement inside that to
>> >effectively turn it back into 2 code paths.
>> >
>> >If there is code to be shared amongst those, make *that* the shared helper.
>> I added a single function, cxl_mem_ras_features_init(), for both memdev-
>> and region-based scrubbing to reduce code size, as there was feedback asking
>to reduce code size.
>
>"Succinct" and "concise" do not necessarily mean fewer lines. I would greatly
>prefer a few more lines if it means not outsourcing complexity to the calling
>context. Readable code means I do not need to wonder
>what:
>
>   cxl_mem_ras_features_init(NULL, cxlr)
>
>...means. I can just read devm_cxl_region_edac_register(cxlr), and know exactly
>what is happening without needing to lose my train of thought to go read what
>semantics cxl_mem_ras_features_init() is implementing.
>
>Note that all the other _init() calls in drivers/cxl/ (outside of module_init
>callbacks) are just purely init work, not object registration. Please keep that
>local style.
Sure. I will add separate functions for region-based EDAC registration.
>
>> >> +{
>> >> +	struct edac_dev_feature ras_features[CXL_DEV_NUM_RAS_FEATURES];
>> >> +	char cxl_dev_name[CXL_DEV_NAME_LEN];
>> >> +	int num_ras_features = 0;
>> >> +	u8 scrub_inst = 0;
>> >> +	int rc;
>> >> +
>> >> +	rc = cxl_memdev_scrub_init(cxlmd, cxlr,
>> >&ras_features[num_ras_features],
>> >> +				   scrub_inst);
>> >> +	if (rc < 0)
>> >> +		return rc;
>> >> +
>> >> +	scrub_inst++;
>> >> +	num_ras_features++;
>> >> +
>> >> +	if (cxlr)
>> >> +		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
>> >> +			 "cxl_region%d", cxlr->id);
>> >
>> >Why not pass dev_name(&cxlr->dev) directly?
>> Jonathan already replied.
>
>That was purely about the cxl_mem observation; cxlr can be passed directly.
Will check.
>
>> >
>> >> +	else
>> >> +		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
>> >> +			 "%s_%s", "cxl", dev_name(&cxlmd->dev));
>> >
>> >Can a "cxl" directory be created so that the raw name can be used?
>
>In fact we already do something similar for CONFIG_HMEM_REPORTING (i.e.
>an "access%d" device to create a named directory of attributes) so it is a
>question for Boris if he wants to tolerate a parent "cxl" device to parent all CXL
>objects in EDAC.
>
>> >
>> >> +
>> >> +	return edac_dev_register(&cxlmd->dev, cxl_dev_name, NULL,
>> >> +				 num_ras_features, ras_features);
>> >
>> >I'm so confused... a few lines down in this patch we have:
>> >
>> >    rc = cxl_mem_ras_features_init(NULL, cxlr);
>> >
>> >...so how can this call to edac_dev_register() unconditionally
>> >de-reference @cxlmd?
>> Thanks for spotting this. It is a bug, need to fix.
>
>
>[..]
>> >> +EXPORT_SYMBOL_NS_GPL(cxl_mem_ras_features_init, "CXL");
>> >> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
>> >> index b98b1ccffd1c..c2be70cd87f8 100644
>> >> --- a/drivers/cxl/core/region.c
>> >> +++ b/drivers/cxl/core/region.c
>> >> @@ -3449,6 +3449,12 @@ static int cxl_region_probe(struct device *dev)
>> >>  					p->res->start, p->res->end, cxlr,
>> >>  					is_system_ram) > 0)
>> >>  			return 0;
>> >> +
>> >> +		rc = cxl_mem_ras_features_init(NULL, cxlr);
>> >> +		if (rc)
>> >> +			dev_warn(&cxlr->dev, "CXL RAS features init for
>> >region_id=%d failed\n",
>> >> +				 cxlr->id);
>> >
>> >There is more to RAS than EDAC memory scrub so this message is
>> >misleading. It is also unnecessary because the driver continues to
>> >load and the admin, if they care, will notice that the EDAC attributes are
>missing.
>> This message was added for debugging purposes in the CXL driver. I will change
>it to dev_dbg().
>
>...but also stop calling this functionality with the blanket term "RAS".
>It is "EDAC scrub and repair extensions to all the other RAS functionality the CXL
>subsystem handles directly", name it accordingly.
Sure. Will change.

Thanks,
Shiju



* Re: [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers
  2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
                   ` (19 preceding siblings ...)
  2025-01-13 14:46 ` [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers Mauro Carvalho Chehab
@ 2025-01-30 19:18 ` Daniel Ferguson
  2025-02-03  9:25   ` Shiju Jose
  20 siblings, 1 reply; 87+ messages in thread
From: Daniel Ferguson @ 2025-01-30 19:18 UTC (permalink / raw)
  To: shiju.jose, linux-edac, linux-cxl, linux-acpi, linux-mm,
	linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

Hi Shiju,

I've tested the Scrub-specific pieces and the EDAC infrastructure pieces (as far
as they relate to the Scrub pieces). I am using an ARM64 platform for this
testing. I would like to offer my Tested-by for the pieces I have personal
experience with. I will send them as replies to their respective patches.

Thank you,
~Daniel


* Re: [PATCH v18 01/19] EDAC: Add support for EDAC device features control
  2025-01-06 12:09 ` [PATCH v18 01/19] EDAC: Add support for EDAC device features control shiju.jose
  2025-01-06 13:37   ` Borislav Petkov
  2025-01-13 15:06   ` Mauro Carvalho Chehab
@ 2025-01-30 19:18   ` Daniel Ferguson
  2 siblings, 0 replies; 87+ messages in thread
From: Daniel Ferguson @ 2025-01-30 19:18 UTC (permalink / raw)
  To: shiju.jose, linux-edac, linux-cxl, linux-acpi, linux-mm,
	linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com> # arm64



* Re: [PATCH v18 02/19] EDAC: Add scrub control feature
  2025-01-06 12:09 ` [PATCH v18 02/19] EDAC: Add scrub control feature shiju.jose
  2025-01-06 15:57   ` Borislav Petkov
  2025-01-13 15:50   ` Mauro Carvalho Chehab
@ 2025-01-30 19:18   ` Daniel Ferguson
  2 siblings, 0 replies; 87+ messages in thread
From: Daniel Ferguson @ 2025-01-30 19:18 UTC (permalink / raw)
  To: shiju.jose, linux-edac, linux-cxl, linux-acpi, linux-mm,
	linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com> # arm64



* Re: [PATCH v18 05/19] ACPI:RAS2: Add ACPI RAS2 driver
  2025-01-06 12:10 ` [PATCH v18 05/19] ACPI:RAS2: Add ACPI RAS2 driver shiju.jose
  2025-01-21 23:01   ` Daniel Ferguson
@ 2025-01-30 19:19   ` Daniel Ferguson
  1 sibling, 0 replies; 87+ messages in thread
From: Daniel Ferguson @ 2025-01-30 19:19 UTC (permalink / raw)
  To: shiju.jose, linux-edac, linux-cxl, linux-acpi, linux-mm,
	linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com> # arm64



* Re: [PATCH v18 06/19] ras: mem: Add memory ACPI RAS2 driver
  2025-01-06 12:10 ` [PATCH v18 06/19] ras: mem: Add memory " shiju.jose
  2025-01-21 23:01   ` Daniel Ferguson
@ 2025-01-30 19:19   ` Daniel Ferguson
  1 sibling, 0 replies; 87+ messages in thread
From: Daniel Ferguson @ 2025-01-30 19:19 UTC (permalink / raw)
  To: shiju.jose, linux-edac, linux-cxl, linux-acpi, linux-mm,
	linux-kernel
  Cc: bp, tony.luck, rafael, lenb, mchehab, dan.j.williams, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, david, Vilas.Sridharan, leo.duran, Yazen.Ghannam,
	rientjes, jiaqiyan, Jon.Grimm, dave.hansen, naoya.horiguchi,
	james.morse, jthoughton, somasundaram.a, erdemaktas, pgonda,
	duenwen, gthelen, wschwartz, dferguson, wbs, nifan.cxl,
	tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
	wanghuiqiang, linuxarm

Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com> # arm64



* RE: [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers
  2025-01-30 19:18 ` Daniel Ferguson
@ 2025-02-03  9:25   ` Shiju Jose
  0 siblings, 0 replies; 87+ messages in thread
From: Shiju Jose @ 2025-02-03  9:25 UTC (permalink / raw)
  To: Daniel Ferguson, linux-edac@vger.kernel.org,
	linux-cxl@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
  Cc: bp@alien8.de, tony.luck@intel.com, rafael@kernel.org,
	lenb@kernel.org, mchehab@kernel.org, dan.j.williams@intel.com,
	dave@stgolabs.net, Jonathan Cameron, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm



>-----Original Message-----
>From: Daniel Ferguson <danielf@os.amperecomputing.com>
>Sent: 30 January 2025 19:18
>To: Shiju Jose <shiju.jose@huawei.com>; linux-edac@vger.kernel.org; linux-
>cxl@vger.kernel.org; linux-acpi@vger.kernel.org; linux-mm@kvack.org; linux-
>kernel@vger.kernel.org
>Cc: bp@alien8.de; tony.luck@intel.com; rafael@kernel.org; lenb@kernel.org;
>mchehab@kernel.org; dan.j.williams@intel.com; dave@stgolabs.net; Jonathan
>Cameron <jonathan.cameron@huawei.com>; dave.jiang@intel.com;
>alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
>david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com;
>Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com;
>Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
>naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com;
>somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com;
>duenwen@google.com; gthelen@google.com;
>wschwartz@amperecomputing.com; dferguson@amperecomputing.com;
>wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei
><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto
>Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com;
>wanghuiqiang <wanghuiqiang@huawei.com>; Linuxarm
><linuxarm@huawei.com>
>Subject: Re: [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS
>control feature driver + CXL/ACPI-RAS2 drivers
>
>Hi Shiju,
>
>I've tested the Scrub specific pieces and the EDAC infrastructure pieces(as far as
>how it relates to the Scrub pieces). I am using an ARM64 platform for this
>testing. I would like to offer my tested-by to those pieces I have personal
>experience with. I will send them as replies to their respective patches.

Hi Daniel,

Thanks for testing the EDAC infrastructure for the scrubbing feature.
I will add your Tested-by tag.

Thanks,
Shiju
>
>Thank you,
>~Daniel

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-01-22 19:09                           ` Borislav Petkov
@ 2025-02-06 13:39                             ` Jonathan Cameron
  2025-02-17 13:23                               ` Borislav Petkov
  0 siblings, 1 reply; 87+ messages in thread
From: Jonathan Cameron @ 2025-02-06 13:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm, Vandana Salve

On Wed, 22 Jan 2025 20:09:17 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Tue, Jan 21, 2025 at 06:16:32PM +0000, Jonathan Cameron wrote:
> > Clearly we need to provide more evidence of use cases: 'Show us your code'
> > seems to apply here.  We'll do that over the next few weeks.  
> 
> Thanks.

Shiju is just finalizing a v19 + the userspace code, so it may make
sense to read this reply only after that is out!

> 
> > based on simple algorithms applied to the data RAS Daemon already has.
> > The interface for the reasons discussed in the long thread with Dan
> > is the minimum required to provide the information needed to allow
> > for two use cases.  We enumerated them explicitly in the discussion with
> > Dan because they possibly affected 'safety'.
> > 
> > 1) Power up, pre memory online, (typically non persistent) repair of
> >    known bad memory.  
> 
> Lemme make sure I understand this: during boot you simply know from somewhere
> that a certain rank (let's use rank for simplicity's sake) is faulty. Before
> you online the memory, you simply replace that rank in the logic so that the
> system uses the spare rank while the faulty rank is disabled.

Yes.

> 
> >    There are two interface options for this, inject the prior mapping from
> >    device physical address space (host address is not necessarily relevant
> >    here as no address decoders have been programmed yet in CXL - that
> >    happens as part of the flow to bring the memory up), or use the
> >    information that userspace already has (bank, rank etc) to select what
> >    memory is to be replaced with spare capacity.  
> 
> Ok, so this is all CXL-specific because this use case relies on userspace
> being present. Which means you cannot really use this for DIMMs used during
> boot. So if DIMMs, those should be online-able later, when userspace is there.

In this case yes, this isn't appropriate for general purpose DIMMs.
It's not CXL-specific as such, but CXL is the most common type of device
providing expansion / memory pooling type facilities today, and those
are the most common forms of memory not needed at boot (I don't think
NVDIMMs ever supported PPR etc).

> 
> >    Given the injection interface and the repair interface have to
> >    convey the same data, the interface complexity is identical and
> >    we might as well have a single step 'repair' rather than
> >      1. Inject prior records then  
> 
> What exactly is this injecting? The faulty rank? Which then would cause the
> respective driver to go and do that repairing.

For this comment I was referring to letting the kernel do the
stats gathering etc. We would need to put back records from a previous boot,
which requires almost the same interface as just telling it to repair.
Note the address to physical memory mapping is not stable across boots,
so we can't just provide a physical address; we need the full description.
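To illustrate, a minimal sketch of what a "full description" might look like (the field names here are hypothetical illustrations, not the actual EDAC/CXL attribute names):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RepairRecord:
    """Topology description that stays meaningful across boots, unlike a
    bare physical address whose mapping can change on every boot."""
    channel: int
    rank: int
    bank_group: int
    bank: int
    row: int
    column: int

rec = RepairRecord(channel=0, rank=1, bank_group=2, bank=3, row=0x1A2B, column=7)
print(asdict(rec))  # everything needed to identify the same memory next boot
```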

> 
> Which then means that you can online that device after rasdaemon has loaded
> and has the required info to online it.
> 
> Which then means, rasdaemon needs to be part of the device onlining process.

Yes, for this flow (sort of). The PoC / proposal is a boot script rather
than rasdaemon, but it uses the rasdaemon DB.

It would definitely be interesting to explore other options in addition to
this.  Perhaps we get some firmware interface standardization to ask the
firmware if it can repair stuff before the memory is in use.  Might work
for cases where the firmware knows enough about them to do the repair.
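As a sketch of how such a boot script might turn prior-boot records from the rasdaemon DB into repair requests: the table layout and sysfs attribute names below are invented for illustration; the real rasdaemon schema and EDAC sysfs layout differ.

```python
import sqlite3

def planned_repair_writes(conn):
    """Turn prior-boot repair records into an ordered list of
    (sysfs_attribute, value) writes; the final 'repair' write triggers it."""
    writes = []
    for chan, rank, bank, row in conn.execute(
            "SELECT channel, rank, bank, row FROM repair_records"):
        writes += [("channel", str(chan)), ("rank", str(rank)),
                   ("bank", str(bank)), ("row", hex(row)),
                   ("repair", "1")]
    return writes

# Stand-in for the rasdaemon DB, populated with one prior-boot record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repair_records (channel, rank, bank, row)")
conn.execute("INSERT INTO repair_records VALUES (0, 1, 3, 42)")
print(planned_repair_writes(conn))
```

A real script would perform these writes against the repair feature's sysfs directory before onlining the memory.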

> 
> I'm simply conjecturing here - I guess I'll see your detailed use case later.
> 
> >      2. Pass a physical address that is matched to one of those records.  
> 
> I don't know what that one does.

This was about practical restrictions on interface simplification.
If we have prior records pushed back into the kernel, we could then 'find' the
data we need to repair by looking it up by physical address.  In v19 we do
that for records from this boot if the memory is online.  This is
part of the sanity check Dan asked for to harden against userspace
repairing something based on stale data, but it is specific to the CXL
driver. More generally that sanity check may not be needed if PA to
actual memory mapping is a stable thing.
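The shape of that sanity check can be sketched as follows (a toy model only; the real check lives in the CXL driver and matches the spec-defined record fields):

```python
def repair_allowed(request, records_this_boot):
    """Allow an online repair only if an error record seen during *this*
    boot matches both the physical address and the topology, so userspace
    cannot trigger a repair from stale prior-boot data."""
    return any(rec["phys_addr"] == request["phys_addr"]
               and rec["topology"] == request["topology"]
               for rec in records_this_boot)

records = [{"phys_addr": 0x4000, "topology": ("rank1", "bank3")}]
print(repair_allowed({"phys_addr": 0x4000, "topology": ("rank1", "bank3")}, records))
print(repair_allowed({"phys_addr": 0x8000, "topology": ("rank1", "bank3")}, records))
```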
> 
> >    There are no security related concerns here as we always treat this
> >    as new memory and zero it etc as part of onlining.  
> 
> Right, goes without saying.
> 
> > 2) Online case.  Here the restriction Dan proposed was that we 'check'
> >    that we have seen an error record on this boot that matches the full
> >    description.  That is matching both the physical address and the
> >    topology (as that mapping can change from boot to boot, but not whilst
> >    the memory is in use). This doesn't prevent any use case we have
> >    come up with yet because, if we are making a post initial onlining
> >    decision to repair we can assume there is a new error record that
> >    provided new information on which we are acting.  Hence the kernel
> >    had the information to check.
> > 
> >    Whilst I wasn't convinced that we had a definite security
> >    problem without this protection, it requires minimal changes and doesn't
> >    block the flows we care about so we are fine with adding this check.  
> 
> I need more detail on that 2nd case - lemme read that other subthread.
> 
> > Ok. We'll put together an example script / RASdaemon code to show how
> > it is used. I think you may be surprised at how simple this is and hopefully
> > that will show that the interface is appropriate.  
> 
> That sounds good, thanks.
> 
> > This we disagree on. For this persistent case in particular these are limited
> > resources. Once you have used them all you can't do it again.  Using them
> > carefully is key. An exception is mentioned below as a possible extension but
> > it relies on a specific subset of allowed device functionality and only
> > covers some use cases (so it's an extra, not a replacement for what this
> > set does).  
> 
> By "this persistent case" you mean collecting logs per error address,
> collating them and massaging them or hunting them through a neural network to
> recognize potential patterns and then act upon them?

Ah, no, not that. I was just meaning the case where it is hard PPR (hence
persistent for all time). Once you've done it you can't go back, so after
N uses any more errors mean you need a new device ASAP. That is a decision
with a very different threshold to soft PPR, where it's a case of doing it
until you run out of spares, then falling back to offlining pages.
Next boot you get your spares back again and may use them differently
this time.
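The soft-PPR side of that trade-off is simple enough to sketch (a policy toy, not real driver logic):

```python
def repair_action(spares_left):
    """Soft-PPR fallback sketch: consume a spare while any remain; once
    exhausted, fall back to offlining the page.  Unlike hard PPR, the
    spares come back at the next boot."""
    if spares_left > 0:
        return ("soft_ppr", spares_left - 1)
    return ("offline_page", 0)

print(repair_action(2))  # ('soft_ppr', 1)
print(repair_action(0))  # ('offline_page', 0)
```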

> 
> In any case, I don't mean that - I mean something simple like: "after X errors
> on address Y, offline page Z." Like we do with .../ras/cec.c. Ofc you can't
> put really complex handling in the kernel and why would you - it must be *the*
> best thing after sliced bread to impose that on everyone.
> 
> All I'm saying is, simple logic like that can be in the kernel if it is useful
> in the general case. You don't *have* to carry all logic in some userspace
> daemon - the kernel can be smart too :-)

True enough. I'm not against doing things in kernel in some cases.  Even
then I want the controls to allow user space to do more complex things.
Even in the cases where the device suggests repair, we may not want to,
for reasons the device can't know about.
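For comparison, the kind of simple in-kernel logic Borislav describes (loosely modeled on drivers/ras/cec.c; the threshold here is an arbitrary illustration) can be sketched as:

```python
from collections import Counter

class SimpleCEC:
    """Toy 'after N errors on a page, offline it' policy."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = Counter()
        self.offlined = set()

    def record_error(self, pfn):
        """Returns True when this error pushes the page over the threshold."""
        if pfn in self.offlined:
            return False
        self.counts[pfn] += 1
        if self.counts[pfn] >= self.threshold:
            self.offlined.add(pfn)  # the kernel would soft-offline here
            return True
        return False

cec = SimpleCEC(threshold=3)
results = [cec.record_error(0x1000) for _ in range(3)]
print(results)  # [False, False, True]
```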

> 
> > With the decision algorithms in userspace, we can design the userspace to kernel
> > interface because we don't care about the algorithm choice - only what it needs
> > to control which is well defined. Algorithms will start simple and then
> > we'll iterate but it won't need changes in this interface because none of it
> > is connected to how we use the data.  
> 
> Are you saying that this interface you have right now is the necessary and
> sufficient set of sysfs nodes which will be enough for most algorithms in
> userspace?
> 
> And you won't have to change it because you realize down the road that it is
> not enough?
> 

The interface provides all the data, and all the controls to match.

Sure, something new might come along that needs additional controls (subchannel
for DDR5 showed up recently, for instance, and is in v19) but that extension
should be easy and fit within the ABI.  Those new 'features' will need
kernel changes and matching rasdaemon changes anyway, as there is new data
in the error records, so this sort of extension should be fine.


> > In general an ABI that is used is cast in stone. To my understanding there
> > is nothing special about debugfs.  If we introduce a regression in tooling
> > that uses that interface are we actually any better off than sysfs?
> > https://lwn.net/Articles/309298/ was a good article on this a while back.
> > 
> > Maybe there has been a change of opinion on this that I missed.  
> 
> I don't think so and I can see that article's point. So let's cut to the
> chase: what are we going to do when the sysfs or debugfs nodes you've added
> become insufficient and you or someone else needs to change them in the
> future, for their specific use case?
> 
> The last paragraph of that article basically sums it up pretty nicely.

Agreed. We need an interface we can support indefinitely - there is nothing
different between doing it in sysfs or debugfs. That should be
extensible in a clean fashion to support new data and matching control.

We don't have to guarantee that the interface supports something 'new' though,
as our crystal balls aren't perfect, but we do want to make extending it to
cover the new straightforward.

> 
> > Absolutely though the performance impact of punching holes in memory over
> > time is getting some cloud folk pushing back because they can't get their
> > 1GIB pages to put under a VM.  Mind you that's not particularly relevant
> > to this thread.  
> 
> What is relevant to this thread is the fact that you can't simply reboot as
> a RAS recovery action. Not in all cases.

Agreed.  We still have the option to soft offline the memory if there is no
other choice.

> 
> > For this we'll do as we did for scrub control and send a patch set adding tooling
> > to RASdaemon and/or if more appropriate a script along side it.  My fault,
> > I falsely thought this one was more obvious and we could leave that until
> > this landed. Seems not!  
> 
> Sorry, I can't always guess the use case by looking solely at the sysfs nodes.
> 
> > This I agree on. However, if CXL takes off (and there seems to be agreement
> > it will to some degree at least) then this interface is fully general for any spec
> > compliant device.  
> 
> Ok, sounds good.
> 
> > Sure. We can definitely do that.  We have this split in v19 (just undergoing
> > some final docs tidy up etc, should be posted soon).  
> 
> Thx.
> 
> You don't have to rush it - we have merge window anyway.
> 
> > Early devices and the ones in a few years time may make different
> > decisions on this. All options are covered by this driver (autonomous
> > repair is covered for free as nothing to do!)  
> 
> Don't forget devices which deviate from the spec because they were implemented
> wrong. It happens and we have to support them because no one else cares but
> people have already paid for them and want to use them.

Mostly for CXL stuff we've so far avoided that in the upstream code,
but it is indeed the case that quirks will turn up :(

I'd hope the majority can be handled in the CXL specific driver or
by massaging the error records on their way out of the kernel.
For now we have constrained records to have to be complete as defined
by the spec, but I can definitely see we might have a 1 rank only device
that doesn't set the valid bit in the error record for rank.
In discussion with Shiju, we decided that's a case we'll solve in the
driver if it turns out to be relevant.  Maybe we'd quirk the error
report to fill in the missing data for that device, or maybe we'd
relax the constraints on parameters when doing an online repair.
That's a question to resolve if anyone ever builds it!

> 
> > CXL is not vendor specific. Our other driver that I keep referring
> > to as 'coming soon' is though.  I'll see if I can get a few memory
> > device manufacturers to specifically stick their hands up that they
> > care about this. As an example we presented on this topic with
> > Micron at the LPC CXL uconf (+CC Vandana).  I don't have access
> > to Micron parts so this isn't just Huawei using Micron, we simply had two
> > proposals on the same topic so combined the sessions.  We have a CXL
> > open source sync call in an hour so I'll ask there.  
> 
> Having hw vendors agree on a single driver and Linux implementing it would be
> ofc optimal.
> 
> > Maybe for the follow on topic of non persistent repair as a path to
> > avoid offlining memory detected as bad. Maybe that counts
> > as generalization (rather than extension).  But that's not covering
> > our usecase of restablishing the offline at boot, or the persistent
> > usecases.  So it's a value add feature for a follow up effort,
> > not a baseline one which is the intent of this patch set.  
> 
> Ok, I think this whole pile should simply be in two parts: generic, CXL-spec
> implementing, vendor-agnostic pieces and vendor-specific drivers which use
> that.

There is room for vendor-specific drivers in that longer term and it
is happening: the CXL core is moving from just supporting spec
compliant type 3 devices (the ones that use the class code)
to supporting accelerators (network cards, GPUs etc with local memory).
What we have here is built on the CXL core, which provides services to
all those drivers. The actual opt-in etc is coming from the type
3 device driver calling the registration call after finding the
hardware reports the relevant commands.

Overall the CXL subsystem is evolving to allow more reuse for accelerators.
Until recently the only public devices were type 3 memory-only, so it
was a guessing game on where those splits should be.  Alejandro and others
are now fleshing that out.  So far no sign of memory repair on those
devices, but if they support it and use standard interfaces, then all good.

(for type 2 support)
https://lore.kernel.org/linux-cxl/20250205151950.25268-1-alucerop@amd.com/T/#t

> 
> It'll be lovely if vendors could agree on this interface you're proposing but
> I won't hold my breath...

True enough that vendors will do silly things.  However, if they want Linux
support for a CXL memory device out of the box, the only option they
have is the driver stack that binds to the class code.
Dan and the rest of us are pushing back very hard (possibly too hard)
on vendor-defined interfaces etc.  If we don't see at least an effort
to standardize them (not necessarily in the CXL spec, but somewhere)
then their chances of getting upstream support are near 0. There have
been a few 'discussions' about exceptions to this in the past.

Their only option might be fwctl for a few things, with its taints
etc.  The controls used for this set are explicitly excluded from being
used via fwctl:

https://lore.kernel.org/linux-cxl/20250204220430.4146187-1-dave.jiang@intel.com/T/#ma478ae9d7529f31ec6f08c2e98432d5721ca0b0e

If a vendor wants to do their own thing then good luck to them but don't expect
the standard software stack to work.  So far I have seen no sign of anyone
doing a non compliant memory expansion device and there are quite a
few spec compliant ones.

We will get weird memory devices with accelerators perhaps, but then that
memory won't be treated as normal memory anyway and likely has a custom
RAS solution.  If they do use the spec-defined commands, then this
support should work fine. It just needs a call from their driver to hook
it up.

It might not be the best analogy, but I think of the CXL type 3 device
spec as being similar to NVMe. There are lots of options, but most people
will run one standard driver.  There may be custom features, but the
device had better be compatible with the NVMe driver if it advertises
the class code (there are compliance suites etc).

> 
> > Thanks for taking time to continue the discussion and I think we
> > are converging somewhat even if there is further to go.  
> 
> Yap, I think so. A lot of things got cleared up for me too, so thanks too.
> I'm sure you know what the important things are that we need to pay attention
> when it comes to designing this with a broader audience in mind.

I'll encourage a few more memory vendor folk (who have visibility
of more specific implementations / use cases than me) to take another
look at v19 and the user space tooling.  Hopefully they will point
out any remaining holes.

Thanks,

Jonathan


> 
> Thx.
> 


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-02-06 13:39                             ` Jonathan Cameron
@ 2025-02-17 13:23                               ` Borislav Petkov
  2025-02-18 16:51                                 ` Jonathan Cameron
  0 siblings, 1 reply; 87+ messages in thread
From: Borislav Petkov @ 2025-02-17 13:23 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm, Vandana Salve

On Thu, Feb 06, 2025 at 01:39:49PM +0000, Jonathan Cameron wrote:
> Shiju is just finalizing a v19 + the userspace code.  So may make
> sense to read this reply only after that is out!

Saw them.

So, from a cursory view, all that sysfs marshalling that happens in patch
1 and 2 here:

https://lore.kernel.org/r/20250207143028.1865-1-shiju.jose@huawei.com

is not really needed, AFAICT.

You can basically check CXL_EVENT_RECORD_FLAG_MAINT_NEEDED *in the kernel* and
go and start the recovery action. rasdaemon is basically logging the error
record and parroting it back into sysfs which is completely unnecessary - the
kernel can simply do that.

Patches 3 and 4 are probably more of a justification for the userspace
interaction as the kernel driver is "not ready" to do recovery for <raisins>.

But there I'm also questioning the presence of the sysfs interface - the 
error record could simply be injected raw and the kernel can pick it apart.

Or maybe there's a point for rasdaemon to ponder over all those different
attributes and maybe involve some non-trivial massaging of error info in order
to come at some conclusion and inject that as a recovery action.

I guess I'm missing something and maybe there really is a valid use case to
expose all those attributes through sysfs and use them. But I don't see
a clear reason now...

> For this comment I was referring letting the kernel do the
> stats gathering etc. We would need to put back records from a previous boot.
> That requires almost the same interface as just telling it to repair.
> Note the address to physical memory mapping is not stable across boots
> so we can't just provide a physical address, we need full description.

Right.

> Ah. No not that. I was just meaning the case where it is hard PPR. (hence
> persistent for all time) Once you've done it you can't go back so after
> N uses, any more errors mean you need a new device ASAP. That is as decision
> with a very different threshold to soft PPR where it's a case of you
> do it until you run out of spares, then you fall back to offlining
> pages.  Next boot you get your spares back again and may use them
> differently this time.

Ok.

> True enough. I'm not against doing things in kernel in some cases.  Even
> then I want the controls to allow user space to do more complex things.
> Even in the cases where the devices suggests repair, we may not want to for
> reasons that device can't know about.

Sure, as long as supporting such a use case is important enough to warrant
supporting a user interface indefinitely.

All I'm saying is, it better be worth the effort.

> The interface provides all the data, and all the controls to match.
> 
> Sure, something new might come along that needs additional controls (subchannel
> for DDR5 showed up recently for instance and are in v19) but that extension
> should be easy and fit within the ABI.  Those new 'features' will need
> kernel changes and matching rasdaemon changes anyway as there is new data
> in the error records so this sort of extension should be fine.

As long as you don't break existing usage, you're good. The moment you have to
change how rasdaemon uses the interface with a new rasdaemon, then you need to
support both.

> Agreed. We need an interface we can support indefinitely - there is nothing
> different between doing it sysfs or debugfs. That should be
> extensible in a clean fashion to support new data and matching control.
> 
> We don't have to guarantee that interface supports something 'new' though
> as our crystal balls aren't perfect, but we do want to make extending to
> cover the new straight forward.

Right.

> If a vendor wants to do their own thing then good luck to them but don't expect
> the standard software stack to work.  So far I have seen no sign of anyone
> doing a non compliant memory expansion device and there are quite a
> few spec compliant ones.

Nowadays hw vendors use a lot of Linux to verify hw so catching an unsupported
device early is good. But there's always a case...

> 
> We will get weird memory devices with accelerators perhaps but then that
> memory won't be treated as normal memory anyway and likely has a custom
> RAS solution.  If they do use the spec defined commands, then this
> support should work fine. Just needs a call from their drive to hook
> it up.
> 
> It might not be the best analogy, but I think of the CXL type 3 device
> spec as being similar to NVME. There are lots of options, but most people
> will run one standard driver.  There may be custom features but the
> device better be compatible with the NVME driver if they advertise
> the class code (there are compliance suites etc)

Ack.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-02-17 13:23                               ` Borislav Petkov
@ 2025-02-18 16:51                                 ` Jonathan Cameron
  2025-02-19 18:45                                   ` Borislav Petkov
  0 siblings, 1 reply; 87+ messages in thread
From: Jonathan Cameron @ 2025-02-18 16:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm, Vandana Salve, Steven Rostedt

On Mon, 17 Feb 2025 14:23:22 +0100
Borislav Petkov <bp@alien8.de> wrote:

+CC Steven for question about using tracepoints as injection ABI.

> On Thu, Feb 06, 2025 at 01:39:49PM +0000, Jonathan Cameron wrote:
> > Shiju is just finalizing a v19 + the userspace code.  So may make
> > sense to read this reply only after that is out!  
> 
> Saw them.
> 
> So, from a cursory view, all that sysfs marshalling that happens in patch
> 1 and 2 here:
> 
> https://lore.kernel.org/r/20250207143028.1865-1-shiju.jose@huawei.com
> 
> is not really needed, AFAICT.
> 
> You can basically check CXL_EVENT_RECORD_FLAG_MAINT_NEEDED *in the kernel* and
> go and start the recovery action. rasdaemon is basically logging the error
> record and parroting it back into sysfs which is completely unnecessary - the
> kernel can simply do that.

Hi Borislav,

I think this point is addressing only one case (the one we chose to
prove out the interface, so fair enough!). It is optional whether
a device ever sets the maintenance needed flag or does any of the
tracking needed to do so (which is potentially very expensive to
implement at the finer granularities of repair). There are also no
guarantees on persistence of that tracking over reset.

The intent of the rasdaemon series was to focus on the usability of the
interface, not the perfect decision process - we just picked the
easy case of the device giving us a hint. Perhaps we didn't call
that out clearly.

As a side note, if you are in the situation where the device can do
memory repair without any disruption of memory access, then my
assumption is that in the case where the device would set maintenance
needed and where it is considering soft repair (so no long term cost
to a wrong decision), the device would probably just do it
autonomously and at most we might get a notification.

So I think that if we see this there will be some disruption.
Latency spikes for soft repair or we are looking at hard repair.
In that case we'd need policy on whether to repair at all.
In general the rasdaemon handling in that series is intentionally
simplistic. Real solutions will take time to refine but they
don't need changes to the kernel interface, just when to poke it.

If that assumption about autonomous repair is wrong it would be
helpful for a memory manufacturer to shout!

> 
> Patches 3 and 4 are probably more of a justification for the userspace
> interaction as the kernel driver is "not ready" to do recovery for <raisins>.
> 
> But there I'm also questioning the presence of the sysfs interface - the 
> error record could simply be injected raw and the kernel can pick it apart.

The error record comes out as a tracepoint. Is there any precedent for
injecting those back into the kernel?  Whilst we'd need a subset of the
parsing code I think that's a tall order unless that infrastructure
already exists for some use case I don't know about. We'd also need to
invent a new logging scheme to keep the binary tracepoint around across
boots and poke it back in again. Given this is keeping tracepoint dumps over
kernel boots we'd also need to deal with backwards/forwards compatibility over
kernel version changes, so log the format file as well or convert to
a standard form.

Alternatively we could push these out through a new or modified version of
existing binary interfaces in a standard form.  I think that's unnecessary
duplication, but I'm happy to consider it if that's a path forwards. To me it
seems like a lot of complexity compared to the current solution.

> 
> Or maybe there's a point for rasdaemon to ponder over all those different
> attributes and maybe involve some non-trivial massaging of error info in order
> to come at some conclusion and inject that as a recovery action.

That policy question is a long-term one, but I can suggest 'possible' policies
that might help motivate the discussion:
1. Repair may be very disruptive to memory latency. Delay until a maintenance
   window when a latency spike is acceptable to the customer; until then, rely
   on 'maintenance needed' still representing a relatively low chance of
   failure.
2. Hard repair uses known limited resources - e.g. those are known to match up
   to a particular number of rows in each module. That is not discoverable under
   the CXL spec so would have to come from another source of metadata.
   Apply some sort of fall-off function so that we repair only the very worst
   cases as we run out. The alternative is always to soft offline the memory
   in the OS; the aim is to reduce the chance of having to do that, in a
   somewhat optimal fashion. I'm not sure on the appropriate stats; maybe
   assume a given granule's failure rate follows a Poisson distribution and
   attempt to estimate lambda?  Would
   need an expert in appropriate failure modes or a lot of data to define
   this!
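To make that slightly more concrete (purely an illustrative toy, not part of
the series): the maximum-likelihood estimate of a Poisson rate is just the
sample mean of the counts, and the fall-off could demand a higher estimated
rate as spares become scarce. The threshold scaling below is invented:

```python
# Toy sketch only: per-granule error counts per observation window are
# assumed Poisson; the MLE for lambda is the sample mean. The fall-off
# threshold scaling is invented purely for illustration.

def estimate_lambda(error_counts):
    # Maximum-likelihood estimate of a Poisson rate: the sample mean.
    return sum(error_counts) / len(error_counts)

def should_hard_repair(error_counts, spares_left, spares_total):
    # Demand a higher estimated error rate as spare rows become scarce,
    # so only the very worst granules consume the last spares.
    lam = estimate_lambda(error_counts)
    threshold = 1.0 * (spares_total / max(spares_left, 1))
    return lam > threshold

# The same granule is repaired while spares are plentiful, but not when
# only one spare row remains.
print(should_hard_repair([3, 5, 4], spares_left=8, spares_total=8))   # True
print(should_hard_repair([3, 5, 4], spares_left=1, spares_total=8))   # False
```

As noted, picking the real fall-off function and failure model needs input
from people with actual field data.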

> 
> I guess I'm missing something and maybe there really is a valid use case to
> expose all those attributes through sysfs and use them. But I don't see
> a clear reason now...

It is the simplest interface that we have come up with so far. I'm fully open
to alternatives that provide a clean way to get this data back into the
kernel and play well with existing logging tooling (e.g. rasdaemon)

Some things we could do,
* Store binary of trace event and reinject. As above + we would have to be
  very careful that any changes to the event are made with knowledge that
  we need to handle this path.  Little or no marshaling / formatting code
  in userspace, but new logging infrastructure needed + a chardev /ioctl
  to inject the data and a bit of userspace glue to talk to it.
* Reinject a binary representation we define, via an ioctl on some
  chardev we create for the purpose.  Userspace code has to take
  key value pairs and process them into this form.  So similar amount
  of marshaling code to what we have for sysfs.
* Or what we currently propose: write a set of key-value pairs to a simple
  (though multi-file) sysfs interface. As you've noted, marshaling is needed.
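For a sense of how much userspace marshaling that last option implies, here
is a hypothetical sketch. The attribute names ('bank', 'row', 'column',
'repair') and directory layout are invented stand-ins, not the ABI defined
by the patches, and a scratch directory stands in for sysfs so the sketch is
self-contained:

```python
import os
import tempfile

def write_repair_attrs(sysfs_dir, record):
    # Marshal a stored error record's key/value pairs into one write per
    # (hypothetical) sysfs attribute, then trigger the repair via a final
    # 'repair' attribute write.
    for key, value in record.items():
        with open(os.path.join(sysfs_dir, key), "w") as f:
            f.write(str(value))
    with open(os.path.join(sysfs_dir, "repair"), "w") as f:
        f.write("1")

# A scratch directory stands in for the real sysfs instance directory.
demo_dir = tempfile.mkdtemp()
# An error record as userspace might reconstruct it from a logging DB.
record = {"bank": 2, "row": 0x1a30, "column": 7}
write_repair_attrs(demo_dir, record)
print(sorted(os.listdir(demo_dir)))   # ['bank', 'column', 'repair', 'row']
```

The point is just that the marshaling is a handful of string writes, human
readable at every step.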

> 
> > For this comment I was referring letting the kernel do the
> > stats gathering etc. We would need to put back records from a previous boot.
> > That requires almost the same interface as just telling it to repair.
> > Note the address to physical memory mapping is not stable across boots
> > so we can't just provide a physical address, we need full description.  
> 
> Right.
> 
> > Ah. No not that. I was just meaning the case where it is hard PPR. (hence
> > persistent for all time) Once you've done it you can't go back so after
> > N uses, any more errors mean you need a new device ASAP. That is as decision
> > with a very different threshold to soft PPR where it's a case of you
> > do it until you run out of spares, then you fall back to offlining
> > pages.  Next boot you get your spares back again and may use them
> > differently this time.  
> 
> Ok.
> 
> > True enough. I'm not against doing things in kernel in some cases.  Even
> > then I want the controls to allow user space to do more complex things.
> > Even in the cases where the devices suggests repair, we may not want to for
> > reasons that device can't know about.  
> 
> Sure, as long as supporting such a use case is important enough to warrant
> supporting a user interface indefinitely.
> 
> All I'm saying is, it better be worth the effort.

Absolutely agree - it is a trade off against supporting that interface.

> 
> > The interface provides all the data, and all the controls to match.
> > 
> > Sure, something new might come along that needs additional controls (subchannel
> > for DDR5 showed up recently for instance and are in v19) but that extension
> > should be easy and fit within the ABI.  Those new 'features' will need
> > kernel changes and matching rasdaemon changes anyway as there is new data
> > in the error records so this sort of extension should be fine.  
> 
> As long as you don't break existing usage, you're good. The moment you have to
> change how rasdaemon uses the interface with a new rasdaemon, then you need to
> support both.

Agreed.  Given that any new thing should be optional anyway (either you have
subchannels or you don't), that should come naturally.  I'd not expect to see
anything new being added for software-only reasons, and we need to support
old hardware.

> 
> > Agreed. We need an interface we can support indefinitely - there is nothing
> > different between doing it sysfs or debugfs. That should be
> > extensible in a clean fashion to support new data and matching control.
> > 
> > We don't have to guarantee that interface supports something 'new' though
> > as our crystal balls aren't perfect, but we do want to make extending to
> > cover the new straight forward.  
> 
> Right.
> 
> > If a vendor wants to do their own thing then good luck to them but don't expect
> > the standard software stack to work.  So far I have seen no sign of anyone
> > doing a non compliant memory expansion device and there are quite a
> > few spec compliant ones.  
> 
> Nowadays hw vendors use a lot of Linux to verify hw so catching an unsupported
> device early is good. But there's always a case...

True enough.  They get to find out how grumpy the maintainers are - thankfully
this stuff is typically mostly device firmware defined so we can (and will)
push back hard.

> 
> > 
> > We will get weird memory devices with accelerators perhaps but then that
> > memory won't be treated as normal memory anyway and likely has a custom
> > RAS solution.  If they do use the spec defined commands, then this
> > support should work fine. Just needs a call from their driver to hook
> > it up.
> > 
> > It might not be the best analogy, but I think of the CXL type 3 device
> > spec as being similar to NVME. There are lots of options, but most people
> > will run one standard driver.  There may be custom features but the
> > device better be compatible with the NVME driver if they advertise
> > the class code (there are compliance suites etc)  
> 
> Ack.
> 
> Thx.
> 
Thanks again for your inputs! I hope I've perhaps addressed some of them.

Jonathan


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-02-18 16:51                                 ` Jonathan Cameron
@ 2025-02-19 18:45                                   ` Borislav Petkov
  2025-02-20 12:19                                     ` Jonathan Cameron
  0 siblings, 1 reply; 87+ messages in thread
From: Borislav Petkov @ 2025-02-19 18:45 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm, Vandana Salve, Steven Rostedt

On Tue, Feb 18, 2025 at 04:51:25PM +0000, Jonathan Cameron wrote:
> As a side note, if you are in the situation where the device can do
> memory repair without any disruption of memory access then my
> assumption is in the case where the device would set the maintenance
> needed + where it is considering soft repair (so no long term cost
> to a wrong decision) then the device would probably just do it
> autonomously and at most we might get a notification.

And this is basically what I'm trying to hint at: if you can do recovery
action without userspace involvement, then please, by all means. There's no
need to noodle information back'n'forth through user if the kernel or the
device itself even, can handle it on its own.

More involved stuff should obviously rely on userspace to do more involved
"pondering."

> So I think that if we see this there will be some disruption.
> Latency spikes for soft repair or we are looking at hard repair.
> In that case we'd need policy on whether to repair at all.
> In general the rasdaemon handling in that series is intentionally
> simplistic. Real solutions will take time to refine but they
> don't need changes to the kernel interface, just when to poke it.

I hope so.

> The error record comes out as a trace point. Is there any precedent for
> injecting those back into the kernel? 

I'm just questioning the whole interface and its usability. Not saying it
doesn't make sense - we're simply weighing all options here.

> That policy question is a long term one but I can suggest 'possible' policies
> that might help motivate the discussion
>
> 1. Repair may be very disruptive to memory latency. Delay until a maintenance
>    window when a latency spike is acceptable to the customer; until then,
>    rely on 'maintenance needed' still representing a relatively low chance
>    of failure.

So during the maintenance window, the operator is supposed to do

rasdaemon --start-expensive-repair-operations

?

> 2. Hard repair uses known limited resources - e.g. those are known to match up
>    to a particular number of rows in each module. That is not discoverable under
>    the CXL spec so would have to come from another source of metadata.
>    Apply some sort of fall-off function so that we repair only the very worst
>    cases as we run out. The alternative is always to soft offline the memory
>    in the OS; the aim is to reduce the chance of having to do that, in a
>    somewhat optimal fashion. I'm not sure on the appropriate stats; maybe
>    assume a given granule's failure rate follows a Poisson distribution and
>    attempt to estimate lambda?  Would
>    need an expert in appropriate failure modes or a lot of data to define
>    this!

I have no clue what you're saying here. :-)

> It is the simplest interface that we have come up with so far. I'm fully open
> to alternatives that provide a clean way to get this data back into the
> kernel and play well with existing logging tooling (e.g. rasdaemon)
> 
> Some things we could do,
> * Store binary of trace event and reinject. As above + we would have to be
>   very careful that any changes to the event are made with knowledge that
>   we need to handle this path.  Little or no marshaling / formatting code
>   in userspace, but new logging infrastructure needed + a chardev /ioctl
>   to inject the data and a bit of userspace glue to talk to it.
> * Reinject a binary representation we define, via an ioctl on some
>   chardev we create for the purpose.  Userspace code has to take
>   key value pairs and process them into this form.  So similar amount
>   of marshaling code to what we have for sysfs.
> * Or what we currently propose, write set of key value pairs to a simple
>   (though multifile) sysfs interface. As you've noted marshaling is needed.

... and the advantage of having such a sysfs interface: it is human readable
and usable vs having to use a tool to create a binary blob in a certain
format...

Ok, then. Let's give that API a try... I guess I need to pick up the EDAC
patches from here:

https://lore.kernel.org/r/20250212143654.1893-1-shiju.jose@huawei.com

If so, there's an EDAC patch 14 which is not together with the first 4. And
I was thinking of taking the first 4 or 5 and then giving other folks an
immutable branch in the EDAC tree which they can use to base the CXL stuff on
top.

What's up?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
  2025-02-19 18:45                                   ` Borislav Petkov
@ 2025-02-20 12:19                                     ` Jonathan Cameron
  0 siblings, 0 replies; 87+ messages in thread
From: Jonathan Cameron @ 2025-02-20 12:19 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tony.luck@intel.com,
	rafael@kernel.org, lenb@kernel.org, mchehab@kernel.org,
	dan.j.williams@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
	leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
	jiaqiyan@google.com, Jon.Grimm@amd.com,
	dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, jthoughton@google.com,
	somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
	duenwen@google.com, gthelen@google.com,
	wschwartz@amperecomputing.com, dferguson@amperecomputing.com,
	wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
	Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
	wanghuiqiang, Linuxarm, Vandana Salve, Steven Rostedt

On Wed, 19 Feb 2025 19:45:33 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Tue, Feb 18, 2025 at 04:51:25PM +0000, Jonathan Cameron wrote:
> > As a side note, if you are in the situation where the device can do
> > memory repair without any disruption of memory access then my
> > assumption is in the case where the device would set the maintenance
> > needed + where it is considering soft repair (so no long term cost
> > to a wrong decision) then the device would probably just do it
> > autonomously and at most we might get a notification.  
> 
> And this is basically what I'm trying to hint at: if you can do recovery
> action without userspace involvement, then please, by all means. There's no
> need to noodle information back'n'forth through user if the kernel or the
> device itself even, can handle it on its own.
> 
> More involved stuff should obviously rely on userspace to do more involved
> "pondering."

Let's explore this further as a follow-up. A policy switch to let the kernel
do the 'easy' stuff (assuming device didn't do it) makes sense if this
particular combination is common.

> 
> > So I think that if we see this there will be some disruption.
> > Latency spikes for soft repair or we are looking at hard repair.
> > In that case we'd need policy on whether to repair at all.
> > In general the rasdaemon handling in that series is intentionally
> > simplistic. Real solutions will take time to refine but they
> > don't need changes to the kernel interface, just when to poke it.  
> 
> I hope so.
> 
> > The error record comes out as a trace point. Is there any precedent for
> > injecting those back into the kernel?   
> 
> I'm just questioning the whole interface and its usability. Not saying it
> doesn't make sense - we're simply weighing all options here.
> 
> > That policy question is a long term one but I can suggest 'possible' policies
> > that might help motivate the discussion
> >
> > 1. Repair may be very disruptive to memory latency. Delay until a
> >    maintenance window when a latency spike is acceptable to the customer;
> >    until then, rely on 'maintenance needed' still representing a relatively
> >    low chance of failure.
> 
> So during the maintenance window, the operator is supposed to do
> 
> rasdaemon --start-expensive-repair-operations

Yes, it would be something along those lines.  Or a script very similar to
the boot one Shiju wrote.  Scan the DB and find what needs repairing + do so.
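As a purely hypothetical sketch of such a maintenance-window pass (the table
name, schema, and threshold below are invented; rasdaemon's actual database
differs):

```python
import sqlite3

# Invented schema standing in for a rasdaemon-style error database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE mem_errors (bank INTEGER, row INTEGER, count INTEGER)")
db.executemany("INSERT INTO mem_errors VALUES (?, ?, ?)",
               [(0, 0x10, 12), (0, 0x20, 1), (1, 0x30, 9)])

REPAIR_THRESHOLD = 8  # arbitrary cut-off, purely for illustration

def rows_needing_repair(conn):
    # Scan the accumulated error records for granules whose count crossed
    # the threshold, worst first; a real script would then write each one's
    # attributes to the repair control interface.
    cur = conn.execute(
        "SELECT bank, row FROM mem_errors WHERE count >= ? ORDER BY count DESC",
        (REPAIR_THRESHOLD,))
    return cur.fetchall()

print(rows_needing_repair(db))   # [(0, 16), (1, 48)]
```

The repair step itself is exactly the sysfs write sequence discussed earlier
in the thread, so the maintenance-window tool is little more than this query
plus that marshaling.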

> 
> ?
> 
> > 2. Hard repair uses known limited resources - e.g. those are known to match up
> >    to a particular number of rows in each module. That is not discoverable under
> >    the CXL spec so would have to come from another source of metadata.
> >    Apply some sort of fall-off function so that we repair only the very
> >    worst cases as we run out. The alternative is always to soft offline
> >    the memory in the OS; the aim is to reduce the chance of having to do
> >    that, in a somewhat optimal fashion. I'm not sure on the appropriate
> >    stats; maybe assume a given granule's failure rate follows a Poisson
> >    distribution and attempt to estimate lambda?  Would
> >    need an expert in appropriate failure modes or a lot of data to define
> >    this!  
> 
> I have no clue what you're saying here. :-)

I'll write something up at some point as it's definitely a complex
topic and I need to find a statistician + hardware folk with error models to
help flesh it out. 

There is another topic to look at which is what to do with synchronous poison
if we can repair the memory and bring it back into use.
I can't find the thread, but last time I asked about recovering from that, the
mm folk said they'd need to see the code + use cases (fair enough!).

> 
> > It is the simplest interface that we have come up with so far. I'm fully open
> > to alternatives that provide a clean way to get this data back into the
> > kernel and play well with existing logging tooling (e.g. rasdaemon)
> > 
> > Some things we could do,
> > * Store binary of trace event and reinject. As above + we would have to be
> >   very careful that any changes to the event are made with knowledge that
> >   we need to handle this path.  Little or no marshaling / formatting code
> >   in userspace, but new logging infrastructure needed + a chardev /ioctl
> >   to inject the data and a bit of userspace glue to talk to it.
> > * Reinject a binary representation we define, via an ioctl on some
> >   chardev we create for the purpose.  Userspace code has to take
> >   key value pairs and process them into this form.  So similar amount
> >   of marshaling code to what we have for sysfs.
> > * Or what we currently propose, write set of key value pairs to a simple
> >   (though multifile) sysfs interface. As you've noted marshaling is needed.  
> 
> ... and the advantage of having such a sysfs interface: it is human readable
> and usable vs having to use a tool to create a binary blob in a certain
> format...
> 
> Ok, then. Let's give that API a try... I guess I need to pick up the EDAC
> patches from here:
> 
> https://lore.kernel.org/r/20250212143654.1893-1-shiju.jose@huawei.com
> 
> If so, there's an EDAC patch 14 which is not together with the first 4. And
> I was thinking of taking the first 4 or 5 and then giving other folks an
> immutable branch in the EDAC tree which they can use to base the CXL stuff on
> top.
> 
> What's up?

My fault. I asked Shiju to split the more complex ABI for sparing out
to build the complexity up rather than having it all in one patch.

Should be fine for you to take 1-4 and 14 which is all the EDAC parts.

For 5 and 6 Rafael acked the ACPI part (5), and the ACPI ras2 scrub driver
has no other dependencies so I think that should go through your
tree as well, though no need to be in the immutable branch.

Dave Jiang can work his magic on the CXL stuff on top of a merge of your
immutable branch.

Thanks!

Jonathan
> 


^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2025-02-20 12:19 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
2025-01-06 12:09 ` [PATCH v18 01/19] EDAC: Add support for EDAC device features control shiju.jose
2025-01-06 13:37   ` Borislav Petkov
2025-01-06 14:48     ` Shiju Jose
2025-01-13 15:06   ` Mauro Carvalho Chehab
2025-01-14  9:55     ` Jonathan Cameron
2025-01-14 10:08     ` Shiju Jose
2025-01-14 11:33       ` Mauro Carvalho Chehab
2025-01-30 19:18   ` Daniel Ferguson
2025-01-06 12:09 ` [PATCH v18 02/19] EDAC: Add scrub control feature shiju.jose
2025-01-06 15:57   ` Borislav Petkov
2025-01-06 19:34     ` Shiju Jose
2025-01-07  7:32       ` Borislav Petkov
2025-01-07  9:23         ` Shiju Jose
2025-01-08 15:47         ` Shiju Jose
2025-01-13 15:50   ` Mauro Carvalho Chehab
2025-01-30 19:18   ` Daniel Ferguson
2025-01-06 12:09 ` [PATCH v18 03/19] EDAC: Add ECS " shiju.jose
2025-01-13 16:09   ` Mauro Carvalho Chehab
2025-01-06 12:10 ` [PATCH v18 04/19] EDAC: Add memory repair " shiju.jose
2025-01-09  9:19   ` Borislav Petkov
2025-01-09 11:00     ` Shiju Jose
2025-01-09 12:32       ` Borislav Petkov
2025-01-09 14:24         ` Jonathan Cameron
2025-01-09 15:18           ` Borislav Petkov
2025-01-09 16:01             ` Jonathan Cameron
2025-01-09 16:19               ` Borislav Petkov
2025-01-09 18:34                 ` Jonathan Cameron
2025-01-09 23:51                   ` Dan Williams
2025-01-10 11:01                     ` Jonathan Cameron
2025-01-10 22:49                       ` Dan Williams
2025-01-13 11:40                         ` Jonathan Cameron
2025-01-14 19:35                           ` Dan Williams
2025-01-15 10:07                             ` Jonathan Cameron
2025-01-15 11:35                             ` Mauro Carvalho Chehab
2025-01-11 17:12                   ` Borislav Petkov
2025-01-13 11:07                     ` Jonathan Cameron
2025-01-21 16:16                       ` Borislav Petkov
2025-01-21 18:16                         ` Jonathan Cameron
2025-01-22 19:09                           ` Borislav Petkov
2025-02-06 13:39                             ` Jonathan Cameron
2025-02-17 13:23                               ` Borislav Petkov
2025-02-18 16:51                                 ` Jonathan Cameron
2025-02-19 18:45                                   ` Borislav Petkov
2025-02-20 12:19                                     ` Jonathan Cameron
2025-01-14 13:10                   ` Mauro Carvalho Chehab
2025-01-14 12:57               ` Mauro Carvalho Chehab
2025-01-14 12:38           ` Mauro Carvalho Chehab
2025-01-14 13:05             ` Jonathan Cameron
2025-01-14 14:39               ` Mauro Carvalho Chehab
2025-01-14 11:47   ` Mauro Carvalho Chehab
2025-01-14 12:31     ` Shiju Jose
2025-01-14 14:26       ` Mauro Carvalho Chehab
2025-01-14 13:47   ` Mauro Carvalho Chehab
2025-01-14 14:30     ` Shiju Jose
2025-01-15 12:03       ` Mauro Carvalho Chehab
2025-01-06 12:10 ` [PATCH v18 05/19] ACPI:RAS2: Add ACPI RAS2 driver shiju.jose
2025-01-21 23:01   ` Daniel Ferguson
2025-01-22 15:38     ` Shiju Jose
2025-01-30 19:19   ` Daniel Ferguson
2025-01-06 12:10 ` [PATCH v18 06/19] ras: mem: Add memory " shiju.jose
2025-01-21 23:01   ` Daniel Ferguson
2025-01-30 19:19   ` Daniel Ferguson
2025-01-06 12:10 ` [PATCH v18 07/19] cxl: Refactor user ioctl command path from mds to mailbox shiju.jose
2025-01-06 12:10 ` [PATCH v18 08/19] cxl: Add skeletal features driver shiju.jose
2025-01-06 12:10 ` [PATCH v18 09/19] cxl: Enumerate feature commands shiju.jose
2025-01-06 12:10 ` [PATCH v18 10/19] cxl: Add Get Supported Features command for kernel usage shiju.jose
2025-01-06 12:10 ` [PATCH v18 11/19] cxl: Add features driver attribute to emit number of features supported shiju.jose
2025-01-06 12:10 ` [PATCH v18 12/19] cxl/mbox: Add GET_FEATURE mailbox command shiju.jose
2025-01-06 12:10 ` [PATCH v18 13/19] cxl/mbox: Add SET_FEATURE " shiju.jose
2025-01-06 12:10 ` [PATCH v18 14/19] cxl: Setup exclusive CXL features that are reserved for the kernel shiju.jose
2025-01-06 12:10 ` [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol scrub control feature shiju.jose
2025-01-24 20:38   ` Dan Williams
2025-01-27 10:06     ` Jonathan Cameron
2025-01-27 12:53     ` Shiju Jose
2025-01-27 23:17       ` Dan Williams
2025-01-29 12:28         ` Shiju Jose
2025-01-06 12:10 ` [PATCH v18 16/19] cxl/memfeature: Add CXL memory device ECS " shiju.jose
2025-01-06 12:10 ` [PATCH v18 17/19] cxl/mbox: Add support for PERFORM_MAINTENANCE mailbox command shiju.jose
2025-01-06 12:10 ` [PATCH v18 18/19] cxl/memfeature: Add CXL memory device soft PPR control feature shiju.jose
2025-01-06 12:10 ` [PATCH v18 19/19] cxl/memfeature: Add CXL memory device memory sparing " shiju.jose
2025-01-13 14:46 ` [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers Mauro Carvalho Chehab
2025-01-13 15:36   ` Jonathan Cameron
2025-01-14 14:06     ` Mauro Carvalho Chehab
2025-01-13 18:15   ` Shiju Jose
2025-01-30 19:18 ` Daniel Ferguson
2025-02-03  9:25   ` Shiju Jose

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).