From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: Shiju Jose <shiju.jose@huawei.com>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"bp@alien8.de" <bp@alien8.de>,
"tony.luck@intel.com" <tony.luck@intel.com>,
"rafael@kernel.org" <rafael@kernel.org>,
"lenb@kernel.org" <lenb@kernel.org>,
"mchehab@kernel.org" <mchehab@kernel.org>,
"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
"dave@stgolabs.net" <dave@stgolabs.net>,
"Jonathan Cameron" <jonathan.cameron@huawei.com>,
"dave.jiang@intel.com" <dave.jiang@intel.com>,
"alison.schofield@intel.com" <alison.schofield@intel.com>,
"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
"ira.weiny@intel.com" <ira.weiny@intel.com>,
"david@redhat.com" <david@redhat.com>,
"Vilas.Sridharan@amd.com" <Vilas.Sridharan@amd.com>,
"leo.duran@amd.com" <leo.duran@amd.com>,
"Yazen.Ghannam@amd.com" <Yazen.Ghannam@amd.com>,
"rientjes@google.com" <rientjes@google.com>,
"jiaqiyan@google.com" <jiaqiyan@google.com>,
"Jon.Grimm@amd.com" <Jon.Grimm@amd.com>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"naoya.horiguchi@nec.com" <naoya.horiguchi@nec.com>,
"james.morse@arm.com" <james.morse@arm.com>,
"jthoughton@google.com" <jthoughton@google.com>,
"somasundaram.a@hpe.com" <somasundaram.a@hpe.com>,
"erdemaktas@google.com" <erdemaktas@google.com>,
"pgonda@google.com" <pgonda@google.com>,
"duenwen@google.com" <duenwen@google.com>,
"gthelen@google.com" <gthelen@google.com>,
"wschwartz@amperecomputing.com" <wschwartz@amperecomputing.com>,
"dferguson@amperecomputing.com" <dferguson@amperecomputing.com>,
"wbs@os.amperecomputing.com" <wbs@os.amperecomputing.com>,
"nifan.cxl@gmail.com" <nifan.cxl@gmail.com>,
tanxiaofei <tanxiaofei@huawei.com>,
"Zengtao (B)" <prime.zeng@hisilicon.com>,
"Roberto Sassu" <roberto.sassu@huawei.com>,
"kangkang.shen@futurewei.com" <kangkang.shen@futurewei.com>,
wanghuiqiang <wanghuiqiang@huawei.com>,
Linuxarm <linuxarm@huawei.com>
Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
Date: Tue, 14 Jan 2025 15:26:17 +0100
Message-ID: <20250114152617.14eb41b5@foz.lan>
In-Reply-To: <df8b3c3bffd24e1e8eb05b2ec53b3c58@huawei.com>
On Tue, 14 Jan 2025 12:31:44 +0000
Shiju Jose <shiju.jose@huawei.com> wrote:
> Hi Mauro,
>
> Thanks for the comments.
>
> >-----Original Message-----
> >From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> >Sent: 14 January 2025 11:48
> >To: Shiju Jose <shiju.jose@huawei.com>
> >Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org; linux-
> >acpi@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org;
> >bp@alien8.de; tony.luck@intel.com; rafael@kernel.org; lenb@kernel.org;
> >mchehab@kernel.org; dan.j.williams@intel.com; dave@stgolabs.net; Jonathan
> >Cameron <jonathan.cameron@huawei.com>; dave.jiang@intel.com;
> >alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
> >david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com;
> >Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com;
> >Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
> >naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com;
> >somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com;
> >duenwen@google.com; gthelen@google.com;
> >wschwartz@amperecomputing.com; dferguson@amperecomputing.com;
> >wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei
> ><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto
> >Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com;
> >wanghuiqiang <wanghuiqiang@huawei.com>; Linuxarm
> ><linuxarm@huawei.com>
> >Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
> >
> >On Mon, 6 Jan 2025 12:10:00 +0000
> ><shiju.jose@huawei.com> wrote:
> >
> >> From: Shiju Jose <shiju.jose@huawei.com>
> >>
> >> Add a generic EDAC memory repair control driver to manage memory repairs
> >> in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing
> >> features.
> >>
> >> For example, a CXL device with DRAM components that support PPR features
> >> may implement PPR maintenance operations. DRAM components may support two
> >> types of PPR, hard PPR, for a permanent row repair, and soft PPR, for a
> >> temporary row repair. Soft PPR is much faster than hard PPR, but the repair
> >> is lost with a power cycle.
> >> Similarly a CXL memory device may support soft and hard memory sparing at
> >> cacheline, row, bank and rank granularities. Memory sparing is defined as
> >> a repair function that replaces a portion of memory with a portion of
> >> functional memory at that same granularity.
> >> When a CXL device detects an error in a memory, it may report the host of
> >> the need for a repair maintenance operation by using an event record where
> >> the "maintenance needed" flag is set. The event records contains the device
> >> physical address(DPA) and other attributes of the memory to repair (such as
> >> channel, sub-channel, bank group, bank, rank, row, column etc). The kernel
> >> will report the corresponding CXL general media or DRAM trace event to
> >> userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
> >> operation in response to the device request via the sysfs repair control.
> >>
> >> Device with memory repair features registers with EDAC device driver,
> >> which retrieves memory repair descriptor from EDAC memory repair driver
> >> and exposes the sysfs repair control attributes to userspace in
> >> /sys/bus/edac/devices/<dev-name>/mem_repairX/.
> >>
> >> The common memory repair control interface abstracts the control of
> >> arbitrary memory repair functionality into a standardized set of functions.
> >> The sysfs memory repair attribute nodes are only available if the client
> >> driver has implemented the corresponding attribute callback function and
> >> provided operations to the EDAC device driver during registration.
> >>
> >> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> >> ---
> >> .../ABI/testing/sysfs-edac-memory-repair | 244 +++++++++
> >> Documentation/edac/features.rst | 3 +
> >> Documentation/edac/index.rst | 1 +
> >> Documentation/edac/memory_repair.rst | 101 ++++
> >> drivers/edac/Makefile | 2 +-
> >> drivers/edac/edac_device.c | 33 ++
> >> drivers/edac/mem_repair.c | 492 ++++++++++++++++++
> >> include/linux/edac.h | 139 +++++
> >> 8 files changed, 1014 insertions(+), 1 deletion(-)
> >> create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
> >> create mode 100644 Documentation/edac/memory_repair.rst
> >> create mode 100755 drivers/edac/mem_repair.c
> >>
> >> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair
> >> new file mode 100644
> >> index 000000000000..e9268f3780ed
> >> --- /dev/null
> >> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
> >> @@ -0,0 +1,244 @@
> >> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX
> >> +Date: Jan 2025
> >> +KernelVersion: 6.14
> >> +Contact: linux-edac@vger.kernel.org
> >> +Description:
> >> + The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
> >> + pertains to the memory media repair features control, such as
> >> + PPR (Post Package Repair), memory sparing etc, where<dev-name>
> >> + directory corresponds to a device registered with the EDAC
> >> + device driver for the memory repair features.
> >> +
> >> + Post Package Repair is a maintenance operation requests the memory
> >> + device to perform a repair operation on its media, in detail is a
> >> + memory self-healing feature that fixes a failing memory location by
> >> + replacing it with a spare row in a DRAM device. For example, a
> >> + CXL memory device with DRAM components that support PPR features may
> >> + implement PPR maintenance operations. DRAM components may support
> >> + two types of PPR functions: hard PPR, for a permanent row repair, and
> >> + soft PPR, for a temporary row repair. soft PPR is much faster than
> >> + hard PPR, but the repair is lost with a power cycle.
> >> +
> >> + Memory sparing is a repair function that replaces a portion
> >> + of memory with a portion of functional memory at that same
> >> + sparing granularity. Memory sparing has cacheline/row/bank/rank
> >> + sparing granularities. For example, in memory-sparing mode,
> >> + one memory rank serves as a spare for other ranks on the same
> >> + channel in case they fail. The spare rank is held in reserve and
> >> + not used as active memory until a failure is indicated, with
> >> + reserved capacity subtracted from the total available memory
> >> + in the system.The DIMM installation order for memory sparing
> >> + varies based on the number of processors and memory modules
> >> + installed in the server. After an error threshold is surpassed
> >> + in a system protected by memory sparing, the content of a failing
> >> + rank of DIMMs is copied to the spare rank. The failing rank is
> >> + then taken offline and the spare rank placed online for use as
> >> + active memory in place of the failed rank.
> >> +
> >> + The sysfs attributes nodes for a repair feature are only
> >> + present if the parent driver has implemented the corresponding
> >> + attr callback function and provided the necessary operations
> >> + to the EDAC device driver during registration.
> >> +
> >> + In some states of system configuration (e.g. before address
> >> + decoders have been configured), memory devices (e.g. CXL)
> >> + may not have an active mapping in the main host address
> >> + physical address map. As such, the memory to repair must be
> >> + identified by a device specific physical addressing scheme
> >> + using a device physical address(DPA). The DPA and other control
> >> + attributes to use will be presented in related error records.
> >> +
> >> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
> >> +Date: Jan 2025
> >> +KernelVersion: 6.14
> >> +Contact: linux-edac@vger.kernel.org
> >> +Description:
> >> + (RO) Memory repair function type. For eg. post package repair,
> >> + memory sparing etc.
> >> + EDAC_SOFT_PPR - Soft post package repair
> >> + EDAC_HARD_PPR - Hard post package repair
> >> + EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
> >> + EDAC_ROW_MEM_SPARING - Row memory sparing
> >> + EDAC_BANK_MEM_SPARING - Bank memory sparing
> >> + EDAC_RANK_MEM_SPARING - Rank memory sparing
> >> + All other values are reserved.
> >
> >These strings are too long. Why are they in upper case? IMO:
> >
> > soft-ppr, hard-ppr, ... would be enough.
> >
> Here the attribute returns the repair type of the memory repair instance as
> a single value (such as 0, 1 or 2), not as a decoded string (e.g.
> "EDAC_SOFT_PPR"). The values are defined as enums (EDAC_SOFT_PPR,
> EDAC_HARD_PPR, etc.) for the memory repair interface in include/linux/edac.h.
>
> enum edac_mem_repair_function {
> EDAC_SOFT_PPR,
> EDAC_HARD_PPR,
> EDAC_CACHELINE_MEM_SPARING,
> EDAC_ROW_MEM_SPARING,
> EDAC_BANK_MEM_SPARING,
> EDAC_RANK_MEM_SPARING,
> };
>
> I documented the return value in terms of the above enums.
The ABI documentation describes exactly which numeric or string values will
be there.
So, if you write:
EDAC_SOFT_PPR
it means the string "EDAC_SOFT_PPR", not the numeric value zero.
Also, as I explained at:
https://lore.kernel.org/linux-edac/1bf421f9d1924d68860d08c70829a705@huawei.com/T/#m1e60da13198b47701a4c2f740d4b78701f912d2d
it doesn't make sense to report soft/hard PPR here, as the persist mode
is designed to live in a different sysfs devnode (/persist_mode in your
proposal).
So, here you need to fold EDAC_SOFT_PPR and EDAC_HARD_PPR into a single
value ("ppr").
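To make the suggestion concrete, here is a minimal userspace sketch (not the
patch's code) of a repair_function node that reports a short lowercase string
with soft/hard PPR folded into one entry. The enum, the table, the
repair_function_show() helper and every string other than "ppr" are
hypothetical names chosen only for illustration; a real sysfs show() callback
would have a different signature and would use the kernel's own helpers.

```c
#include <stdio.h>

/* Hypothetical enum: soft and hard PPR are folded into a single entry,
 * because persistence would be reported by a separate node. */
enum repair_function {
	REPAIR_PPR,
	REPAIR_CACHELINE_SPARING,
	REPAIR_ROW_SPARING,
	REPAIR_BANK_SPARING,
	REPAIR_RANK_SPARING,
	REPAIR_FUNCTION_COUNT,
};

/* Assumed string names; only "ppr" comes from the discussion above. */
static const char *const repair_function_names[REPAIR_FUNCTION_COUNT] = {
	[REPAIR_PPR]               = "ppr",
	[REPAIR_CACHELINE_SPARING] = "cacheline-sparing",
	[REPAIR_ROW_SPARING]       = "row-sparing",
	[REPAIR_BANK_SPARING]      = "bank-sparing",
	[REPAIR_RANK_SPARING]      = "rank-sparing",
};

/* Userspace stand-in for a show() callback: writes the string name plus a
 * newline into buf and returns the number of characters snprintf would
 * write, or -1 for an out-of-range value. */
static int repair_function_show(enum repair_function fn, char *buf, size_t len)
{
	if ((unsigned int)fn >= REPAIR_FUNCTION_COUNT)
		return -1;
	return snprintf(buf, len, "%s\n", repair_function_names[fn]);
}
```

With this shape, the ABI file can list the literal strings a reader will see
in the node, and userspace does not need the kernel header to decode a number.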
-
Btw, very few sysfs nodes use numbers for things that can be mapped to
enums:
$ git grep -l "\- 0" Documentation/ABI|wc -l
20
(several of those are actually false positives)
and this is done mostly when the node reports what the hardware actually
outputs when some register is read.
Thanks,
Mauro