linux-acpi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: <shiju.jose@huawei.com>, <linux-edac@vger.kernel.org>,
	<linux-cxl@vger.kernel.org>, <linux-acpi@vger.kernel.org>,
	<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<bp@alien8.de>, <tony.luck@intel.com>, <rafael@kernel.org>,
	<lenb@kernel.org>, <mchehab@kernel.org>, <dave@stgolabs.net>,
	<dave.jiang@intel.com>, <alison.schofield@intel.com>,
	<vishal.l.verma@intel.com>, <ira.weiny@intel.com>,
	<david@redhat.com>, <Vilas.Sridharan@amd.com>,
	<leo.duran@amd.com>, <Yazen.Ghannam@amd.com>,
	<rientjes@google.com>, <jiaqiyan@google.com>, <Jon.Grimm@amd.com>,
	<dave.hansen@linux.intel.com>, <naoya.horiguchi@nec.com>,
	<james.morse@arm.com>, <jthoughton@google.com>,
	<somasundaram.a@hpe.com>, <erdemaktas@google.com>,
	<pgonda@google.com>, <duenwen@google.com>, <gthelen@google.com>,
	<wschwartz@amperecomputing.com>, <dferguson@amperecomputing.com>,
	<wbs@os.amperecomputing.com>, <nifan.cxl@gmail.com>,
	<tanxiaofei@huawei.com>, <prime.zeng@hisilicon.com>,
	<roberto.sassu@huawei.com>, <kangkang.shen@futurewei.com>,
	<wanghuiqiang@huawei.com>, <linuxarm@huawei.com>
Subject: Re: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol scrub control feature
Date: Mon, 27 Jan 2025 10:06:17 +0000	[thread overview]
Message-ID: <20250127100617.00005e77@huawei.com> (raw)
In-Reply-To: <6793fa5351fc7_20f3294d0@dwillia2-xfh.jf.intel.com.notmuch>

On Fri, 24 Jan 2025 12:38:43 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> shiju.jose@ wrote:
> > From: Shiju Jose <shiju.jose@huawei.com>
> > 
> > CXL spec 3.1 section 8.2.9.9.11.1 describes the device patrol scrub control
> > feature. The device patrol scrub proactively locates and makes corrections
> > to errors in regular cycle.
> > 
> > Allow specifying the number of hours within which the patrol scrub must be
> > completed, subject to minimum and maximum limits reported by the device.
> > Also allow disabling scrub allowing trade-off error rates against
> > performance.
> > 
> > Add support for patrol scrub control on CXL memory devices.
> > Register with the EDAC device driver, which retrieves the scrub attribute
> > descriptors from EDAC scrub and exposes the sysfs scrub control attributes
> > to userspace. For example, scrub control for the CXL memory device
> > "cxl_mem0" is exposed in /sys/bus/edac/devices/cxl_mem0/scrubX/.
> > 
> > Additionally, add support for region-based CXL memory patrol scrub control.
> > CXL memory regions may be interleaved across one or more CXL memory
> > devices. For example, region-based scrub control for "cxl_region1" is
> > exposed in /sys/bus/edac/devices/cxl_region1/scrubX/.
> > 
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> > Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Hi Dan,

A few specific replies in line. I've left the detail stuff to Shiju
to address.  Definitely a few things in there I'd missed!

Thanks,

Jonathan


> > ---
> >  Documentation/edac/scrub.rst  |  66 ++++++
> >  drivers/cxl/Kconfig           |  17 ++
> >  drivers/cxl/core/Makefile     |   1 +
> >  drivers/cxl/core/memfeature.c | 392 ++++++++++++++++++++++++++++++++++
> >  drivers/cxl/core/region.c     |   6 +
> >  drivers/cxl/cxlmem.h          |   7 +
> >  drivers/cxl/mem.c             |   5 +
> >  include/cxl/features.h        |  16 ++
> >  8 files changed, 510 insertions(+)
> >  create mode 100644 drivers/cxl/core/memfeature.c
> > diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
> > index f86645c7f0af..80e986c57885 100644
> > --- a/Documentation/edac/scrub.rst
> > +++ b/Documentation/edac/scrub.rst
> > @@ -325,3 +325,69 @@ root@localhost:~# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_d
> >  10800
> >  
> >  root@localhost:~# echo 0 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/enable_background
> > +
> > +2. CXL memory device patrol scrubber
> > +
> > +2.1 Device based scrubbing
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/min_cycle_duration
> > +
> > +3600
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/max_cycle_duration
> > +
> > +918000
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
> > +
> > +43200
> > +
> > +root@localhost:~# echo 54000 > /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
> > +
> > +54000
> > +
> > +root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
> > +
> > +1
> > +
> > +root@localhost:~# echo 0 > /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
> > +
> > +0
> > +
> > +2.2. Region based scrubbing
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/min_cycle_duration
> > +
> > +3600
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/max_cycle_duration
> > +
> > +918000
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
> > +
> > +43200
> > +
> > +root@localhost:~# echo 54000 > /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
> > +
> > +54000
> > +
> > +root@localhost:~# echo 1 > /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
> > +
> > +1
> > +
> > +root@localhost:~# echo 0 > /sys/bus/edac/devices/cxl_region0/scrub0/enable_background
> > +
> > +root@localhost:~# cat /sys/bus/edac/devices/cxl_region0/scrub0/enable_background  
> 
> What is this content-free blob of cat and echo statements? Please write actual
> documentation with theory of operation, clarification of assumptions,
> rationale for defaults, guidance on changing defaults... 

Note this is a continuation of existing documentation, but sure some inline comments
talking more about it would be fine.  The rational and top level discussion is
meant to be described in patch 2 as it is not CXL specific.

Defaults are a device thing, there are no software driven defaults.

So I'd suggest the above just adds a few comments along the lines of
what the actions of each block does. 
Something like:

Check current parameters and program the scrubbing for a region to repeat
every X seconds (% of day)


> 
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index 0bc6a2cb8474..6078f02e883b 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -154,4 +154,21 @@ config CXL_FEATURES
> >  
> >  	  If unsure say 'y'.
> >  
> > +config CXL_RAS_FEATURES
> > +	tristate "CXL: Memory RAS features"
> > +	depends on CXL_PCI  
> 
> What is the build dependency on CXL_PCI? This enabling does not call back into
> symbols provided by cxl_pci.ko does it?
> 
> > +	depends on CXL_MEM  
> 
> Similar comment, and this also goes away if all of this just moves into
> the new cxl_features driver.

I'm not sure moving to be a child of cxl_features makes sense. Can
probably do it but it's making the spiders web of connections even harder
to relate to the underlying hardware. In my mental model, this stuff
takes services from 'features' and 'mailbox' parts of the CXL driver.

Take repair.   Less than half of each of those drivers is feature related
(a few 'what can I do' type aspects). The control plane goes via
maintenance commands.

Obviously we can get to those by adding additional infrastructure to the
features driver, but that seems likely to be ugly and where do we stop?
It won't scale to likely future cases where the feature part is a tiny
tweak on some much larger chunk of infrastructure (which is mostly what
spec defined features seem to be for). I don't think we want to support
the complexity of device built-in test in the 3.2 spec necessarily
(haven't really thought about it yet!) but it is an example of what would
be an EDAC feature but has no dependence on features.

We could register the patrol scrub stuff from features, and the rest
separately but that seems even more confusing.

So to me, nesting under features is an ugly solution but I'm not that
attached to current approach.

So in my view this stuff should be dependent on CXL_FEATURES but
not a child of it.


(lots skipped - I'll leave the detailed stuff to Shiju!)
> > +	else
> > +		snprintf(cxl_dev_name, sizeof(cxl_dev_name),
> > +			 "%s_%s", "cxl", dev_name(&cxlmd->dev));  
> 
> Can a "cxl" directory be created so that the raw name can be used?

I'd like feedback from Boris on that.  It is a mess to instantiate
devices in subdirectories under a bus (that's kind of the big issue
with the EDAC usage of the device model already).

I'd say no it can't.

> 


  reply	other threads:[~2025-01-27 10:06 UTC|newest]

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
2025-01-06 12:09 ` [PATCH v18 01/19] EDAC: Add support for EDAC device features control shiju.jose
2025-01-06 13:37   ` Borislav Petkov
2025-01-06 14:48     ` Shiju Jose
2025-01-13 15:06   ` Mauro Carvalho Chehab
2025-01-14  9:55     ` Jonathan Cameron
2025-01-14 10:08     ` Shiju Jose
2025-01-14 11:33       ` Mauro Carvalho Chehab
2025-01-30 19:18   ` Daniel Ferguson
2025-01-06 12:09 ` [PATCH v18 02/19] EDAC: Add scrub control feature shiju.jose
2025-01-06 15:57   ` Borislav Petkov
2025-01-06 19:34     ` Shiju Jose
2025-01-07  7:32       ` Borislav Petkov
2025-01-07  9:23         ` Shiju Jose
2025-01-08 15:47         ` Shiju Jose
2025-01-13 15:50   ` Mauro Carvalho Chehab
2025-01-30 19:18   ` Daniel Ferguson
2025-01-06 12:09 ` [PATCH v18 03/19] EDAC: Add ECS " shiju.jose
2025-01-13 16:09   ` Mauro Carvalho Chehab
2025-01-06 12:10 ` [PATCH v18 04/19] EDAC: Add memory repair " shiju.jose
2025-01-09  9:19   ` Borislav Petkov
2025-01-09 11:00     ` Shiju Jose
2025-01-09 12:32       ` Borislav Petkov
2025-01-09 14:24         ` Jonathan Cameron
2025-01-09 15:18           ` Borislav Petkov
2025-01-09 16:01             ` Jonathan Cameron
2025-01-09 16:19               ` Borislav Petkov
2025-01-09 18:34                 ` Jonathan Cameron
2025-01-09 23:51                   ` Dan Williams
2025-01-10 11:01                     ` Jonathan Cameron
2025-01-10 22:49                       ` Dan Williams
2025-01-13 11:40                         ` Jonathan Cameron
2025-01-14 19:35                           ` Dan Williams
2025-01-15 10:07                             ` Jonathan Cameron
2025-01-15 11:35                             ` Mauro Carvalho Chehab
2025-01-11 17:12                   ` Borislav Petkov
2025-01-13 11:07                     ` Jonathan Cameron
2025-01-21 16:16                       ` Borislav Petkov
2025-01-21 18:16                         ` Jonathan Cameron
2025-01-22 19:09                           ` Borislav Petkov
2025-02-06 13:39                             ` Jonathan Cameron
2025-02-17 13:23                               ` Borislav Petkov
2025-02-18 16:51                                 ` Jonathan Cameron
2025-02-19 18:45                                   ` Borislav Petkov
2025-02-20 12:19                                     ` Jonathan Cameron
2025-01-14 13:10                   ` Mauro Carvalho Chehab
2025-01-14 12:57               ` Mauro Carvalho Chehab
2025-01-14 12:38           ` Mauro Carvalho Chehab
2025-01-14 13:05             ` Jonathan Cameron
2025-01-14 14:39               ` Mauro Carvalho Chehab
2025-01-14 11:47   ` Mauro Carvalho Chehab
2025-01-14 12:31     ` Shiju Jose
2025-01-14 14:26       ` Mauro Carvalho Chehab
2025-01-14 13:47   ` Mauro Carvalho Chehab
2025-01-14 14:30     ` Shiju Jose
2025-01-15 12:03       ` Mauro Carvalho Chehab
2025-01-06 12:10 ` [PATCH v18 05/19] ACPI:RAS2: Add ACPI RAS2 driver shiju.jose
2025-01-21 23:01   ` Daniel Ferguson
2025-01-22 15:38     ` Shiju Jose
2025-01-30 19:19   ` Daniel Ferguson
2025-01-06 12:10 ` [PATCH v18 06/19] ras: mem: Add memory " shiju.jose
2025-01-21 23:01   ` Daniel Ferguson
2025-01-30 19:19   ` Daniel Ferguson
2025-01-06 12:10 ` [PATCH v18 07/19] cxl: Refactor user ioctl command path from mds to mailbox shiju.jose
2025-01-06 12:10 ` [PATCH v18 08/19] cxl: Add skeletal features driver shiju.jose
2025-01-06 12:10 ` [PATCH v18 09/19] cxl: Enumerate feature commands shiju.jose
2025-01-06 12:10 ` [PATCH v18 10/19] cxl: Add Get Supported Features command for kernel usage shiju.jose
2025-01-06 12:10 ` [PATCH v18 11/19] cxl: Add features driver attribute to emit number of features supported shiju.jose
2025-01-06 12:10 ` [PATCH v18 12/19] cxl/mbox: Add GET_FEATURE mailbox command shiju.jose
2025-01-06 12:10 ` [PATCH v18 13/19] cxl/mbox: Add SET_FEATURE " shiju.jose
2025-01-06 12:10 ` [PATCH v18 14/19] cxl: Setup exclusive CXL features that are reserved for the kernel shiju.jose
2025-01-06 12:10 ` [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol scrub control feature shiju.jose
2025-01-24 20:38   ` Dan Williams
2025-01-27 10:06     ` Jonathan Cameron [this message]
2025-01-27 12:53     ` Shiju Jose
2025-01-27 23:17       ` Dan Williams
2025-01-29 12:28         ` Shiju Jose
2025-01-06 12:10 ` [PATCH v18 16/19] cxl/memfeature: Add CXL memory device ECS " shiju.jose
2025-01-06 12:10 ` [PATCH v18 17/19] cxl/mbox: Add support for PERFORM_MAINTENANCE mailbox command shiju.jose
2025-01-06 12:10 ` [PATCH v18 18/19] cxl/memfeature: Add CXL memory device soft PPR control feature shiju.jose
2025-01-06 12:10 ` [PATCH v18 19/19] cxl/memfeature: Add CXL memory device memory sparing " shiju.jose
2025-01-13 14:46 ` [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers Mauro Carvalho Chehab
2025-01-13 15:36   ` Jonathan Cameron
2025-01-14 14:06     ` Mauro Carvalho Chehab
2025-01-13 18:15   ` Shiju Jose
2025-01-30 19:18 ` Daniel Ferguson
2025-02-03  9:25   ` Shiju Jose

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250127100617.00005e77@huawei.com \
    --to=jonathan.cameron@huawei.com \
    --cc=Jon.Grimm@amd.com \
    --cc=Vilas.Sridharan@amd.com \
    --cc=Yazen.Ghannam@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=bp@alien8.de \
    --cc=dan.j.williams@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=david@redhat.com \
    --cc=dferguson@amperecomputing.com \
    --cc=duenwen@google.com \
    --cc=erdemaktas@google.com \
    --cc=gthelen@google.com \
    --cc=ira.weiny@intel.com \
    --cc=james.morse@arm.com \
    --cc=jiaqiyan@google.com \
    --cc=jthoughton@google.com \
    --cc=kangkang.shen@futurewei.com \
    --cc=lenb@kernel.org \
    --cc=leo.duran@amd.com \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxarm@huawei.com \
    --cc=mchehab@kernel.org \
    --cc=naoya.horiguchi@nec.com \
    --cc=nifan.cxl@gmail.com \
    --cc=pgonda@google.com \
    --cc=prime.zeng@hisilicon.com \
    --cc=rafael@kernel.org \
    --cc=rientjes@google.com \
    --cc=roberto.sassu@huawei.com \
    --cc=shiju.jose@huawei.com \
    --cc=somasundaram.a@hpe.com \
    --cc=tanxiaofei@huawei.com \
    --cc=tony.luck@intel.com \
    --cc=vishal.l.verma@intel.com \
    --cc=wanghuiqiang@huawei.com \
    --cc=wbs@os.amperecomputing.com \
    --cc=wschwartz@amperecomputing.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).