linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD)
@ 2024-08-16 14:44 Ira Weiny
  2024-08-16 14:44 ` [PATCH v3 01/25] range: Add range_overlaps() Ira Weiny
                   ` (24 more replies)
  0 siblings, 25 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm, Johannes Thumshirn, Li, Ming, Jonathan Cameron

A git tree of this series can be found here:

	https://github.com/weiny2/linux-kernel/tree/dcd-v4-2024-08-16

This series requires the CXL memory notifier lock change:

	https://lore.kernel.org/all/20240814-fix-notifiers-v2-1-6bab38192c7c@intel.com/

Background
==========

A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
device that allows memory capacity within a region to change
dynamically without the need for resetting the device, reconfiguring
HDM decoders, or reconfiguring software DAX regions.

One of the biggest use cases for Dynamic Capacity is to allow hosts to
share memory dynamically within a data center without increasing the
per-host attached memory.

The general flow for the addition or removal of memory is to have an
orchestrator coordinate the use of the memory.  Generally there are 5
actors in such a system, the Orchestrator, Fabric Manager, the Logical
device, the Host Kernel, and a Host User.

Typical work flows are shown below.

Orchestrator      FM         Device       Host Kernel    Host User

    |             |           |            |              |
    |-------------- Create region ----------------------->|
    |             |           |            |              |
    |             |           |            |<-- Create ---|
    |             |           |            |    Region    |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |-- Add ----->|-- Add --->|--- Add --->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Accept -|<- Accept  -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |<- Create --->|
    |             |           |            |   DAX dev    |-- Use memory
    |             |           |            |              |   |
    |             |           |            |              |   |
    |             |           |            |<- Release ---| <-+
    |             |           |            |   DAX dev    |
    |             |           |            |              |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |-- Remove -->|- Release->|- Release ->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Release-|<- Release -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |              |
    |-- Add ----->|-- Add --->|--- Add --->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Accept -|<- Accept  -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |<- Create ----|
    |             |           |            |   DAX dev    |-- Use memory
    |             |           |            |              |   |
    |             |           |            |<- Release ---| <-+
    |             |           |            |   DAX dev    |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |-- Remove -->|- Release->|- Release ->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Release-|<- Release -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |              |
    |-- Add ----->|-- Add --->|--- Add --->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |<- Create ----|
    |             |           |            |   DAX dev    |-- Use memory
    |             |           |            |              |   |
    |-- Remove -->|- Release->|- Release ->|              |   |
    |  Capacity   |  Extent   |   Extent   |              |   |
    |             |           |            |              |   |
    |             |           |     (Release Ignored)     |   |
    |             |           |            |              |   |
    |             |           |            |<- Release ---| <-+
    |             |           |            |   DAX dev    |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |             |- Release->|- Release ->|              |
    |             |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Release-|<- Release -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |<- Destroy ---|
    |             |           |            |   Region     |
    |             |           |            |              |

Previous versions of this series[0] resulted in architectural comments
as well as confusion on the architecture based on the organization of
patch series itself.

This version has reordered the patches to clarify the architecture.
It also streamlines extent handling more.

The series still requires the creation of regions and DAX devices to be
synchronized with the Orchestrator and Fabric Manager.  The host kernel
will reject an add extent event if the region is not created yet.  It
will also ignore a release if the DAX device is created and referencing
an extent.

These synchronizations are not anticipated to be an issue with real
applications.

In order to allow for capacity to be added and removed a new concept of
a sparse DAX region is introduced.  A sparse DAX region may have 0 or
more bytes of available space.  The total space depends on the number
and size of the extents which have been added.

Initially it is anticipated that users of the memory will carefully
coordinate the surfacing of additional capacity with the creation of DAX
devices which use that capacity.  Therefore, the allocation of the
memory to DAX devices does not allow for specific associations between
DAX device and extent.  This keeps allocations very similar to existing
DAX region behavior.

Great care was taken to keep the extent tracking simple.  Some xarray's
needed to be added but extra software objects were kept to a minimum.

Region extents continue to be tracked as sub-devices of the DAX region.
This ensures that region destruction cleans up all extent allocations
properly.

Due to these major changes all reviews were removed from the larger
patches.  A few of the straight forward patches have kept the tags.

In summary the major functionality of this series includes:

- Getting the dynamic capacity (DC) configuration information from cxl
  devices

- Configuring the DC partitions reported by hardware

- Enhancing the CXL and DAX regions for dynamic capacity support
	a. Maintain a logical separation between hardware extents and
	   software managed region extents.  This provides an
	   abstraction between the layers and should allow for
	   interleaving in the future

- Get hardware extent lists for endpoint decoders upon
  region creation.

- Adjust extent/region memory available on the following events.
        a. Add capacity Events
	b. Release capacity events

- Host response for add capacity
	a. do not accept the extent if:
		If the region does not exist
		or an error occurs realizing the extent
	b. If the region does exist
		realize a DAX region extent with 1:1 mapping (no
		interleave yet)
	c. Support the more bit by processing a list of extents marked
	   with the more bit together before setting up a response.

- Host response for remove capacity
	a. If no DAX device references the extent; release the extent
	b. If a reference does exist, ignore the request.
	   (Require FM to issue release again.)

- Modify DAX device creation/resize to account for extents within a
  sparse DAX region

- Trace Dynamic Capacity events for debugging

- Add cxl-test infrastructure to allow for faster unit testing
  (See new ndctl branch for cxl-dcd.sh test[1])

Fan Ni's upstream of Qemu DCD was used for testing.

Remaining work:

	1) Integrate the QoS work from Dave Jiang
	2) Interleave support

Possible additional work depending on requirements:

	1) Allow mapping to specific extents (perhaps based on
	   label/tag)
	2) Release extents when DAX devices are released if a release
	   was previously seen from the device
	3) Accept a new extent which extends (but overlaps) an existing
	   extent(s)
	4) Rework DAX device interfaces, memfd has been explored a bit

[0] v1: https://lore.kernel.org/all/20240324-dcd-type2-upstream-v1-0-b7b00d623625@intel.com/
[1] https://github.com/weiny2/ndctl/tree/dcd-region2-2024-08-15

---
Major changes:
- Jonathan: support the more bit
- djbw: Allow more than 1 region per DC partition
- All: Address the many comments on the series.
- iweiny: rebase
- iweiny: Rework the series to make it easier to review and understand
          the flow
- Link to v1: https://lore.kernel.org/r/20240324-dcd-type2-upstream-v1-0-b7b00d623625@intel.com
- Link to v2: https://lore.kernel.org/all/20240816-dcd-type2-upstream-v2-0-20189a10ad7d@intel.com/

---
Ira Weiny (11):
      range: Add range_overlaps()
      printk: Add print format (%par) for struct range
      dax: Document dax dev range tuple
      cxl/pci: Delay event buffer allocation
      cxl/region: Refactor common create region code
      cxl/events: Split event msgnum configuration from irq setup
      cxl/pci: Factor out interrupt policy check
      cxl/core: Return endpoint decoder information from region search
      dax/bus: Factor out dev dax resize logic
      tools/testing/cxl: Make event logs dynamic
      tools/testing/cxl: Add DC Regions to mock mem data

Navneet Singh (14):
      cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
      cxl/mem: Read dynamic capacity configuration from the device
      cxl/core: Separate region mode from decoder mode
      cxl/region: Add dynamic capacity decoder and region modes
      cxl/hdm: Add dynamic capacity size support to endpoint decoders
      cxl/port: Add endpoint decoder DC mode support to sysfs
      cxl/mem: Expose DCD partition capabilities in sysfs
      cxl/region: Add sparse DAX region support
      cxl/mem: Configure dynamic capacity interrupts
      cxl/extent: Process DCD events and realize region extents
      cxl/region/extent: Expose region extent information in sysfs
      dax/region: Create resources on sparse DAX regions
      cxl/region: Read existing extents on region creation
      cxl/mem: Trace Dynamic capacity Event Record

 Documentation/ABI/testing/sysfs-bus-cxl   |  68 ++-
 Documentation/core-api/printk-formats.rst |  14 +
 drivers/cxl/core/Makefile                 |   2 +-
 drivers/cxl/core/core.h                   |  33 +-
 drivers/cxl/core/extent.c                 | 467 ++++++++++++++
 drivers/cxl/core/hdm.c                    | 206 ++++++-
 drivers/cxl/core/mbox.c                   | 578 +++++++++++++++++-
 drivers/cxl/core/memdev.c                 | 101 ++-
 drivers/cxl/core/port.c                   |  13 +-
 drivers/cxl/core/region.c                 | 173 ++++--
 drivers/cxl/core/trace.h                  |  65 ++
 drivers/cxl/cxl.h                         | 122 +++-
 drivers/cxl/cxlmem.h                      | 128 +++-
 drivers/cxl/pci.c                         | 123 +++-
 drivers/dax/bus.c                         | 352 +++++++++--
 drivers/dax/bus.h                         |   4 +-
 drivers/dax/cxl.c                         |  73 ++-
 drivers/dax/dax-private.h                 |  39 +-
 drivers/dax/hmem/hmem.c                   |   2 +-
 drivers/dax/pmem.c                        |   2 +-
 fs/btrfs/ordered-data.c                   |  10 +-
 include/linux/cxl-event.h                 |  32 +
 include/linux/range.h                     |   7 +
 lib/vsprintf.c                            |  37 ++
 tools/testing/cxl/Kbuild                  |   3 +-
 tools/testing/cxl/test/mem.c              | 981 ++++++++++++++++++++++++++----
 26 files changed, 3327 insertions(+), 308 deletions(-)
---
base-commit: 3cef9316df4cda21b5bf25e4230221b02050dfa1
change-id: 20230604-dcd-type2-upstream-0cd15f6216fd

Best regards,
-- 
Ira Weiny <ira.weiny@intel.com>


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v3 01/25] range: Add range_overlaps()
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
@ 2024-08-16 14:44 ` Ira Weiny
  2024-08-16 14:44 ` [PATCH v3 02/25] printk: Add print format (%par) for struct range Ira Weiny
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm, Johannes Thumshirn

Code to support CXL Dynamic Capacity devices will have extent ranges
which need to be compared for intersection not a subset as is being
checked in range_contains().

range_overlaps() is defined in btrfs with a different meaning from what
is required in the standard range code.  Dan Williams pointed this out
in [1].  Adjust the btrfs call according to his suggestion there.

Then add a generic range_overlaps().

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org

Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

[1] https://lore.kernel.org/all/65949f79ef908_8dc68294f2@dwillia2-xfh.jf.intel.com.notmuch/
---
 fs/btrfs/ordered-data.c | 10 +++++-----
 include/linux/range.h   |  7 +++++++
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 82a68394a89c..37164cc44a25 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -111,8 +111,8 @@ static struct rb_node *__tree_search(struct rb_root *root, u64 file_offset,
 	return NULL;
 }
 
-static int range_overlaps(struct btrfs_ordered_extent *entry, u64 file_offset,
-			  u64 len)
+static int btrfs_range_overlaps(struct btrfs_ordered_extent *entry, u64 file_offset,
+				u64 len)
 {
 	if (file_offset + len <= entry->file_offset ||
 	    entry->file_offset + entry->num_bytes <= file_offset)
@@ -985,7 +985,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
 
 	while (1) {
 		entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
-		if (range_overlaps(entry, file_offset, len))
+		if (btrfs_range_overlaps(entry, file_offset, len))
 			break;
 
 		if (entry->file_offset >= file_offset + len) {
@@ -1114,12 +1114,12 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
 	}
 	if (prev) {
 		entry = rb_entry(prev, struct btrfs_ordered_extent, rb_node);
-		if (range_overlaps(entry, file_offset, len))
+		if (btrfs_range_overlaps(entry, file_offset, len))
 			goto out;
 	}
 	if (next) {
 		entry = rb_entry(next, struct btrfs_ordered_extent, rb_node);
-		if (range_overlaps(entry, file_offset, len))
+		if (btrfs_range_overlaps(entry, file_offset, len))
 			goto out;
 	}
 	/* No ordered extent in the range */
diff --git a/include/linux/range.h b/include/linux/range.h
index 6ad0b73cb7ad..9a46f3212965 100644
--- a/include/linux/range.h
+++ b/include/linux/range.h
@@ -13,11 +13,18 @@ static inline u64 range_len(const struct range *range)
 	return range->end - range->start + 1;
 }
 
+/* True if r1 completely contains r2 */
 static inline bool range_contains(struct range *r1, struct range *r2)
 {
 	return r1->start <= r2->start && r1->end >= r2->end;
 }
 
+/* True if any part of r1 overlaps r2 */
+static inline bool range_overlaps(struct range *r1, struct range *r2)
+{
+	return r1->start <= r2->end && r1->end >= r2->start;
+}
+
 int add_range(struct range *range, int az, int nr_range,
 		u64 start, u64 end);
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
  2024-08-16 14:44 ` [PATCH v3 01/25] range: Add range_overlaps() Ira Weiny
@ 2024-08-16 14:44 ` Ira Weiny
  2024-08-20 14:08   ` Petr Mladek
  2024-08-16 14:44 ` [PATCH v3 03/25] dax: Document dax dev range tuple Ira Weiny
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 120+ messages in thread
From: Ira Weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

The use of struct range in the CXL subsystem is growing.  In particular,
the addition of Dynamic Capacity devices uses struct range in a number
of places which are reported in debug and error messages.

To wit requiring the printing of the start/end fields in each print
became cumbersome.  Dan Williams mentions in [1] that it might be time
to have a print specifier for struct range similar to struct resource

A few alternatives were considered including '%pn' for 'print raNge' but
%par follows that struct range is most often used to store a range of
physical addresses.  So use '%par' for 'print address range'.

To: Petr Mladek <pmladek@suse.com> (maintainer:VSPRINTF)
To: Steven Rostedt <rostedt@goodmis.org> (maintainer:VSPRINTF)
To: Jonathan Corbet <corbet@lwn.net> (maintainer:DOCUMENTATION)
Cc: linux-doc@vger.kernel.org (open list:DOCUMENTATION)
Cc: linux-kernel@vger.kernel.org (open list)
Link: https://lore.kernel.org/all/663922b475e50_d54d72945b@dwillia2-xfh.jf.intel.com.notmuch/ [1]
Suggested-by: "Dan Williams" <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 Documentation/core-api/printk-formats.rst | 14 ++++++++++++
 lib/vsprintf.c                            | 37 +++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst
index 4451ef501936..a02ef899b2a6 100644
--- a/Documentation/core-api/printk-formats.rst
+++ b/Documentation/core-api/printk-formats.rst
@@ -231,6 +231,20 @@ width of the CPU data path.
 
 Passed by reference.
 
+Struct Range
+------------
+
+::
+
+	%par	[range 0x60000000-0x6fffffff] or
+		[range 0x0000000060000000-0x000000006fffffff]
+
+For printing struct range.  A variation of printing a physical address is to
+print the value of struct range which are often used to hold a physical address
+range.
+
+Passed by reference.
+
 DMA address types dma_addr_t
 ----------------------------
 
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 2d71b1115916..c132178fac07 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -1140,6 +1140,39 @@ char *resource_string(char *buf, char *end, struct resource *res,
 	return string_nocheck(buf, end, sym, spec);
 }
 
+static noinline_for_stack
+char *range_string(char *buf, char *end, const struct range *range,
+		      struct printf_spec spec, const char *fmt)
+{
+#define RANGE_PRINTK_SIZE		16
+#define RANGE_DECODED_BUF_SIZE		((2 * sizeof(struct range)) + 4)
+#define RANGE_PRINT_BUF_SIZE		sizeof("[range - ]")
+	char sym[RANGE_DECODED_BUF_SIZE + RANGE_PRINT_BUF_SIZE];
+	char *p = sym, *pend = sym + sizeof(sym);
+
+	static const struct printf_spec str_spec = {
+		.field_width = -1,
+		.precision = 10,
+		.flags = LEFT,
+	};
+	static const struct printf_spec range_spec = {
+		.base = 16,
+		.field_width = RANGE_PRINTK_SIZE,
+		.precision = -1,
+		.flags = SPECIAL | SMALL | ZEROPAD,
+	};
+
+	*p++ = '[';
+	p = string_nocheck(p, pend, "range ", str_spec);
+	p = number(p, pend, range->start, range_spec);
+	*p++ = '-';
+	p = number(p, pend, range->end, range_spec);
+	*p++ = ']';
+	*p = '\0';
+
+	return string_nocheck(buf, end, sym, spec);
+}
+
 static noinline_for_stack
 char *hex_string(char *buf, char *end, u8 *addr, struct printf_spec spec,
 		 const char *fmt)
@@ -1802,6 +1835,8 @@ char *address_val(char *buf, char *end, const void *addr,
 		return buf;
 
 	switch (fmt[1]) {
+	case 'r':
+		return range_string(buf, end, addr, spec, fmt);
 	case 'd':
 		num = *(const dma_addr_t *)addr;
 		size = sizeof(dma_addr_t);
@@ -2364,6 +2399,8 @@ char *rust_fmt_argument(char *buf, char *end, void *ptr);
  *            to use print_hex_dump() for the larger input.
  * - 'a[pd]' For address types [p] phys_addr_t, [d] dma_addr_t and derivatives
  *           (default assumed to be phys_addr_t, passed by reference)
+ * - 'ar' For decoded struct ranges (a variation of physical address which are
+ *        most often stored in struct ranges.
  * - 'd[234]' For a dentry name (optionally 2-4 last components)
  * - 'D[234]' Same as 'd' but for a struct file
  * - 'g' For block_device name (gendisk + partition number)

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 03/25] dax: Document dax dev range tuple
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
  2024-08-16 14:44 ` [PATCH v3 01/25] range: Add range_overlaps() Ira Weiny
  2024-08-16 14:44 ` [PATCH v3 02/25] printk: Add print format (%par) for struct range Ira Weiny
@ 2024-08-16 14:44 ` Ira Weiny
  2024-08-16 20:58   ` Dave Jiang
  2024-08-23 15:29   ` Jonathan Cameron
  2024-08-16 14:44 ` [PATCH v3 04/25] cxl/pci: Delay event buffer allocation Ira Weiny
                   ` (21 subsequent siblings)
  24 siblings, 2 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

The device DAX structure is being enhanced to track additional DCD
information.

The current range tuple was not fully documented.  Document it prior to
adding information for DC.

Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: move to start of series]
---
 drivers/dax/dax-private.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 446617b73aea..ccde98c3d4e2 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -58,7 +58,10 @@ struct dax_mapping {
  * @dev - device core
  * @pgmap - pgmap for memmap setup / lifetime (driver owned)
  * @nr_range: size of @ranges
- * @ranges: resource-span + pgoff tuples for the instance
+ * @ranges: range tuples of memory used
+ * @pgoff: page offset
+ * @range: resource-span
+ * @mapping: device to assist in interrogating the range layout
  */
 struct dev_dax {
 	struct dax_region *region;

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 04/25] cxl/pci: Delay event buffer allocation
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (2 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 03/25] dax: Document dax dev range tuple Ira Weiny
@ 2024-08-16 14:44 ` Ira Weiny
  2024-09-03  6:49   ` Li, Ming4
  2024-09-05 19:44   ` Fan Ni
  2024-08-16 14:44 ` [PATCH v3 05/25] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD) ira.weiny
                   ` (20 subsequent siblings)
  24 siblings, 2 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

The event buffer does not need to be allocated if something has failed in
setting up event irq's.

In prep for adjusting event configuration for DCD events move the buffer
allocation to the end of the event configuration.

Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: keep tags for early simple patch]
[Davidlohr, Jonathan, djiang: move to beginning of series]
	[Dave feel free to pick this up if you like]
---
 drivers/cxl/pci.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 4be35dc22202..3a60cd66263e 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -760,10 +760,6 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
 		return 0;
 	}
 
-	rc = cxl_mem_alloc_event_buf(mds);
-	if (rc)
-		return rc;
-
 	rc = cxl_event_get_int_policy(mds, &policy);
 	if (rc)
 		return rc;
@@ -777,6 +773,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
 		return -EBUSY;
 	}
 
+	rc = cxl_mem_alloc_event_buf(mds);
+	if (rc)
+		return rc;
+
 	rc = cxl_event_irqsetup(mds);
 	if (rc)
 		return rc;

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 05/25] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (3 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 04/25] cxl/pci: Delay event buffer allocation Ira Weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-09-03  6:50   ` Li, Ming4
  2024-08-16 14:44 ` [PATCH v3 06/25] cxl/mem: Read dynamic capacity configuration from the device ira.weiny
                   ` (19 subsequent siblings)
  24 siblings, 1 reply; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

From: Navneet Singh <navneet.singh@intel.com>

Per the CXL 3.1 specification software must check the Command Effects
Log (CEL) for dynamic capacity command support.

Detect support for the DCD commands while reading the CEL, including:

	Get DC Config
	Get DC Extent List
	Add DC Response
	Release DC

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: Keep tags for this early simple patch]
[Davidlohr: update commit message]
[djiang: Fix misalignment]
---
 drivers/cxl/core/mbox.c | 33 +++++++++++++++++++++++++++++++++
 drivers/cxl/cxlmem.h    | 15 +++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index e5cdeafdf76e..8eb196858abe 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -164,6 +164,34 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
 	}
 }
 
+static bool cxl_is_dcd_command(u16 opcode)
+{
+#define CXL_MBOX_OP_DCD_CMDS 0x48
+
+	return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
+}
+
+static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
+				    u16 opcode)
+{
+	switch (opcode) {
+	case CXL_MBOX_OP_GET_DC_CONFIG:
+		set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
+		break;
+	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+		set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
+		break;
+	case CXL_MBOX_OP_ADD_DC_RESPONSE:
+		set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
+		break;
+	case CXL_MBOX_OP_RELEASE_DC:
+		set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
+		break;
+	default:
+		break;
+	}
+}
+
 static bool cxl_is_poison_command(u16 opcode)
 {
 #define CXL_MBOX_OP_POISON_CMDS 0x43
@@ -745,6 +773,11 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
 			enabled++;
 		}
 
+		if (cxl_is_dcd_command(opcode)) {
+			cxl_set_dcd_cmd_enabled(mds, opcode);
+			enabled++;
+		}
+
 		dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
 			enabled ? "enabled" : "unsupported by driver");
 	}
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index afb53d058d62..f2f8b567e0e7 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -238,6 +238,15 @@ struct cxl_event_state {
 	struct mutex log_lock;
 };
 
+/* Device enabled DCD commands */
+enum dcd_cmd_enabled_bits {
+	CXL_DCD_ENABLED_GET_CONFIG,
+	CXL_DCD_ENABLED_GET_EXTENT_LIST,
+	CXL_DCD_ENABLED_ADD_RESPONSE,
+	CXL_DCD_ENABLED_RELEASE,
+	CXL_DCD_ENABLED_MAX
+};
+
 /* Device enabled poison commands */
 enum poison_cmd_enabled_bits {
 	CXL_POISON_ENABLED_LIST,
@@ -454,6 +463,7 @@ struct cxl_dev_state {
  *                (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
  * @mbox_mutex: Mutex to synchronize mailbox access.
  * @firmware_version: Firmware version for the memory device.
+ * @dcd_cmds: List of DCD commands implemented by memory device
  * @enabled_cmds: Hardware commands found enabled in CEL.
  * @exclusive_cmds: Commands that are kernel-internal only
  * @total_bytes: sum of all possible capacities
@@ -482,6 +492,7 @@ struct cxl_memdev_state {
 	size_t lsa_size;
 	struct mutex mbox_mutex; /* Protects device mailbox and firmware */
 	char firmware_version[0x10];
+	DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
 	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
 	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
 	u64 total_bytes;
@@ -555,6 +566,10 @@ enum cxl_opcode {
 	CXL_MBOX_OP_UNLOCK		= 0x4503,
 	CXL_MBOX_OP_FREEZE_SECURITY	= 0x4504,
 	CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE	= 0x4505,
+	CXL_MBOX_OP_GET_DC_CONFIG	= 0x4800,
+	CXL_MBOX_OP_GET_DC_EXTENT_LIST	= 0x4801,
+	CXL_MBOX_OP_ADD_DC_RESPONSE	= 0x4802,
+	CXL_MBOX_OP_RELEASE_DC		= 0x4803,
 	CXL_MBOX_OP_MAX			= 0x10000
 };
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 06/25] cxl/mem: Read dynamic capacity configuration from the device
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (4 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 05/25] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD) ira.weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-08-16 21:45   ` Dave Jiang
  2024-08-23 15:45   ` Jonathan Cameron
  2024-08-16 14:44 ` [PATCH v3 07/25] cxl/core: Separate region mode from decoder mode ira.weiny
                   ` (18 subsequent siblings)
  24 siblings, 2 replies; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm, Li, Ming

From: Navneet Singh <navneet.singh@intel.com>

Devices which optionally support Dynamic Capacity (DC) are configured
via mailbox commands.  CXL 3.1 requires the host to issue the Get DC
Configuration command in order to properly configure DCDs.  Without the
Get DC Configuration command DCD can't be supported.

Implement the DC mailbox commands as specified in CXL 3.1 section
8.2.9.9.9 (opcodes 48XXh) to read and store the DCD configuration
information.  Disable DCD if DCD is not supported.  Leverage the Get DC
Configuration command supported bit to indicate if DCD support.

Linux has no use for the trailing fields of the Get Dynamic Capacity
Configuration Output Payload (Total number of supported extents, number
of available extents, total number of supported tags, and number of
available tags).  Avoid defining those fields to use the more useful
dynamic C array.

Cc: "Li, Ming" <ming4.li@intel.com>
Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[Li, Ming: Fix bug in total_bytes calculation]
[iweiny: update commit message]
[Jonathan: fix formatting]
[Jonathan: Define block line size]
[Jonathan/Fan: use regions returned field instead of macro in get config]
[Jørgen: Rename memdev state range variables]
[Jonathan: adjust use of rc in cxl_dev_dynamic_capacity_identify()]
[Jonathan: white space cleanup]
[fan: make a comment about the trailing configuration output fields]
---
 drivers/cxl/core/mbox.c | 171 +++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/cxl/cxlmem.h    |  64 +++++++++++++++++-
 drivers/cxl/pci.c       |   4 ++
 3 files changed, 237 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 8eb196858abe..68c26c4be91a 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1157,7 +1157,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
 	if (rc < 0)
 		return rc;
 
-	mds->total_bytes =
+	mds->static_bytes =
 		le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
 	mds->volatile_only_bytes =
 		le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
@@ -1264,6 +1264,159 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
 	return rc;
 }
 
+static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, u8 index,
+				   struct cxl_dc_region_config *region_config)
+{
+	struct cxl_dc_region_info *dcr = &mds->dc_region[index];
+	struct device *dev = mds->cxlds.dev;
+
+	dcr->base = le64_to_cpu(region_config->region_base);
+	dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
+	dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
+	dcr->len = le64_to_cpu(region_config->region_length);
+	dcr->blk_size = le64_to_cpu(region_config->region_block_size);
+	dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
+	dcr->flags = region_config->flags;
+	snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
+
+	/* Check regions are in increasing DPA order */
+	if (index > 0) {
+		struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
+
+		if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
+			dev_err(dev,
+				"DPA ordering violation for DC region %d and %d\n",
+				index - 1, index);
+			return -EINVAL;
+		}
+	}
+
+	if (!IS_ALIGNED(dcr->base, SZ_256M) ||
+	    !IS_ALIGNED(dcr->base, dcr->blk_size)) {
+		dev_err(dev, "DC region %d invalid base %#llx blk size %#llx\n",
+			index, dcr->base, dcr->blk_size);
+		return -EINVAL;
+	}
+
+	if (dcr->decode_len == 0 || dcr->len == 0 || dcr->decode_len < dcr->len ||
+	    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
+		dev_err(dev, "DC region %d invalid length; decode %#llx len %#llx blk size %#llx\n",
+			index, dcr->decode_len, dcr->len, dcr->blk_size);
+		return -EINVAL;
+	}
+
+	if (dcr->blk_size == 0 || dcr->blk_size % CXL_DCD_BLOCK_LINE_SIZE ||
+	    !is_power_of_2(dcr->blk_size)) {
+		dev_err(dev, "DC region %d invalid block size; %#llx\n",
+			index, dcr->blk_size);
+		return -EINVAL;
+	}
+
+	dev_dbg(dev,
+		"DC region %s base %#llx length %#llx block size %#llx\n",
+		dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
+
+	return 0;
+}
+
+/* Returns the number of regions in dc_resp or -ERRNO */
+static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
+			     struct cxl_mbox_get_dc_config_out *dc_resp,
+			     size_t dc_resp_size)
+{
+	struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
+		.region_count = CXL_MAX_DC_REGION,
+		.start_region_index = start_region,
+	};
+	struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
+		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
+		.payload_in = &get_dc,
+		.size_in = sizeof(get_dc),
+		.size_out = dc_resp_size,
+		.payload_out = dc_resp,
+		.min_out = 1,
+	};
+	struct device *dev = mds->cxlds.dev;
+	int rc;
+
+	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+	if (rc < 0)
+		return rc;
+
+	dev_dbg(dev, "Read %d/%d DC regions\n",
+		dc_resp->regions_returned, dc_resp->avail_region_count);
+	return dc_resp->regions_returned;
+}
+
+/**
+ * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
+ *					 information from the device.
+ * @mds: The memory device state
+ *
+ * Read Dynamic Capacity information from the device and populate the state
+ * structures for later use.
+ *
+ * Return: 0 if identify was executed successfully, -ERRNO on error.
+ */
+int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
+{
+	size_t dc_resp_size = mds->payload_size;
+	struct device *dev = mds->cxlds.dev;
+	u8 start_region, i;
+
+	for (i = 0; i < CXL_MAX_DC_REGION; i++)
+		snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
+
+	if (!cxl_dcd_supported(mds)) {
+		dev_dbg(dev, "DCD not supported\n");
+		return 0;
+	}
+
+	struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
+					kvmalloc(dc_resp_size, GFP_KERNEL);
+	if (!dc_resp)
+		return -ENOMEM;
+
+	start_region = 0;
+	do {
+		int rc, j;
+
+		rc = cxl_get_dc_config(mds, start_region, dc_resp, dc_resp_size);
+		if (rc < 0) {
+			dev_dbg(dev, "Failed to get DC config: %d\n", rc);
+			return rc;
+		}
+
+		mds->nr_dc_region += rc;
+
+		if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
+			dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
+				mds->nr_dc_region);
+			return -EINVAL;
+		}
+
+		for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
+			rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
+			if (rc) {
+				dev_dbg(dev, "Failed to save region info: %d\n", rc);
+				return rc;
+			}
+		}
+
+		start_region = mds->nr_dc_region;
+
+	} while (mds->nr_dc_region < dc_resp->avail_region_count);
+
+	mds->dynamic_bytes =
+		mds->dc_region[mds->nr_dc_region - 1].base +
+		mds->dc_region[mds->nr_dc_region - 1].decode_len -
+		mds->dc_region[0].base;
+	dev_dbg(dev, "Total dynamic range: %#llx\n", mds->dynamic_bytes);
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
+
 static int add_dpa_res(struct device *dev, struct resource *parent,
 		       struct resource *res, resource_size_t start,
 		       resource_size_t size, const char *type)
@@ -1294,8 +1447,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
 {
 	struct cxl_dev_state *cxlds = &mds->cxlds;
 	struct device *dev = cxlds->dev;
+	size_t untenanted_mem;
 	int rc;
 
+	mds->total_bytes = mds->static_bytes;
+	if (mds->nr_dc_region) {
+		untenanted_mem = mds->dc_region[0].base - mds->static_bytes;
+		mds->total_bytes += untenanted_mem + mds->dynamic_bytes;
+	}
+
 	if (!cxlds->media_ready) {
 		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
 		cxlds->ram_res = DEFINE_RES_MEM(0, 0);
@@ -1305,6 +1465,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
 
 	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
 
+	for (int i = 0; i < mds->nr_dc_region; i++) {
+		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
+
+		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
+				 dcr->base, dcr->decode_len, dcr->name);
+		if (rc)
+			return rc;
+	}
+
 	if (mds->partition_align_bytes == 0) {
 		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
 				 mds->volatile_only_bytes, "ram");
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index f2f8b567e0e7..b4eb8164d05d 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -402,6 +402,7 @@ enum cxl_devtype {
 	CXL_DEVTYPE_CLASSMEM,
 };
 
+#define CXL_MAX_DC_REGION 8
 /**
  * struct cxl_dpa_perf - DPA performance property entry
  * @dpa_range: range for DPA address
@@ -431,6 +432,8 @@ struct cxl_dpa_perf {
  * @dpa_res: Overall DPA resource tree for the device
  * @pmem_res: Active Persistent memory capacity configuration
  * @ram_res: Active Volatile memory capacity configuration
+ * @dc_res: Active Dynamic Capacity memory configuration for each possible
+ *          region
  * @serial: PCIe Device Serial Number
  * @type: Generic Memory Class device or Vendor Specific Memory device
  */
@@ -445,10 +448,22 @@ struct cxl_dev_state {
 	struct resource dpa_res;
 	struct resource pmem_res;
 	struct resource ram_res;
+	struct resource dc_res[CXL_MAX_DC_REGION];
 	u64 serial;
 	enum cxl_devtype type;
 };
 
+#define CXL_DC_REGION_STRLEN 8
+struct cxl_dc_region_info {
+	u64 base;
+	u64 decode_len;
+	u64 len;
+	u64 blk_size;
+	u32 dsmad_handle;
+	u8 flags;
+	u8 name[CXL_DC_REGION_STRLEN];
+};
+
 /**
  * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
  *
@@ -466,7 +481,9 @@ struct cxl_dev_state {
  * @dcd_cmds: List of DCD commands implemented by memory device
  * @enabled_cmds: Hardware commands found enabled in CEL.
  * @exclusive_cmds: Commands that are kernel-internal only
- * @total_bytes: sum of all possible capacities
+ * @total_bytes: length of all possible capacities
+ * @static_bytes: length of possible static RAM and PMEM partitions
+ * @dynamic_bytes: length of possible DC partitions (DC Regions)
  * @volatile_only_bytes: hard volatile capacity
  * @persistent_only_bytes: hard persistent capacity
  * @partition_align_bytes: alignment size for partition-able capacity
@@ -476,6 +493,8 @@ struct cxl_dev_state {
  * @next_persistent_bytes: persistent capacity change pending device reset
  * @ram_perf: performance data entry matched to RAM partition
  * @pmem_perf: performance data entry matched to PMEM partition
+ * @nr_dc_region: number of DC regions implemented in the memory device
+ * @dc_region: array containing info about the DC regions
  * @event: event log driver state
  * @poison: poison driver state info
  * @security: security driver state info
@@ -496,6 +515,8 @@ struct cxl_memdev_state {
 	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
 	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
 	u64 total_bytes;
+	u64 static_bytes;
+	u64 dynamic_bytes;
 	u64 volatile_only_bytes;
 	u64 persistent_only_bytes;
 	u64 partition_align_bytes;
@@ -507,6 +528,9 @@ struct cxl_memdev_state {
 	struct cxl_dpa_perf ram_perf;
 	struct cxl_dpa_perf pmem_perf;
 
+	u8 nr_dc_region;
+	struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
+
 	struct cxl_event_state event;
 	struct cxl_poison_state poison;
 	struct cxl_security_state security;
@@ -709,6 +733,32 @@ struct cxl_mbox_set_partition_info {
 
 #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
 
+/* See CXL 3.1 Table 8-163 get dynamic capacity config Input Payload */
+struct cxl_mbox_get_dc_config_in {
+	u8 region_count;
+	u8 start_region_index;
+} __packed;
+
+/* See CXL 3.1 Table 8-164 get dynamic capacity config Output Payload */
+struct cxl_mbox_get_dc_config_out {
+	u8 avail_region_count;
+	u8 regions_returned;
+	u8 rsvd[6];
+	/* See CXL 3.1 Table 8-165 */
+	struct cxl_dc_region_config {
+		__le64 region_base;
+		__le64 region_decode_length;
+		__le64 region_length;
+		__le64 region_block_size;
+		__le32 region_dsmad_handle;
+		u8 flags;
+		u8 rsvd[3];
+	} __packed region[];
+	/* Trailing fields unused */
+} __packed;
+#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
+#define CXL_DCD_BLOCK_LINE_SIZE 0x40
+
 /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
 struct cxl_mbox_set_timestamp_in {
 	__le64 timestamp;
@@ -832,6 +882,7 @@ enum {
 int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
 			  struct cxl_mbox_cmd *cmd);
 int cxl_dev_state_identify(struct cxl_memdev_state *mds);
+int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
 int cxl_await_media_ready(struct cxl_dev_state *cxlds);
 int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
 int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
@@ -845,6 +896,17 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 			    enum cxl_event_log_type type,
 			    enum cxl_event_type event_type,
 			    const uuid_t *uuid, union cxl_event *evt);
+
+static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
+{
+	return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
+}
+
+static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
+{
+	clear_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
+}
+
 int cxl_set_timestamp(struct cxl_memdev_state *mds);
 int cxl_poison_state_init(struct cxl_memdev_state *mds);
 int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 3a60cd66263e..f7f03599bc83 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -874,6 +874,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
+	rc = cxl_dev_dynamic_capacity_identify(mds);
+	if (rc)
+		cxl_disable_dcd(mds);
+
 	rc = cxl_mem_create_range_info(mds);
 	if (rc)
 		return rc;

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 07/25] cxl/core: Separate region mode from decoder mode
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (5 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 06/25] cxl/mem: Read dynamic capacity configuration from the device ira.weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-08-16 22:11   ` Dave Jiang
                     ` (2 more replies)
  2024-08-16 14:44 ` [PATCH v3 08/25] cxl/region: Add dynamic capacity decoder and region modes ira.weiny
                   ` (17 subsequent siblings)
  24 siblings, 3 replies; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm, Jonathan Cameron

From: Navneet Singh <navneet.singh@intel.com>

Until now region modes and decoder modes were equivalent in that both
modes were either PMEM or RAM.  The addition of Dynamic
Capacity partitions defines up to 8 DC partitions per device.

The region mode is thus no longer equivalent to the endpoint decoder
mode.  IOW the endpoint decoders may have modes of DC0-DC7 while the
region mode is simply DC.

Define a new region mode enumeration which applies to regions separate
from the decoder mode.  Adjust the code to process these modes
independently.

There is no equal to decoder mode dead in region modes.  Avoid
constructing regions with decoders which have been flagged as dead.

Suggested-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: rebase]
[Jonathan: remove dead code]
[Jonathan: clarify commit message]
---
 drivers/cxl/core/region.c | 75 ++++++++++++++++++++++++++++++++++-------------
 drivers/cxl/cxl.h         | 26 ++++++++++++++--
 2 files changed, 79 insertions(+), 22 deletions(-)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 971a314b6b0e..796e5a791e44 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -144,7 +144,7 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
 	rc = down_read_interruptible(&cxl_region_rwsem);
 	if (rc)
 		return rc;
-	if (cxlr->mode != CXL_DECODER_PMEM)
+	if (cxlr->mode != CXL_REGION_PMEM)
 		rc = sysfs_emit(buf, "\n");
 	else
 		rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
@@ -457,7 +457,7 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
 	 * Support tooling that expects to find a 'uuid' attribute for all
 	 * regions regardless of mode.
 	 */
-	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
+	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_REGION_PMEM)
 		return 0444;
 	return a->mode;
 }
@@ -620,7 +620,7 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
 {
 	struct cxl_region *cxlr = to_cxl_region(dev);
 
-	return sysfs_emit(buf, "%s\n", cxl_decoder_mode_name(cxlr->mode));
+	return sysfs_emit(buf, "%s\n", cxl_region_mode_name(cxlr->mode));
 }
 static DEVICE_ATTR_RO(mode);
 
@@ -646,7 +646,7 @@ static int alloc_hpa(struct cxl_region *cxlr, resource_size_t size)
 
 	/* ways, granularity and uuid (if PMEM) need to be set before HPA */
 	if (!p->interleave_ways || !p->interleave_granularity ||
-	    (cxlr->mode == CXL_DECODER_PMEM && uuid_is_null(&p->uuid)))
+	    (cxlr->mode == CXL_REGION_PMEM && uuid_is_null(&p->uuid)))
 		return -ENXIO;
 
 	div64_u64_rem(size, (u64)SZ_256M * p->interleave_ways, &remainder);
@@ -1863,6 +1863,17 @@ static int cxl_region_sort_targets(struct cxl_region *cxlr)
 	return rc;
 }
 
+static bool cxl_modes_compatible(enum cxl_region_mode rmode,
+				 enum cxl_decoder_mode dmode)
+{
+	if (rmode == CXL_REGION_RAM && dmode == CXL_DECODER_RAM)
+		return true;
+	if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
+		return true;
+
+	return false;
+}
+
 static int cxl_region_attach(struct cxl_region *cxlr,
 			     struct cxl_endpoint_decoder *cxled, int pos)
 {
@@ -1882,9 +1893,11 @@ static int cxl_region_attach(struct cxl_region *cxlr,
 		return rc;
 	}
 
-	if (cxled->mode != cxlr->mode) {
-		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
-			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
+	if (!cxl_modes_compatible(cxlr->mode, cxled->mode)) {
+		dev_dbg(&cxlr->dev, "%s region mode: %s mismatch decoder: %s\n",
+			dev_name(&cxled->cxld.dev),
+			cxl_region_mode_name(cxlr->mode),
+			cxl_decoder_mode_name(cxled->mode));
 		return -EINVAL;
 	}
 
@@ -2447,7 +2460,7 @@ static int cxl_region_calculate_adistance(struct notifier_block *nb,
  * devm_cxl_add_region - Adds a region to a decoder
  * @cxlrd: root decoder
  * @id: memregion id to create, or memregion_free() on failure
- * @mode: mode for the endpoint decoders of this region
+ * @mode: mode of this region
  * @type: select whether this is an expander or accelerator (type-2 or type-3)
  *
  * This is the second step of region initialization. Regions exist within an
@@ -2458,7 +2471,7 @@ static int cxl_region_calculate_adistance(struct notifier_block *nb,
  */
 static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
 					      int id,
-					      enum cxl_decoder_mode mode,
+					      enum cxl_region_mode mode,
 					      enum cxl_decoder_type type)
 {
 	struct cxl_port *port = to_cxl_port(cxlrd->cxlsd.cxld.dev.parent);
@@ -2512,16 +2525,17 @@ static ssize_t create_ram_region_show(struct device *dev,
 }
 
 static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
-					  enum cxl_decoder_mode mode, int id)
+					  enum cxl_region_mode mode, int id)
 {
 	int rc;
 
 	switch (mode) {
-	case CXL_DECODER_RAM:
-	case CXL_DECODER_PMEM:
+	case CXL_REGION_RAM:
+	case CXL_REGION_PMEM:
 		break;
 	default:
-		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
+		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
+			cxl_region_mode_name(mode));
 		return ERR_PTR(-EINVAL);
 	}
 
@@ -2549,7 +2563,7 @@ static ssize_t create_pmem_region_store(struct device *dev,
 	if (rc != 1)
 		return -EINVAL;
 
-	cxlr = __create_region(cxlrd, CXL_DECODER_PMEM, id);
+	cxlr = __create_region(cxlrd, CXL_REGION_PMEM, id);
 	if (IS_ERR(cxlr))
 		return PTR_ERR(cxlr);
 
@@ -2569,7 +2583,7 @@ static ssize_t create_ram_region_store(struct device *dev,
 	if (rc != 1)
 		return -EINVAL;
 
-	cxlr = __create_region(cxlrd, CXL_DECODER_RAM, id);
+	cxlr = __create_region(cxlrd, CXL_REGION_RAM, id);
 	if (IS_ERR(cxlr))
 		return PTR_ERR(cxlr);
 
@@ -3215,6 +3229,22 @@ static int match_region_by_range(struct device *dev, void *data)
 	return rc;
 }
 
+static enum cxl_region_mode
+cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
+{
+	switch (mode) {
+	case CXL_DECODER_NONE:
+		return CXL_REGION_NONE;
+	case CXL_DECODER_RAM:
+		return CXL_REGION_RAM;
+	case CXL_DECODER_PMEM:
+		return CXL_REGION_PMEM;
+	case CXL_DECODER_MIXED:
+	default:
+		return CXL_REGION_MIXED;
+	}
+}
+
 /* Establish an empty region covering the given HPA range */
 static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
 					   struct cxl_endpoint_decoder *cxled)
@@ -3223,12 +3253,17 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
 	struct cxl_port *port = cxlrd_to_port(cxlrd);
 	struct range *hpa = &cxled->cxld.hpa_range;
 	struct cxl_region_params *p;
+	enum cxl_region_mode mode;
 	struct cxl_region *cxlr;
 	struct resource *res;
 	int rc;
 
+	if (cxled->mode == CXL_DECODER_DEAD)
+		return ERR_PTR(-EINVAL);
+
+	mode = cxl_decoder_to_region_mode(cxled->mode);
 	do {
-		cxlr = __create_region(cxlrd, cxled->mode,
+		cxlr = __create_region(cxlrd, mode,
 				       atomic_read(&cxlrd->region_id));
 	} while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
 
@@ -3431,9 +3466,9 @@ static int cxl_region_probe(struct device *dev)
 		return rc;
 
 	switch (cxlr->mode) {
-	case CXL_DECODER_PMEM:
+	case CXL_REGION_PMEM:
 		return devm_cxl_add_pmem_region(cxlr);
-	case CXL_DECODER_RAM:
+	case CXL_REGION_RAM:
 		/*
 		 * The region can not be manged by CXL if any portion of
 		 * it is already online as 'System RAM'
@@ -3445,8 +3480,8 @@ static int cxl_region_probe(struct device *dev)
 			return 0;
 		return devm_cxl_add_dax_region(cxlr);
 	default:
-		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
-			cxlr->mode);
+		dev_dbg(&cxlr->dev, "unsupported region mode: %s\n",
+			cxl_region_mode_name(cxlr->mode));
 		return -ENXIO;
 	}
 }
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 9afb407d438f..f766b2a8bf53 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -388,6 +388,27 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
 	return "mixed";
 }
 
+enum cxl_region_mode {
+	CXL_REGION_NONE,
+	CXL_REGION_RAM,
+	CXL_REGION_PMEM,
+	CXL_REGION_MIXED,
+};
+
+static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
+{
+	static const char * const names[] = {
+		[CXL_REGION_NONE] = "none",
+		[CXL_REGION_RAM] = "ram",
+		[CXL_REGION_PMEM] = "pmem",
+		[CXL_REGION_MIXED] = "mixed",
+	};
+
+	if (mode >= CXL_REGION_NONE && mode <= CXL_REGION_MIXED)
+		return names[mode];
+	return "mixed";
+}
+
 /*
  * Track whether this decoder is reserved for region autodiscovery, or
  * free for userspace provisioning.
@@ -515,7 +536,8 @@ struct cxl_region_params {
  * struct cxl_region - CXL region
  * @dev: This region's device
  * @id: This region's id. Id is globally unique across all regions
- * @mode: Endpoint decoder allocation / access mode
+ * @mode: Region mode which defines which endpoint decoder modes the region is
+ *        compatible with
  * @type: Endpoint decoder target type
  * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
  * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
@@ -528,7 +550,7 @@ struct cxl_region_params {
 struct cxl_region {
 	struct device dev;
 	int id;
-	enum cxl_decoder_mode mode;
+	enum cxl_region_mode mode;
 	enum cxl_decoder_type type;
 	struct cxl_nvdimm_bridge *cxl_nvb;
 	struct cxl_pmem_region *cxlr_pmem;

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 08/25] cxl/region: Add dynamic capacity decoder and region modes
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (6 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 07/25] cxl/core: Separate region mode from decoder mode ira.weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-08-16 22:14   ` Dave Jiang
  2024-09-03  6:57   ` Li, Ming4
  2024-08-16 14:44 ` [PATCH v3 09/25] cxl/hdm: Add dynamic capacity size support to endpoint decoders ira.weiny
                   ` (16 subsequent siblings)
  24 siblings, 2 replies; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

From: Navneet Singh <navneet.singh@intel.com>

One or more decoders each pointing to a Dynamic Capacity (DC) partition
form a CXL software region.  The region mode reflects composition of
that entire software region.  Decoder mode reflects a specific DC
partition.  DC partitions are also known as DC regions per CXL
specification r3.1.

Define the new modes and helper functions required to make the
association between these new modes.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: keep tags on simple patch]
[Fan: s/partitions/partition/]
[djiang: New wording for the commit message]
[iweiny: reword commit message more]
---
 drivers/cxl/core/region.c |  4 ++++
 drivers/cxl/cxl.h         | 23 +++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 796e5a791e44..650fe33f2ed4 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1870,6 +1870,8 @@ static bool cxl_modes_compatible(enum cxl_region_mode rmode,
 		return true;
 	if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
 		return true;
+	if (rmode == CXL_REGION_DC && cxl_decoder_mode_is_dc(dmode))
+		return true;
 
 	return false;
 }
@@ -3239,6 +3241,8 @@ cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
 		return CXL_REGION_RAM;
 	case CXL_DECODER_PMEM:
 		return CXL_REGION_PMEM;
+	case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
+		return CXL_REGION_DC;
 	case CXL_DECODER_MIXED:
 	default:
 		return CXL_REGION_MIXED;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f766b2a8bf53..d2674ab46f35 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -370,6 +370,14 @@ enum cxl_decoder_mode {
 	CXL_DECODER_NONE,
 	CXL_DECODER_RAM,
 	CXL_DECODER_PMEM,
+	CXL_DECODER_DC0,
+	CXL_DECODER_DC1,
+	CXL_DECODER_DC2,
+	CXL_DECODER_DC3,
+	CXL_DECODER_DC4,
+	CXL_DECODER_DC5,
+	CXL_DECODER_DC6,
+	CXL_DECODER_DC7,
 	CXL_DECODER_MIXED,
 	CXL_DECODER_DEAD,
 };
@@ -380,6 +388,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
 		[CXL_DECODER_NONE] = "none",
 		[CXL_DECODER_RAM] = "ram",
 		[CXL_DECODER_PMEM] = "pmem",
+		[CXL_DECODER_DC0] = "dc0",
+		[CXL_DECODER_DC1] = "dc1",
+		[CXL_DECODER_DC2] = "dc2",
+		[CXL_DECODER_DC3] = "dc3",
+		[CXL_DECODER_DC4] = "dc4",
+		[CXL_DECODER_DC5] = "dc5",
+		[CXL_DECODER_DC6] = "dc6",
+		[CXL_DECODER_DC7] = "dc7",
 		[CXL_DECODER_MIXED] = "mixed",
 	};
 
@@ -388,10 +404,16 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
 	return "mixed";
 }
 
+static inline bool cxl_decoder_mode_is_dc(enum cxl_decoder_mode mode)
+{
+	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
+}
+
 enum cxl_region_mode {
 	CXL_REGION_NONE,
 	CXL_REGION_RAM,
 	CXL_REGION_PMEM,
+	CXL_REGION_DC,
 	CXL_REGION_MIXED,
 };
 
@@ -401,6 +423,7 @@ static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
 		[CXL_REGION_NONE] = "none",
 		[CXL_REGION_RAM] = "ram",
 		[CXL_REGION_PMEM] = "pmem",
+		[CXL_REGION_DC] = "dc",
 		[CXL_REGION_MIXED] = "mixed",
 	};
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 09/25] cxl/hdm: Add dynamic capacity size support to endpoint decoders
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (7 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 08/25] cxl/region: Add dynamic capacity decoder and region modes ira.weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-08-16 23:08   ` Dave Jiang
  2024-08-23 16:09   ` Jonathan Cameron
  2024-08-16 14:44 ` [PATCH v3 10/25] cxl/port: Add endpoint decoder DC mode support to sysfs ira.weiny
                   ` (15 subsequent siblings)
  24 siblings, 2 replies; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

From: Navneet Singh <navneet.singh@intel.com>

To support Dynamic Capacity Devices (DCD) endpoint decoders will need to
map DC partitions (regions).  In addition to assigning the size of the
DC partition, the decoder must assign any skip value from the previous
decoder.  This must be done within a contiguous DPA space.

Two complications arise with Dynamic Capacity regions which did not
exist with Ram and PMEM partitions.  First, gaps in the DPA space can
exist between and around the DC partitions.  Second, the Linux resource
tree does not allow a resource to be marked across existing nodes within
a tree.

For clarity, below is an example of an 60GB device with 10GB of RAM,
10GB of PMEM and 10GB for each of 2 DC partitions.  The desired CXL
mapping is 5GB of RAM, 5GB of PMEM, and 5GB of DC1.

     DPA RANGE
     (dpa_res)
0GB        10GB       20GB       30GB       40GB       50GB       60GB
|----------|----------|----------|----------|----------|----------|

RAM         PMEM                  DC0                   DC1
 (ram_res)  (pmem_res)            (dc_res[0])           (dc_res[1])
|----------|----------|   <gap>  |----------|   <gap>  |----------|

 RAM        PMEM                                        DC1
|XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|
0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB       60GB

The previous skip resource between RAM and PMEM was always a child of
the RAM resource and fit nicely [see (S) below].  Because of this
simplicity this skip resource reference was not stored in any CXL state.
On release the skip range could be calculated based on the endpoint
decoders stored values.

Now when DC1 is being mapped 4 skip resources must be created as
children.  One for the PMEM resource (A), two of the parent DPA resource
(B,D), and one more child of the DC0 resource (C).

0GB        10GB       20GB       30GB       40GB       50GB       60GB
|----------|----------|----------|----------|----------|----------|
                           |                     |
|----------|----------|    |     |----------|    |     |----------|
        |          |       |          |          |
       (S)        (A)     (B)        (C)        (D)
	v          v       v          v          v
|XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|
       skip       skip  skip        skip      skip

Expand the calculation of DPA free space and enhance the logic to
support this more complex skipping.  To track the potential of multiple
skip resources an xarray is attached to the endpoint decoder.  The
existing algorithm between RAM and PMEM is consolidated within the new
one to streamline the code even though the result is the storage of a
single skip resource in the xarray.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[Jonathan: Use an example only mapping 1/2 of DC1]
[iweiny: Update cover letter]
[iweiny: Fix 0day bugs
	https://lore.kernel.org/all/202408090138.RB41yBE8-lkp@intel.com/
[djbw/Jonathan: allow more than 1 region per DC partition]
---
 drivers/cxl/core/hdm.c  | 196 ++++++++++++++++++++++++++++++++++++++++++++----
 drivers/cxl/core/port.c |   2 +
 drivers/cxl/cxl.h       |   2 +
 3 files changed, 184 insertions(+), 16 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 3df10517a327..b4a517c6d283 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -223,6 +223,25 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, CXL);
 
+static void cxl_skip_release(struct cxl_endpoint_decoder *cxled)
+{
+	struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
+	struct cxl_port *port = cxled_to_port(cxled);
+	struct device *dev = &port->dev;
+	unsigned long index;
+	void *entry;
+
+	xa_for_each(&cxled->skip_res, index, entry) {
+		struct resource *res = entry;
+
+		dev_dbg(dev, "decoder%d.%d: releasing skipped space; %pr\n",
+			port->id, cxled->cxld.id, res);
+		__release_region(&cxlds->dpa_res, res->start,
+				 resource_size(res));
+		xa_erase(&cxled->skip_res, index);
+	}
+}
+
 /*
  * Must be called in a context that synchronizes against this decoder's
  * port ->remove() callback (like an endpoint decoder sysfs attribute)
@@ -233,15 +252,11 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
 	struct cxl_port *port = cxled_to_port(cxled);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct resource *res = cxled->dpa_res;
-	resource_size_t skip_start;
 
 	lockdep_assert_held_write(&cxl_dpa_rwsem);
 
-	/* save @skip_start, before @res is released */
-	skip_start = res->start - cxled->skip;
 	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
-	if (cxled->skip)
-		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
+	cxl_skip_release(cxled);
 	cxled->skip = 0;
 	cxled->dpa_res = NULL;
 	put_device(&cxled->cxld.dev);
@@ -268,6 +283,105 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
 	__cxl_dpa_release(cxled);
 }
 
+static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
+{
+	return mode - CXL_DECODER_DC0;
+}
+
+static int cxl_request_skip(struct cxl_endpoint_decoder *cxled,
+			    resource_size_t skip_base, resource_size_t skip_len)
+{
+	struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
+	const char *name = dev_name(&cxled->cxld.dev);
+	struct cxl_port *port = cxled_to_port(cxled);
+	struct resource *dpa_res = &cxlds->dpa_res;
+	struct device *dev = &port->dev;
+	struct resource *res;
+	int rc;
+
+	res = __request_region(dpa_res, skip_base, skip_len, name, 0);
+	if (!res)
+		return -EBUSY;
+
+	rc = xa_insert(&cxled->skip_res, skip_base, res, GFP_KERNEL);
+	if (rc) {
+		__release_region(dpa_res, skip_base, skip_len);
+		return rc;
+	}
+
+	dev_dbg(dev, "decoder%d.%d: skipped space; %pr\n",
+		port->id, cxled->cxld.id, res);
+	return 0;
+}
+
+static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
+				resource_size_t base, resource_size_t skipped)
+{
+	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+	struct cxl_port *port = cxled_to_port(cxled);
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	resource_size_t skip_base = base - skipped;
+	struct device *dev = &port->dev;
+	resource_size_t skip_len = 0;
+	int rc, index;
+
+	if (resource_size(&cxlds->ram_res) && skip_base <= cxlds->ram_res.end) {
+		skip_len = cxlds->ram_res.end - skip_base + 1;
+		rc = cxl_request_skip(cxled, skip_base, skip_len);
+		if (rc)
+			return rc;
+		skip_base += skip_len;
+	}
+
+	if (skip_base == base) {
+		dev_dbg(dev, "skip done ram!\n");
+		return 0;
+	}
+
+	if (resource_size(&cxlds->pmem_res) &&
+	    skip_base <= cxlds->pmem_res.end) {
+		skip_len = cxlds->pmem_res.end - skip_base + 1;
+		rc = cxl_request_skip(cxled, skip_base, skip_len);
+		if (rc)
+			return rc;
+		skip_base += skip_len;
+	}
+
+	index = dc_mode_to_region_index(cxled->mode);
+	for (int i = 0; i <= index; i++) {
+		struct resource *dcr = &cxlds->dc_res[i];
+
+		if (skip_base < dcr->start) {
+			skip_len = dcr->start - skip_base;
+			rc = cxl_request_skip(cxled, skip_base, skip_len);
+			if (rc)
+				return rc;
+			skip_base += skip_len;
+		}
+
+		if (skip_base == base) {
+			dev_dbg(dev, "skip done DC region %d!\n", i);
+			break;
+		}
+
+		if (resource_size(dcr) && skip_base <= dcr->end) {
+			if (skip_base > base) {
+				dev_err(dev, "Skip error DC region %d; skip_base %pa; base %pa\n",
+					i, &skip_base, &base);
+				return -ENXIO;
+			}
+
+			skip_len = dcr->end - skip_base + 1;
+			rc = cxl_request_skip(cxled, skip_base, skip_len);
+			if (rc)
+				return rc;
+			skip_base += skip_len;
+		}
+	}
+
+	return 0;
+}
+
 static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 			     resource_size_t base, resource_size_t len,
 			     resource_size_t skipped)
@@ -305,13 +419,12 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	}
 
 	if (skipped) {
-		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
-				       dev_name(&cxled->cxld.dev), 0);
-		if (!res) {
-			dev_dbg(dev,
-				"decoder%d.%d: failed to reserve skipped space\n",
-				port->id, cxled->cxld.id);
-			return -EBUSY;
+		int rc = cxl_reserve_dpa_skip(cxled, base, skipped);
+
+		if (rc) {
+			dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space; %pa - %pa\n",
+				port->id, cxled->cxld.id, &base, &skipped);
+			return rc;
 		}
 	}
 	res = __request_region(&cxlds->dpa_res, base, len,
@@ -319,14 +432,20 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	if (!res) {
 		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
 			port->id, cxled->cxld.id);
-		if (skipped)
-			__release_region(&cxlds->dpa_res, base - skipped,
-					 skipped);
+		cxl_skip_release(cxled);
 		return -EBUSY;
 	}
 	cxled->dpa_res = res;
 	cxled->skip = skipped;
 
+	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
+		int index = dc_mode_to_region_index(mode);
+
+		if (resource_contains(&cxlds->dc_res[index], res)) {
+			cxled->mode = mode;
+			goto success;
+		}
+	}
 	if (resource_contains(&cxlds->pmem_res, res))
 		cxled->mode = CXL_DECODER_PMEM;
 	else if (resource_contains(&cxlds->ram_res, res))
@@ -337,6 +456,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 		cxled->mode = CXL_DECODER_MIXED;
 	}
 
+success:
+	dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id, cxled->cxld.id,
+		cxled->dpa_res, cxled->mode);
 	port->hdm_end++;
 	get_device(&cxled->cxld.dev);
 	return 0;
@@ -466,8 +588,8 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
 
 int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 {
-	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
 	resource_size_t free_ram_start, free_pmem_start;
+	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
 	struct cxl_port *port = cxled_to_port(cxled);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct device *dev = &cxled->cxld.dev;
@@ -524,12 +646,54 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 		else
 			skip_end = start - 1;
 		skip = skip_end - skip_start + 1;
+	} else if (cxl_decoder_mode_is_dc(cxled->mode)) {
+		int dc_index = dc_mode_to_region_index(cxled->mode);
+
+		for (p = cxlds->dc_res[dc_index].child, last = NULL; p; p = p->sibling)
+			last = p;
+
+		if (last) {
+			/*
+			 * Some capacity in this DC partition is already allocated,
+			 * that allocation already handled the skip.
+			 */
+			start = last->end + 1;
+			skip = 0;
+		} else {
+			/* Calculate skip */
+			resource_size_t skip_start, skip_end;
+
+			start = cxlds->dc_res[dc_index].start;
+
+			if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
+				skip_start = free_ram_start;
+			else
+				skip_start = free_pmem_start;
+			/*
+			 * If any dc region is already mapped, then that allocation
+			 * already handled the RAM and PMEM skip.  Check for DC region
+			 * skip.
+			 */
+			for (int i = dc_index - 1; i >= 0 ; i--) {
+				if (cxlds->dc_res[i].child) {
+					skip_start = cxlds->dc_res[i].child->end + 1;
+					break;
+				}
+			}
+
+			skip_end = start - 1;
+			skip = skip_end - skip_start + 1;
+		}
+		avail = cxlds->dc_res[dc_index].end - start + 1;
 	} else {
 		dev_dbg(dev, "mode not set\n");
 		rc = -EINVAL;
 		goto out;
 	}
 
+	dev_dbg(dev, "DPA Allocation start: %pa len: %#llx Skip: %pa\n",
+		&start, size, &skip);
+
 	if (size > avail) {
 		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
 			cxl_decoder_mode_name(cxled->mode), &avail);
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 1d5007e3795a..8054cbaac9f6 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -419,6 +419,7 @@ static void cxl_endpoint_decoder_release(struct device *dev)
 	struct cxl_endpoint_decoder *cxled = to_cxl_endpoint_decoder(dev);
 
 	__cxl_decoder_release(&cxled->cxld);
+	xa_destroy(&cxled->skip_res);
 	kfree(cxled);
 }
 
@@ -1899,6 +1900,7 @@ struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
 		return ERR_PTR(-ENOMEM);
 
 	cxled->pos = -1;
+	xa_init(&cxled->skip_res);
 	cxld = &cxled->cxld;
 	rc = cxl_decoder_init(port, cxld);
 	if (rc)	 {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index d2674ab46f35..53b666ef4097 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -446,6 +446,7 @@ enum cxl_decoder_state {
  * @cxld: base cxl_decoder_object
  * @dpa_res: actively claimed DPA span of this decoder
  * @skip: offset into @dpa_res where @cxld.hpa_range maps
+ * @skip_res: array of skipped resources from the previous decoder end
  * @mode: which memory type / access-mode-partition this decoder targets
  * @state: autodiscovery state
  * @pos: interleave position in @cxld.region
@@ -454,6 +455,7 @@ struct cxl_endpoint_decoder {
 	struct cxl_decoder cxld;
 	struct resource *dpa_res;
 	resource_size_t skip;
+	struct xarray skip_res;
 	enum cxl_decoder_mode mode;
 	enum cxl_decoder_state state;
 	int pos;

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 10/25] cxl/port: Add endpoint decoder DC mode support to sysfs
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (8 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 09/25] cxl/hdm: Add dynamic capacity size support to endpoint decoders ira.weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-08-16 23:17   ` Dave Jiang
  2024-08-23 16:12   ` Jonathan Cameron
  2024-08-16 14:44 ` [PATCH v3 11/25] cxl/mem: Expose DCD partition capabilities in sysfs ira.weiny
                   ` (14 subsequent siblings)
  24 siblings, 2 replies; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

From: Navneet Singh <navneet.singh@intel.com>

Endpoint decoder mode is used to represent the partition the decoder
points to such as ram or pmem.

Expand the mode to allow a decoder to point to a specific DC partition
(Region).

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[Fan: change mode range logic]
[Fan: use !resource_size()]
[djiang: use the static mode name string array in mode_store()]
[Jonathan: remove rc check from mode to region index]
[Jonathan: clarify decoder mode 'mixed']
[djbw: drop cleanup patch and just follow the convention in cxl_dpa_set_mode()]
[fan: make dcd resource size check similar to other partitions]
[djbw, jonathan, fan: remove mode range check from dc_mode_to_region_index]
[iweiny: push sysfs versions to 6.12]
---
 Documentation/ABI/testing/sysfs-bus-cxl | 21 ++++++++++----------
 drivers/cxl/core/hdm.c                  | 10 ++++++++++
 drivers/cxl/core/port.c                 | 10 +++++-----
 drivers/cxl/cxl.h                       | 35 ++++++++++++++++++---------------
 4 files changed, 45 insertions(+), 31 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 3f5627a1210a..957717264709 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -316,23 +316,24 @@ Description:
 
 
 What:		/sys/bus/cxl/devices/decoderX.Y/mode
-Date:		May, 2022
-KernelVersion:	v6.0
+Date:		May, 2022, October 2024
+KernelVersion:	v6.0, v6.12 (dcY)
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
 		translates from a host physical address range, to a device local
 		address range. Device-local address ranges are further split
-		into a 'ram' (volatile memory) range and 'pmem' (persistent
-		memory) range. The 'mode' attribute emits one of 'ram', 'pmem',
-		'mixed', or 'none'. The 'mixed' indication is for error cases
-		when a decoder straddles the volatile/persistent partition
-		boundary, and 'none' indicates the decoder is not actively
-		decoding, or no DPA allocation policy has been set.
+		into a 'ram' (volatile memory) range, 'pmem' (persistent
+		memory) range, or Dynamic Capacity (DC) range. The 'mode'
+		attribute emits one of 'ram', 'pmem', 'dcY', 'mixed', or
+		'none'. The 'mixed' indication is for error cases when a
+		decoder straddles partition boundaries, and 'none' indicates
+		the decoder is not actively decoding, or no DPA allocation
+		policy has been set.
 
 		'mode' can be written, when the decoder is in the 'disabled'
-		state, with either 'ram' or 'pmem' to set the boundaries for the
-		next allocation.
+		state, with 'ram', 'pmem', or 'dcY' to set the boundaries for
+		the next allocation.
 
 
 What:		/sys/bus/cxl/devices/decoderX.Y/dpa_resource
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index b4a517c6d283..ceca0b3d3e5c 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -551,6 +551,7 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
 	switch (mode) {
 	case CXL_DECODER_RAM:
 	case CXL_DECODER_PMEM:
+	case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
 		break;
 	default:
 		dev_dbg(dev, "unsupported mode: %d\n", mode);
@@ -578,6 +579,15 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
 		goto out;
 	}
 
+	if (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7) {
+		rc = dc_mode_to_region_index(mode);
+		if (!resource_size(&cxlds->dc_res[rc])) {
+			dev_dbg(dev, "no available dynamic capacity\n");
+			rc = -ENXIO;
+			goto out;
+		}
+	}
+
 	cxled->mode = mode;
 	rc = 0;
 out:
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 8054cbaac9f6..222aa0aeeef7 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -205,11 +205,11 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
 	enum cxl_decoder_mode mode;
 	ssize_t rc;
 
-	if (sysfs_streq(buf, "pmem"))
-		mode = CXL_DECODER_PMEM;
-	else if (sysfs_streq(buf, "ram"))
-		mode = CXL_DECODER_RAM;
-	else
+	for (mode = CXL_DECODER_RAM; mode < CXL_DECODER_MIXED; mode++)
+		if (sysfs_streq(buf, cxl_decoder_mode_names[mode]))
+			break;
+
+	if (mode >= CXL_DECODER_MIXED)
 		return -EINVAL;
 
 	rc = cxl_dpa_set_mode(cxled, mode);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 53b666ef4097..16861c867537 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -365,6 +365,9 @@ struct cxl_decoder {
 /*
  * CXL_DECODER_DEAD prevents endpoints from being reattached to regions
  * while cxld_unregister() is running
+ *
+ * NOTE: CXL_DECODER_RAM must be second and CXL_DECODER_MIXED must be last.
+ *	 See mode_store()
  */
 enum cxl_decoder_mode {
 	CXL_DECODER_NONE,
@@ -382,25 +385,25 @@ enum cxl_decoder_mode {
 	CXL_DECODER_DEAD,
 };
 
+static const char * const cxl_decoder_mode_names[] = {
+	[CXL_DECODER_NONE] = "none",
+	[CXL_DECODER_RAM] = "ram",
+	[CXL_DECODER_PMEM] = "pmem",
+	[CXL_DECODER_DC0] = "dc0",
+	[CXL_DECODER_DC1] = "dc1",
+	[CXL_DECODER_DC2] = "dc2",
+	[CXL_DECODER_DC3] = "dc3",
+	[CXL_DECODER_DC4] = "dc4",
+	[CXL_DECODER_DC5] = "dc5",
+	[CXL_DECODER_DC6] = "dc6",
+	[CXL_DECODER_DC7] = "dc7",
+	[CXL_DECODER_MIXED] = "mixed",
+};
+
 static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
 {
-	static const char * const names[] = {
-		[CXL_DECODER_NONE] = "none",
-		[CXL_DECODER_RAM] = "ram",
-		[CXL_DECODER_PMEM] = "pmem",
-		[CXL_DECODER_DC0] = "dc0",
-		[CXL_DECODER_DC1] = "dc1",
-		[CXL_DECODER_DC2] = "dc2",
-		[CXL_DECODER_DC3] = "dc3",
-		[CXL_DECODER_DC4] = "dc4",
-		[CXL_DECODER_DC5] = "dc5",
-		[CXL_DECODER_DC6] = "dc6",
-		[CXL_DECODER_DC7] = "dc7",
-		[CXL_DECODER_MIXED] = "mixed",
-	};
-
 	if (mode >= CXL_DECODER_NONE && mode <= CXL_DECODER_MIXED)
-		return names[mode];
+		return cxl_decoder_mode_names[mode];
 	return "mixed";
 }
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 11/25] cxl/mem: Expose DCD partition capabilities in sysfs
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (9 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 10/25] cxl/port: Add endpoint decoder DC mode support to sysfs ira.weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-08-16 23:42   ` Dave Jiang
  2024-08-16 14:44 ` [PATCH v3 12/25] cxl/region: Refactor common create region code Ira Weiny
                   ` (13 subsequent siblings)
  24 siblings, 1 reply; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

From: Navneet Singh <navneet.singh@intel.com>

To properly configure CXL regions on Dynamic Capacity Devices (DCD),
user space will need to know the details of the DC partitions available.

Expose dynamic capacity capabilities through sysfs.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: remove review tags]
[Davidlohr/Fan/Jonathan: omit 'dc' attribute directory if device is not DC]
[Jonathan: update documentation for dc visibility]
[Jonathan: Add a comment to DC region X attributes to ensure visibility checks work]
[iweiny: push sysfs version to 6.12]
---
 Documentation/ABI/testing/sysfs-bus-cxl | 12 ++++
 drivers/cxl/core/memdev.c               | 97 +++++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 957717264709..6227ae0ab3fc 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -54,6 +54,18 @@ Description:
 		identically named field in the Identify Memory Device Output
 		Payload in the CXL-2.0 specification.
 
+What:		/sys/bus/cxl/devices/memX/dc/region_count
+		/sys/bus/cxl/devices/memX/dc/regionY_size
+Date:		August, 2024
+KernelVersion:	v6.12
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RO) Dynamic Capacity (DC) region information.  The dc
+		directory is only visible on devices which support Dynamic
+		Capacity.
+		The region_count is the number of Dynamic Capacity (DC)
+		partitions (regions) supported on the device.
+		regionY_size is the size of each of those partitions.
 
 What:		/sys/bus/cxl/devices/memX/pmem/qos_class
 Date:		May, 2023
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 0277726afd04..7da1f0f5711a 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -101,6 +101,18 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
 static struct device_attribute dev_attr_pmem_size =
 	__ATTR(size, 0444, pmem_size_show, NULL);
 
+static ssize_t region_count_show(struct device *dev, struct device_attribute *attr,
+				 char *buf)
+{
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+	return sysfs_emit(buf, "%d\n", mds->nr_dc_region);
+}
+
+static struct device_attribute dev_attr_region_count =
+	__ATTR(region_count, 0444, region_count_show, NULL);
+
 static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
 			   char *buf)
 {
@@ -448,6 +460,90 @@ static struct attribute *cxl_memdev_security_attributes[] = {
 	NULL,
 };
 
+static ssize_t show_size_regionN(struct cxl_memdev *cxlmd, char *buf, int pos)
+{
+	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+	return sysfs_emit(buf, "%#llx\n", mds->dc_region[pos].decode_len);
+}
+
+#define REGION_SIZE_ATTR_RO(n)						\
+static ssize_t region##n##_size_show(struct device *dev,		\
+				     struct device_attribute *attr,	\
+				     char *buf)				\
+{									\
+	return show_size_regionN(to_cxl_memdev(dev), buf, (n));		\
+}									\
+static DEVICE_ATTR_RO(region##n##_size)
+REGION_SIZE_ATTR_RO(0);
+REGION_SIZE_ATTR_RO(1);
+REGION_SIZE_ATTR_RO(2);
+REGION_SIZE_ATTR_RO(3);
+REGION_SIZE_ATTR_RO(4);
+REGION_SIZE_ATTR_RO(5);
+REGION_SIZE_ATTR_RO(6);
+REGION_SIZE_ATTR_RO(7);
+
+/*
+ * RegionX attributes must be listed in order and first in this array to
+ * support the visbility checks.
+ */
+static struct attribute *cxl_memdev_dc_attributes[] = {
+	&dev_attr_region0_size.attr,
+	&dev_attr_region1_size.attr,
+	&dev_attr_region2_size.attr,
+	&dev_attr_region3_size.attr,
+	&dev_attr_region4_size.attr,
+	&dev_attr_region5_size.attr,
+	&dev_attr_region6_size.attr,
+	&dev_attr_region7_size.attr,
+	&dev_attr_region_count.attr,
+	NULL,
+};
+
+static umode_t cxl_memdev_dc_attr_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+	/* Not a memory device */
+	if (!mds)
+		return 0;
+
+	if (a == &dev_attr_region_count.attr)
+		return a->mode;
+
+	/*
+	 * Show only the regions supported, regionX attributes are first in the
+	 * list
+	 */
+	if (n < mds->nr_dc_region)
+		return a->mode;
+
+	return 0;
+}
+
+static bool cxl_memdev_dc_group_visible(struct kobject *kobj)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+	/* No DC regions */
+	if (!mds || mds->nr_dc_region == 0)
+		return false;
+	return true;
+}
+
+DEFINE_SYSFS_GROUP_VISIBLE(cxl_memdev_dc);
+
+static struct attribute_group cxl_memdev_dc_group = {
+	.name = "dc",
+	.attrs = cxl_memdev_dc_attributes,
+	.is_visible = SYSFS_GROUP_VISIBLE(cxl_memdev_dc),
+};
+
 static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
 				  int n)
 {
@@ -528,6 +624,7 @@ static const struct attribute_group *cxl_memdev_attribute_groups[] = {
 	&cxl_memdev_ram_attribute_group,
 	&cxl_memdev_pmem_attribute_group,
 	&cxl_memdev_security_attribute_group,
+	&cxl_memdev_dc_group,
 	NULL,
 };
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 12/25] cxl/region: Refactor common create region code
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (10 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 11/25] cxl/mem: Expose DCD partition capabilities in sysfs ira.weiny
@ 2024-08-16 14:44 ` Ira Weiny
  2024-08-16 23:43   ` Dave Jiang
                     ` (3 more replies)
  2024-08-16 14:44 ` [PATCH v3 13/25] cxl/region: Add sparse DAX region support ira.weiny
                   ` (12 subsequent siblings)
  24 siblings, 4 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

create_pmem_region_store() and create_ram_region_store() are identical
with the exception of the region mode.  With the addition of DC region
mode this would end up being 3 copies of the same code.

Refactor create_pmem_region_store() and create_ram_region_store() to use
a single common function to be used in subsequent DC code.

Suggested-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/cxl/core/region.c | 28 +++++++++++-----------------
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 650fe33f2ed4..f85b26b39b2f 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2553,9 +2553,8 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
 	return devm_cxl_add_region(cxlrd, id, mode, CXL_DECODER_HOSTONLYMEM);
 }
 
-static ssize_t create_pmem_region_store(struct device *dev,
-					struct device_attribute *attr,
-					const char *buf, size_t len)
+static ssize_t create_region_store(struct device *dev, const char *buf,
+				   size_t len, enum cxl_region_mode mode)
 {
 	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
 	struct cxl_region *cxlr;
@@ -2565,31 +2564,26 @@ static ssize_t create_pmem_region_store(struct device *dev,
 	if (rc != 1)
 		return -EINVAL;
 
-	cxlr = __create_region(cxlrd, CXL_REGION_PMEM, id);
+	cxlr = __create_region(cxlrd, mode, id);
 	if (IS_ERR(cxlr))
 		return PTR_ERR(cxlr);
 
 	return len;
 }
+
+static ssize_t create_pmem_region_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t len)
+{
+	return create_region_store(dev, buf, len, CXL_REGION_PMEM);
+}
 DEVICE_ATTR_RW(create_pmem_region);
 
 static ssize_t create_ram_region_store(struct device *dev,
 				       struct device_attribute *attr,
 				       const char *buf, size_t len)
 {
-	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
-	struct cxl_region *cxlr;
-	int rc, id;
-
-	rc = sscanf(buf, "region%d\n", &id);
-	if (rc != 1)
-		return -EINVAL;
-
-	cxlr = __create_region(cxlrd, CXL_REGION_RAM, id);
-	if (IS_ERR(cxlr))
-		return PTR_ERR(cxlr);
-
-	return len;
+	return create_region_store(dev, buf, len, CXL_REGION_RAM);
 }
 DEVICE_ATTR_RW(create_ram_region);
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 13/25] cxl/region: Add sparse DAX region support
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (11 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 12/25] cxl/region: Refactor common create region code Ira Weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-08-16 23:51   ` Dave Jiang
                     ` (3 more replies)
  2024-08-16 14:44 ` [PATCH v3 14/25] cxl/events: Split event msgnum configuration from irq setup Ira Weiny
                   ` (11 subsequent siblings)
  24 siblings, 4 replies; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

From: Navneet Singh <navneet.singh@intel.com>

Dynamic Capacity CXL regions must allow memory to be added or removed
dynamically.  In addition to the quantity of memory available the
location of the memory within a DC partition is dynamic based on the
extents offered by a device.  CXL DAX regions must accommodate the
sparseness of this memory in the management of DAX regions and devices.

Introduce the concept of a sparse DAX region.  Add a create_dc_region()
sysfs entry to create such regions.  Special case DC capable regions to
create a 0 sized seed DAX device to maintain compatibility which
requires a default DAX device to hold a region reference.

Indicate 0 byte available capacity until such time that capacity is
added.

Sparse regions complicate the range mapping of dax devices.  There is no
known use case for range mapping on sparse regions.  Avoid the
complication by preventing range mapping of dax devices on sparse
regions.

Interleaving is deferred for now.  Add checks.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[Fan: use single function for dc region store]
[djiang: avoid setting dev_size twice]
[djbw: Check DCD support and interleave restriction on region creation]
[iweiny: squash patch : dax/region: Prevent range mapping allocation on sparse regions]
[iwieny: remove reviews]
[iweiny: rebase to master]
[iweiny: push sysfs version to 6.12]
[iweiny: make cxled_to_mds inline]
---
 Documentation/ABI/testing/sysfs-bus-cxl | 22 ++++++++--------
 drivers/cxl/core/core.h                 | 12 +++++++++
 drivers/cxl/core/port.c                 |  1 +
 drivers/cxl/core/region.c               | 46 +++++++++++++++++++++++++++++++--
 drivers/dax/bus.c                       | 10 +++++++
 drivers/dax/bus.h                       |  1 +
 drivers/dax/cxl.c                       | 16 ++++++++++--
 7 files changed, 93 insertions(+), 15 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 6227ae0ab3fc..3a5ee88e551b 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -406,20 +406,20 @@ Description:
 		interleave_granularity).
 
 
-What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
-Date:		May, 2022, January, 2023
-KernelVersion:	v6.0 (pmem), v6.3 (ram)
+What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dc}_region
+Date:		May, 2022, January, 2023, August 2024
+KernelVersion:	v6.0 (pmem), v6.3 (ram), v6.12 (dc)
 Contact:	linux-cxl@vger.kernel.org
 Description:
 		(RW) Write a string in the form 'regionZ' to start the process
-		of defining a new persistent, or volatile memory region
-		(interleave-set) within the decode range bounded by root decoder
-		'decoderX.Y'. The value written must match the current value
-		returned from reading this attribute. An atomic compare exchange
-		operation is done on write to assign the requested id to a
-		region and allocate the region-id for the next creation attempt.
-		EBUSY is returned if the region name written does not match the
-		current cached value.
+		of defining a new persistent, volatile, or Dynamic Capacity
+		(DC) memory region (interleave-set) within the decode range
+		bounded by root decoder 'decoderX.Y'. The value written must
+		match the current value returned from reading this attribute.
+		An atomic compare exchange operation is done on write to assign
+		the requested id to a region and allocate the region-id for the
+		next creation attempt.  EBUSY is returned if the region name
+		written does not match the current cached value.
 
 
 What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 72a506c9dbd0..15b6cf1c19ef 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -4,15 +4,27 @@
 #ifndef __CXL_CORE_H__
 #define __CXL_CORE_H__
 
+#include <cxlmem.h>
+
 extern const struct device_type cxl_nvdimm_bridge_type;
 extern const struct device_type cxl_nvdimm_type;
 extern const struct device_type cxl_pmu_type;
 
 extern struct attribute_group cxl_base_attribute_group;
 
+static inline struct cxl_memdev_state *
+cxled_to_mds(struct cxl_endpoint_decoder *cxled)
+{
+	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+	return container_of(cxlds, struct cxl_memdev_state, cxlds);
+}
+
 #ifdef CONFIG_CXL_REGION
 extern struct device_attribute dev_attr_create_pmem_region;
 extern struct device_attribute dev_attr_create_ram_region;
+extern struct device_attribute dev_attr_create_dc_region;
 extern struct device_attribute dev_attr_delete_region;
 extern struct device_attribute dev_attr_region;
 extern const struct device_type cxl_pmem_region_type;
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 222aa0aeeef7..44e1e203173d 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -320,6 +320,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
 	&dev_attr_qos_class.attr,
 	SET_CXL_REGION_ATTR(create_pmem_region)
 	SET_CXL_REGION_ATTR(create_ram_region)
+	SET_CXL_REGION_ATTR(create_dc_region)
 	SET_CXL_REGION_ATTR(delete_region)
 	NULL,
 };
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index f85b26b39b2f..35c4a1f4f9bd 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -496,6 +496,11 @@ static ssize_t interleave_ways_store(struct device *dev,
 	if (rc)
 		return rc;
 
+	if (cxlr->mode == CXL_REGION_DC && val != 1) {
+		dev_err(dev, "Interleaving and DCD not supported\n");
+		return -EINVAL;
+	}
+
 	rc = ways_to_eiw(val, &iw);
 	if (rc)
 		return rc;
@@ -2174,6 +2179,7 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
 	if (sysfs_streq(buf, "\n"))
 		rc = detach_target(cxlr, pos);
 	else {
+		struct cxl_endpoint_decoder *cxled;
 		struct device *dev;
 
 		dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
@@ -2185,8 +2191,13 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
 			goto out;
 		}
 
-		rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
-				   TASK_INTERRUPTIBLE);
+		cxled = to_cxl_endpoint_decoder(dev);
+		if (cxlr->mode == CXL_REGION_DC &&
+		    !cxl_dcd_supported(cxled_to_mds(cxled))) {
+			dev_dbg(dev, "DCD unsupported\n");
+			return -EINVAL;
+		}
+		rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
 out:
 		put_device(dev);
 	}
@@ -2534,6 +2545,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
 	switch (mode) {
 	case CXL_REGION_RAM:
 	case CXL_REGION_PMEM:
+	case CXL_REGION_DC:
 		break;
 	default:
 		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
@@ -2587,6 +2599,20 @@ static ssize_t create_ram_region_store(struct device *dev,
 }
 DEVICE_ATTR_RW(create_ram_region);
 
+static ssize_t create_dc_region_show(struct device *dev,
+				     struct device_attribute *attr, char *buf)
+{
+	return __create_region_show(to_cxl_root_decoder(dev), buf);
+}
+
+static ssize_t create_dc_region_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t len)
+{
+	return create_region_store(dev, buf, len, CXL_REGION_DC);
+}
+DEVICE_ATTR_RW(create_dc_region);
+
 static ssize_t region_show(struct device *dev, struct device_attribute *attr,
 			   char *buf)
 {
@@ -3168,6 +3194,11 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
 	struct device *dev;
 	int rc;
 
+	if (cxlr->mode == CXL_REGION_DC && cxlr->params.interleave_ways != 1) {
+		dev_err(&cxlr->dev, "Interleaving DC not supported\n");
+		return -EINVAL;
+	}
+
 	cxlr_dax = cxl_dax_region_alloc(cxlr);
 	if (IS_ERR(cxlr_dax))
 		return PTR_ERR(cxlr_dax);
@@ -3260,6 +3291,16 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
 		return ERR_PTR(-EINVAL);
 
 	mode = cxl_decoder_to_region_mode(cxled->mode);
+	if (mode == CXL_REGION_DC) {
+		if (!cxl_dcd_supported(cxled_to_mds(cxled))) {
+			dev_err(&cxled->cxld.dev, "DCD unsupported\n");
+			return ERR_PTR(-EINVAL);
+		}
+		if (cxled->cxld.interleave_ways != 1) {
+			dev_err(&cxled->cxld.dev, "Interleaving and DCD not supported\n");
+			return ERR_PTR(-EINVAL);
+		}
+	}
 	do {
 		cxlr = __create_region(cxlrd, mode,
 				       atomic_read(&cxlrd->region_id));
@@ -3467,6 +3508,7 @@ static int cxl_region_probe(struct device *dev)
 	case CXL_REGION_PMEM:
 		return devm_cxl_add_pmem_region(cxlr);
 	case CXL_REGION_RAM:
+	case CXL_REGION_DC:
 		/*
 		 * The region can not be manged by CXL if any portion of
 		 * it is already online as 'System RAM'
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index fde29e0ad68b..d8cb5195a227 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -178,6 +178,11 @@ static bool is_static(struct dax_region *dax_region)
 	return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
 }
 
+static bool is_sparse(struct dax_region *dax_region)
+{
+	return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
+}
+
 bool static_dev_dax(struct dev_dax *dev_dax)
 {
 	return is_static(dev_dax->region);
@@ -301,6 +306,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
 
 	lockdep_assert_held(&dax_region_rwsem);
 
+	if (is_sparse(dax_region))
+		return 0;
+
 	for_each_dax_region_resource(dax_region, res)
 		size -= resource_size(res);
 	return size;
@@ -1373,6 +1381,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
 		return 0;
 	if (a == &dev_attr_mapping.attr && is_static(dax_region))
 		return 0;
+	if (a == &dev_attr_mapping.attr && is_sparse(dax_region))
+		return 0;
 	if ((a == &dev_attr_align.attr ||
 	     a == &dev_attr_size.attr) && is_static(dax_region))
 		return 0444;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index cbbf64443098..783bfeef42cc 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -13,6 +13,7 @@ struct dax_region;
 /* dax bus specific ioresource flags */
 #define IORESOURCE_DAX_STATIC BIT(0)
 #define IORESOURCE_DAX_KMEM BIT(1)
+#define IORESOURCE_DAX_SPARSE_CAP BIT(2)
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 		struct range *range, int target_node, unsigned int align,
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 9b29e732b39a..367e86b1c22a 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -13,19 +13,31 @@ static int cxl_dax_region_probe(struct device *dev)
 	struct cxl_region *cxlr = cxlr_dax->cxlr;
 	struct dax_region *dax_region;
 	struct dev_dax_data data;
+	resource_size_t dev_size;
+	unsigned long flags;
 
 	if (nid == NUMA_NO_NODE)
 		nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
 
+	flags = IORESOURCE_DAX_KMEM;
+	if (cxlr->mode == CXL_REGION_DC)
+		flags |= IORESOURCE_DAX_SPARSE_CAP;
+
 	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
-				      PMD_SIZE, IORESOURCE_DAX_KMEM);
+				      PMD_SIZE, flags);
 	if (!dax_region)
 		return -ENOMEM;
 
+	if (cxlr->mode == CXL_REGION_DC)
+		/* Add empty seed dax device */
+		dev_size = 0;
+	else
+		dev_size = range_len(&cxlr_dax->hpa_range);
+
 	data = (struct dev_dax_data) {
 		.dax_region = dax_region,
 		.id = -1,
-		.size = range_len(&cxlr_dax->hpa_range),
+		.size = dev_size,
 		.memmap_on_memory = true,
 	};
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 14/25] cxl/events: Split event msgnum configuration from irq setup
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (12 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 13/25] cxl/region: Add sparse DAX region support ira.weiny
@ 2024-08-16 14:44 ` Ira Weiny
  2024-08-16 23:57   ` Dave Jiang
                     ` (3 more replies)
  2024-08-16 14:44 ` [PATCH v3 15/25] cxl/pci: Factor out interrupt policy check Ira Weiny
                   ` (10 subsequent siblings)
  24 siblings, 4 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

Dynamic Capacity Devices (DCD) require event interrupts to process
memory addition or removal.  BIOS may have control over non-DCD event
processing.  DCD interrupt configuration needs to be separate from
memory event interrupt configuration.

Split cxl_event_config_msgnums() from irq setup in preparation for
separate DCD interrupts configuration.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/cxl/pci.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index f7f03599bc83..17bea49bbf4d 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -698,35 +698,31 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
 	return cxl_event_get_int_policy(mds, policy);
 }
 
-static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
+static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
+			      struct cxl_event_interrupt_policy *policy)
 {
 	struct cxl_dev_state *cxlds = &mds->cxlds;
-	struct cxl_event_interrupt_policy policy;
 	int rc;
 
-	rc = cxl_event_config_msgnums(mds, &policy);
-	if (rc)
-		return rc;
-
-	rc = cxl_event_req_irq(cxlds, policy.info_settings);
+	rc = cxl_event_req_irq(cxlds, policy->info_settings);
 	if (rc) {
 		dev_err(cxlds->dev, "Failed to get interrupt for event Info log\n");
 		return rc;
 	}
 
-	rc = cxl_event_req_irq(cxlds, policy.warn_settings);
+	rc = cxl_event_req_irq(cxlds, policy->warn_settings);
 	if (rc) {
 		dev_err(cxlds->dev, "Failed to get interrupt for event Warn log\n");
 		return rc;
 	}
 
-	rc = cxl_event_req_irq(cxlds, policy.failure_settings);
+	rc = cxl_event_req_irq(cxlds, policy->failure_settings);
 	if (rc) {
 		dev_err(cxlds->dev, "Failed to get interrupt for event Failure log\n");
 		return rc;
 	}
 
-	rc = cxl_event_req_irq(cxlds, policy.fatal_settings);
+	rc = cxl_event_req_irq(cxlds, policy->fatal_settings);
 	if (rc) {
 		dev_err(cxlds->dev, "Failed to get interrupt for event Fatal log\n");
 		return rc;
@@ -745,7 +741,7 @@ static bool cxl_event_int_is_fw(u8 setting)
 static int cxl_event_config(struct pci_host_bridge *host_bridge,
 			    struct cxl_memdev_state *mds, bool irq_avail)
 {
-	struct cxl_event_interrupt_policy policy;
+	struct cxl_event_interrupt_policy policy = { 0 };
 	int rc;
 
 	/*
@@ -773,11 +769,15 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
 		return -EBUSY;
 	}
 
+	rc = cxl_event_config_msgnums(mds, &policy);
+	if (rc)
+		return rc;
+
 	rc = cxl_mem_alloc_event_buf(mds);
 	if (rc)
 		return rc;
 
-	rc = cxl_event_irqsetup(mds);
+	rc = cxl_event_irqsetup(mds, &policy);
 	if (rc)
 		return rc;
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 15/25] cxl/pci: Factor out interrupt policy check
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (13 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 14/25] cxl/events: Split event msgnum configuration from irq setup Ira Weiny
@ 2024-08-16 14:44 ` Ira Weiny
  2024-08-22 21:41   ` Fan Ni
  2024-09-03  7:07   ` Li, Ming4
  2024-08-16 14:44 ` [PATCH v3 16/25] cxl/mem: Configure dynamic capacity interrupts ira.weiny
                   ` (9 subsequent siblings)
  24 siblings, 2 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

Dynamic Capacity Devices (DCD) require event interrupts to process
memory addition or removal.  BIOS may have control over non-DCD event
processing.  DCD interrupt configuration needs to be separate from
memory event interrupt configuration.

Factor out event interrupt setting validation.

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: reword commit message]
[iweiny: keep review tags on simple patch]
---
 drivers/cxl/pci.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 17bea49bbf4d..370c74eae323 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -738,6 +738,21 @@ static bool cxl_event_int_is_fw(u8 setting)
 	return mode == CXL_INT_FW;
 }
 
+static bool cxl_event_validate_mem_policy(struct cxl_memdev_state *mds,
+					  struct cxl_event_interrupt_policy *policy)
+{
+	if (cxl_event_int_is_fw(policy->info_settings) ||
+	    cxl_event_int_is_fw(policy->warn_settings) ||
+	    cxl_event_int_is_fw(policy->failure_settings) ||
+	    cxl_event_int_is_fw(policy->fatal_settings)) {
+		dev_err(mds->cxlds.dev,
+			"FW still in control of Event Logs despite _OSC settings\n");
+		return false;
+	}
+
+	return true;
+}
+
 static int cxl_event_config(struct pci_host_bridge *host_bridge,
 			    struct cxl_memdev_state *mds, bool irq_avail)
 {
@@ -760,14 +775,8 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
 	if (rc)
 		return rc;
 
-	if (cxl_event_int_is_fw(policy.info_settings) ||
-	    cxl_event_int_is_fw(policy.warn_settings) ||
-	    cxl_event_int_is_fw(policy.failure_settings) ||
-	    cxl_event_int_is_fw(policy.fatal_settings)) {
-		dev_err(mds->cxlds.dev,
-			"FW still in control of Event Logs despite _OSC settings\n");
+	if (!cxl_event_validate_mem_policy(mds, &policy))
 		return -EBUSY;
-	}
 
 	rc = cxl_event_config_msgnums(mds, &policy);
 	if (rc)

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 16/25] cxl/mem: Configure dynamic capacity interrupts
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (14 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 15/25] cxl/pci: Factor out interrupt policy check Ira Weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-08-17  0:02   ` Dave Jiang
                     ` (2 more replies)
  2024-08-16 14:44 ` [PATCH v3 17/25] cxl/core: Return endpoint decoder information from region search Ira Weiny
                   ` (8 subsequent siblings)
  24 siblings, 3 replies; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

From: Navneet Singh <navneet.singh@intel.com>

Dynamic Capacity Devices (DCD) support extent change notifications
through the event log mechanism.  The interrupt mailbox commands were
extended in CXL 3.1 to support these notifications.  Firmware can't
configure DCD events to be FW controlled but can retain control of
memory events.

Configure DCD event log interrupts on devices supporting dynamic
capacity.  Disable DCD if interrupts are not supported.

Care is taken to preserve the interrupt policy set by the FW if FW first
has been selected by the BIOS.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: update commit message]
[iweiny: rebase to upstream irq code]
[iweiny: disable DCD if irqs not supported]
[Jonathan: formatting fix]
[Fan: add text to debug print]
[djiang: make dcd helpers inline]
---
 drivers/cxl/cxlmem.h |  2 ++
 drivers/cxl/pci.c    | 72 +++++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 62 insertions(+), 12 deletions(-)

diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index b4eb8164d05d..d41bec5433db 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -225,7 +225,9 @@ struct cxl_event_interrupt_policy {
 	u8 warn_settings;
 	u8 failure_settings;
 	u8 fatal_settings;
+	u8 dcd_settings;
 } __packed;
+#define CXL_EVENT_INT_POLICY_BASE_SIZE 4 /* info, warn, failure, fatal */
 
 /**
  * struct cxl_event_state - Event log driver state
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 370c74eae323..e5430c4e3a3b 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -669,22 +669,33 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
 }
 
 static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
-				    struct cxl_event_interrupt_policy *policy)
+				    struct cxl_event_interrupt_policy *policy,
+				    bool native_cxl)
 {
+	size_t size_in = CXL_EVENT_INT_POLICY_BASE_SIZE;
 	struct cxl_mbox_cmd mbox_cmd;
 	int rc;
 
-	*policy = (struct cxl_event_interrupt_policy) {
-		.info_settings = CXL_INT_MSI_MSIX,
-		.warn_settings = CXL_INT_MSI_MSIX,
-		.failure_settings = CXL_INT_MSI_MSIX,
-		.fatal_settings = CXL_INT_MSI_MSIX,
-	};
+	/* memory event policy is left if FW has control */
+	if (native_cxl) {
+		*policy = (struct cxl_event_interrupt_policy) {
+			.info_settings = CXL_INT_MSI_MSIX,
+			.warn_settings = CXL_INT_MSI_MSIX,
+			.failure_settings = CXL_INT_MSI_MSIX,
+			.fatal_settings = CXL_INT_MSI_MSIX,
+			.dcd_settings = 0,
+		};
+	}
+
+	if (cxl_dcd_supported(mds)) {
+		policy->dcd_settings = CXL_INT_MSI_MSIX;
+		size_in += sizeof(policy->dcd_settings);
+	}
 
 	mbox_cmd = (struct cxl_mbox_cmd) {
 		.opcode = CXL_MBOX_OP_SET_EVT_INT_POLICY,
 		.payload_in = policy,
-		.size_in = sizeof(*policy),
+		.size_in = size_in,
 	};
 
 	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
@@ -731,6 +742,31 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
 	return 0;
 }
 
+static int cxl_irqsetup(struct cxl_memdev_state *mds,
+			struct cxl_event_interrupt_policy *policy,
+			bool native_cxl)
+{
+	struct cxl_dev_state *cxlds = &mds->cxlds;
+	int rc;
+
+	if (native_cxl) {
+		rc = cxl_event_irqsetup(mds, policy);
+		if (rc)
+			return rc;
+	}
+
+	if (cxl_dcd_supported(mds)) {
+		rc = cxl_event_req_irq(cxlds, policy->dcd_settings);
+		if (rc) {
+			dev_err(cxlds->dev, "Failed to get interrupt for DCD event log\n");
+			cxl_disable_dcd(mds);
+			return rc;
+		}
+	}
+
+	return 0;
+}
+
 static bool cxl_event_int_is_fw(u8 setting)
 {
 	u8 mode = FIELD_GET(CXLDEV_EVENT_INT_MODE_MASK, setting);
@@ -757,17 +793,25 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
 			    struct cxl_memdev_state *mds, bool irq_avail)
 {
 	struct cxl_event_interrupt_policy policy = { 0 };
+	bool native_cxl = host_bridge->native_cxl_error;
 	int rc;
 
 	/*
 	 * When BIOS maintains CXL error reporting control, it will process
 	 * event records.  Only one agent can do so.
+	 *
+	 * If BIOS has control of events and DCD is not supported skip event
+	 * configuration.
 	 */
-	if (!host_bridge->native_cxl_error)
+	if (!native_cxl && !cxl_dcd_supported(mds))
 		return 0;
 
 	if (!irq_avail) {
 		dev_info(mds->cxlds.dev, "No interrupt support, disable event processing.\n");
+		if (cxl_dcd_supported(mds)) {
+			dev_info(mds->cxlds.dev, "DCD requires interrupts, disable DCD\n");
+			cxl_disable_dcd(mds);
+		}
 		return 0;
 	}
 
@@ -775,10 +819,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
 	if (rc)
 		return rc;
 
-	if (!cxl_event_validate_mem_policy(mds, &policy))
+	if (native_cxl && !cxl_event_validate_mem_policy(mds, &policy))
 		return -EBUSY;
 
-	rc = cxl_event_config_msgnums(mds, &policy);
+	rc = cxl_event_config_msgnums(mds, &policy, native_cxl);
 	if (rc)
 		return rc;
 
@@ -786,12 +830,16 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
 	if (rc)
 		return rc;
 
-	rc = cxl_event_irqsetup(mds, &policy);
+	rc = cxl_irqsetup(mds, &policy, native_cxl);
 	if (rc)
 		return rc;
 
 	cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
 
+	dev_dbg(mds->cxlds.dev, "Event config : %s DCD %s\n",
+		native_cxl ? "OS" : "BIOS",
+		cxl_dcd_supported(mds) ? "supported" : "not supported");
+
 	return 0;
 }
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 17/25] cxl/core: Return endpoint decoder information from region search
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (15 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 16/25] cxl/mem: Configure dynamic capacity interrupts ira.weiny
@ 2024-08-16 14:44 ` Ira Weiny
  2024-08-19 16:35   ` Dave Jiang
                     ` (2 more replies)
  2024-08-16 14:44 ` [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents ira.weiny
                   ` (7 subsequent siblings)
  24 siblings, 3 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

cxl_dpa_to_region() finds the region from a <DPA, device> tuple.
The search involves finding the device endpoint decoder as well.

Dynamic capacity extent processing uses the endpoint decoder HPA
information to calculate the HPA offset.  In addition, well behaved
extents should be contained within an endpoint decoder.

Return the endpoint decoder found to be used in subsequent DCD code.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/cxl/core/core.h   | 6 ++++--
 drivers/cxl/core/mbox.c   | 2 +-
 drivers/cxl/core/memdev.c | 4 ++--
 drivers/cxl/core/region.c | 8 +++++++-
 4 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 15b6cf1c19ef..76c4153a9b2c 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -39,7 +39,8 @@ void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
 int cxl_region_init(void);
 void cxl_region_exit(void);
 int cxl_get_poison_by_endpoint(struct cxl_port *port);
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa);
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+				     struct cxl_endpoint_decoder **cxled);
 u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
 		   u64 dpa);
 
@@ -50,7 +51,8 @@ static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
 	return ULLONG_MAX;
 }
 static inline
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+				     struct cxl_endpoint_decoder **cxled)
 {
 	return NULL;
 }
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 68c26c4be91a..01a447aaa1b1 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -909,7 +909,7 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 		guard(rwsem_read)(&cxl_dpa_rwsem);
 
 		dpa = le64_to_cpu(evt->media_hdr.phys_addr) & CXL_DPA_MASK;
-		cxlr = cxl_dpa_to_region(cxlmd, dpa);
+		cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
 		if (cxlr)
 			hpa = cxl_dpa_to_hpa(cxlr, cxlmd, dpa);
 
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 7da1f0f5711a..12fb07fb89a6 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -323,7 +323,7 @@ int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa)
 	if (rc)
 		goto out;
 
-	cxlr = cxl_dpa_to_region(cxlmd, dpa);
+	cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
 	if (cxlr)
 		dev_warn_once(mds->cxlds.dev,
 			      "poison inject dpa:%#llx region: %s\n", dpa,
@@ -387,7 +387,7 @@ int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa)
 	if (rc)
 		goto out;
 
-	cxlr = cxl_dpa_to_region(cxlmd, dpa);
+	cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
 	if (cxlr)
 		dev_warn_once(mds->cxlds.dev,
 			      "poison clear dpa:%#llx region: %s\n", dpa,
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 35c4a1f4f9bd..8e0884b52f84 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2828,6 +2828,7 @@ int cxl_get_poison_by_endpoint(struct cxl_port *port)
 struct cxl_dpa_to_region_context {
 	struct cxl_region *cxlr;
 	u64 dpa;
+	struct cxl_endpoint_decoder *cxled;
 };
 
 static int __cxl_dpa_to_region(struct device *dev, void *arg)
@@ -2861,11 +2862,13 @@ static int __cxl_dpa_to_region(struct device *dev, void *arg)
 			dev_name(dev));
 
 	ctx->cxlr = cxlr;
+	ctx->cxled = cxled;
 
 	return 1;
 }
 
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+				     struct cxl_endpoint_decoder **cxled)
 {
 	struct cxl_dpa_to_region_context ctx;
 	struct cxl_port *port;
@@ -2877,6 +2880,9 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
 	if (port && is_cxl_endpoint(port) && cxl_num_decoders_committed(port))
 		device_for_each_child(&port->dev, &ctx, __cxl_dpa_to_region);
 
+	if (cxled)
+		*cxled = ctx.cxled;
+
 	return ctx.cxlr;
 }
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (16 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 17/25] cxl/core: Return endpoint decoder information from region search Ira Weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-08-19 18:51   ` Dave Jiang
                     ` (4 more replies)
  2024-08-16 14:44 ` [PATCH v3 19/25] cxl/region/extent: Expose region extent information in sysfs ira.weiny
                   ` (6 subsequent siblings)
  24 siblings, 5 replies; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

From: Navneet Singh <navneet.singh@intel.com>

A dynamic capacity device (DCD) sends events to signal the host for
changes in the availability of Dynamic Capacity (DC) memory.  These
events contain extents describing a DPA range and meta data for memory
to be added or removed.  Events may be sent from the device at any time.

Three types of events can be signaled, Add, Release, and Force Release.

On add, the host may accept or reject the memory being offered.  If no
region exists, or the extent is invalid, the extent should be rejected.
Add extent events may be grouped by a 'more' bit which indicates those
extents should be processed as a group.

On remove, the host can delay the response until the host is safely not
using the memory.  If no region exists the release can be sent
immediately.  The host may also release extents (or partial extents) at
any time.  Thus the 'more' bit grouping of release events is of less
value and can be ignored in favor of sending multiple release capacity
responses for groups of release events.

Force removal is intended as a mechanism between the FM and the device
and intended only when the host is unresponsive, out of sync, or
otherwise broken.  Purposely ignore force removal events.

Regions are made up of one or more devices which may be surfacing memory
to the host.  Once all devices in a region have surfaced an extent the
region can expose a corresponding extent for the user to consume.
Without interleaving a device extent forms a 1:1 relationship with the
region extent.  Immediately surface a region extent upon getting a
device extent.

Per the specification the device is allowed to offer or remove extents
at any time.  However, anticipated use cases can expect extents to be
offered, accepted, and removed in well defined chunks.

Simplify extent tracking with the following restrictions.

	1) Flag for removal any extent which overlaps a requested
	   release range.
	2) Refuse the offer of extents which overlap already accepted
	   memory ranges.
	3) Accept again a range which has already been accepted by the
	   host.  (It is likely the device has an error because it
	   should already know that this range was accepted.  But from
	   the host point of view it is safe to acknowledge that
	   acceptance again.)

Management of the region extent devices must be synchronized with
potential uses of the memory within the DAX layer.  Create region extent
devices as children of the cxl_dax_region device such that the DAX
region driver can co-drive them and synchronize with the DAX layer.
Synchronization and management is handled in a subsequent patch.

Process DCD events and create region devices.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: combine this with the extent surface patches to better show the
         lifetime extent objects in review]
[iweiny: clean up commit message.]
[iweiny: move extent verification of the 'read extents on region
         creation' to this patch]
[iweiny: Provide for a common path for extent realization between an add
	 event and adding existing extents.]
[iweiny: Persist a check that an extent is within an endpoint decoder]
[iweiny: reduce exported and non-static calls]
[iweiny: use %par]

	<Combined comments from the old patches which were addressed>

[Jonathan: implement the more bit with a simple algorithm which accepts
	   all extents it can.
	   Also include the response more bit to prevent payload
	   overflow]
[Fan: Do not error if a contained extent is added.]
[Jonathan: allocate ida after kzalloc]
[iweiny: fix ida resource leak]
[fan/djiang: remove unneeded memset]
[djiang: fix indentation]
[Jonathan: Fix indentation]
[Jonathan/djbw: make tag a uuid]
[djbw: create helper calc_hpa_range() straight away]
[djbw: Allow for multiple cxled_extents per region_extent]
[djbw: s/cxl_ed/cxled]
[djbw: s/cxl_release_ed_extent/cxled_release_extent/]
[djbw: s/reg_ext/region_extent/]
[djbw: s/dc_extent/extent/]
[Gregory/djbw: reject shared extents]
[iweiny: predicate extent.c compile on CONFIG_CXL_REGION]
---
 drivers/cxl/core/Makefile |   2 +-
 drivers/cxl/core/core.h   |  13 ++
 drivers/cxl/core/extent.c | 345 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/cxl/core/mbox.c   | 268 ++++++++++++++++++++++++++++++++++-
 drivers/cxl/core/region.c |   6 +
 drivers/cxl/cxl.h         |  52 ++++++-
 drivers/cxl/cxlmem.h      |  26 ++++
 include/linux/cxl-event.h |  32 +++++
 tools/testing/cxl/Kbuild  |   3 +-
 9 files changed, 743 insertions(+), 4 deletions(-)

diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 9259bcc6773c..3b812515e725 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -15,4 +15,4 @@ cxl_core-y += hdm.o
 cxl_core-y += pmu.o
 cxl_core-y += cdat.o
 cxl_core-$(CONFIG_TRACING) += trace.o
-cxl_core-$(CONFIG_CXL_REGION) += region.o
+cxl_core-$(CONFIG_CXL_REGION) += region.o extent.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 76c4153a9b2c..8dfc97b2e0a4 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -44,12 +44,24 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
 u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
 		   u64 dpa);
 
+int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
+int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
 #else
 static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
 				 const struct cxl_memdev *cxlmd, u64 dpa)
 {
 	return ULLONG_MAX;
 }
+static inline int cxl_add_extent(struct cxl_memdev_state *mds,
+				   struct cxl_extent *extent)
+{
+	return 0;
+}
+static inline int cxl_rm_extent(struct cxl_memdev_state *mds,
+				struct cxl_extent *extent)
+{
+	return 0;
+}
 static inline
 struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
 				     struct cxl_endpoint_decoder **cxled)
@@ -121,5 +133,6 @@ long cxl_pci_get_latency(struct pci_dev *pdev);
 int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
 				       enum access_coordinate_class access);
 bool cxl_need_node_perf_attrs_update(int nid);
+void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range);
 
 #endif /* __CXL_CORE_H__ */
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
new file mode 100644
index 000000000000..34456594cdc3
--- /dev/null
+++ b/drivers/cxl/core/extent.c
@@ -0,0 +1,345 @@
+// SPDX-License-Identifier: GPL-2.0
+/*  Copyright(c) 2024 Intel Corporation. All rights reserved. */
+
+#include <linux/device.h>
+#include <cxl.h>
+
+#include "core.h"
+
+static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
+				 struct cxled_extent *ed_extent)
+{
+	struct cxl_memdev_state *mds = cxled_to_mds(cxled);
+	struct device *dev = &cxled->cxld.dev;
+
+	dev_dbg(dev, "Remove extent %par (%*phC)\n", &ed_extent->dpa_range,
+		CXL_EXTENT_TAG_LEN, ed_extent->tag);
+	memdev_release_extent(mds, &ed_extent->dpa_range);
+	kfree(ed_extent);
+}
+
+static void free_region_extent(struct region_extent *region_extent)
+{
+	struct cxled_extent *ed_extent;
+	unsigned long index;
+
+	/*
+	 * Remove from each endpoint decoder the extent which backs this region
+	 * extent
+	 */
+	xa_for_each(&region_extent->decoder_extents, index, ed_extent)
+		cxled_release_extent(ed_extent->cxled, ed_extent);
+	xa_destroy(&region_extent->decoder_extents);
+	ida_free(&region_extent->cxlr_dax->extent_ida, region_extent->dev.id);
+	kfree(region_extent);
+}
+
+static void region_extent_release(struct device *dev)
+{
+	struct region_extent *region_extent = to_region_extent(dev);
+
+	free_region_extent(region_extent);
+}
+
+static const struct device_type region_extent_type = {
+	.name = "extent",
+	.release = region_extent_release,
+};
+
+bool is_region_extent(struct device *dev)
+{
+	return dev->type == &region_extent_type;
+}
+EXPORT_SYMBOL_NS_GPL(is_region_extent, CXL);
+
+static void region_extent_unregister(void *ext)
+{
+	struct region_extent *region_extent = ext;
+
+	dev_dbg(&region_extent->dev, "DAX region rm extent HPA %par\n",
+		&region_extent->hpa_range);
+	device_unregister(&region_extent->dev);
+}
+
+static void region_rm_extent(struct region_extent *region_extent)
+{
+	struct device *region_dev = region_extent->dev.parent;
+
+	devm_release_action(region_dev, region_extent_unregister, region_extent);
+}
+
+static struct region_extent *
+alloc_region_extent(struct cxl_dax_region *cxlr_dax, struct range *hpa_range, u8 *tag)
+{
+	int id;
+
+	struct region_extent *region_extent __free(kfree) =
+				kzalloc(sizeof(*region_extent), GFP_KERNEL);
+	if (!region_extent)
+		return ERR_PTR(-ENOMEM);
+
+	id = ida_alloc(&cxlr_dax->extent_ida, GFP_KERNEL);
+	if (id < 0)
+		return ERR_PTR(-ENOMEM);
+
+	region_extent->hpa_range = *hpa_range;
+	region_extent->cxlr_dax = cxlr_dax;
+	import_uuid(&region_extent->tag, tag);
+	region_extent->dev.id = id;
+	xa_init(&region_extent->decoder_extents);
+	return no_free_ptr(region_extent);
+}
+
+static int online_region_extent(struct region_extent *region_extent)
+{
+	struct cxl_dax_region *cxlr_dax = region_extent->cxlr_dax;
+	struct device *dev;
+	int rc;
+
+	dev = &region_extent->dev;
+	device_initialize(dev);
+	device_set_pm_not_required(dev);
+	dev->parent = &cxlr_dax->dev;
+	dev->type = &region_extent_type;
+	rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id, dev->id);
+	if (rc)
+		goto err;
+
+	rc = device_add(dev);
+	if (rc)
+		goto err;
+
+	dev_dbg(dev, "region extent HPA %par\n", &region_extent->hpa_range);
+	return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
+					region_extent);
+
+err:
+	dev_err(&cxlr_dax->dev, "Failed to initialize region extent HPA %par\n",
+		&region_extent->hpa_range);
+
+	put_device(dev);
+	return rc;
+}
+
+struct match_data {
+	struct cxl_endpoint_decoder *cxled;
+	struct range *new_range;
+};
+
+static int match_contains(struct device *dev, void *data)
+{
+	struct region_extent *region_extent = to_region_extent(dev);
+	struct match_data *md = data;
+	struct cxled_extent *entry;
+	unsigned long index;
+
+	if (!region_extent)
+		return 0;
+
+	xa_for_each(&region_extent->decoder_extents, index, entry) {
+		if (md->cxled == entry->cxled &&
+		    range_contains(&entry->dpa_range, md->new_range))
+			return true;
+	}
+	return false;
+}
+
+static bool extents_contain(struct cxl_dax_region *cxlr_dax,
+			    struct cxl_endpoint_decoder *cxled,
+			    struct range *new_range)
+{
+	struct device *extent_device;
+	struct match_data md = {
+		.cxled = cxled,
+		.new_range = new_range,
+	};
+
+	extent_device = device_find_child(&cxlr_dax->dev, &md, match_contains);
+	if (!extent_device)
+		return false;
+
+	put_device(extent_device);
+	return true;
+}
+
+static int match_overlaps(struct device *dev, void *data)
+{
+	struct region_extent *region_extent = to_region_extent(dev);
+	struct match_data *md = data;
+	struct cxled_extent *entry;
+	unsigned long index;
+
+	if (!region_extent)
+		return 0;
+
+	xa_for_each(&region_extent->decoder_extents, index, entry) {
+		if (md->cxled == entry->cxled &&
+		    range_overlaps(&entry->dpa_range, md->new_range))
+			return true;
+	}
+
+	return false;
+}
+
+static bool extents_overlap(struct cxl_dax_region *cxlr_dax,
+			    struct cxl_endpoint_decoder *cxled,
+			    struct range *new_range)
+{
+	struct device *extent_device;
+	struct match_data md = {
+		.cxled = cxled,
+		.new_range = new_range,
+	};
+
+	extent_device = device_find_child(&cxlr_dax->dev, &md, match_overlaps);
+	if (!extent_device)
+		return false;
+
+	put_device(extent_device);
+	return true;
+}
+
+static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
+			   struct cxl_dax_region *cxlr_dax,
+			   struct range *dpa_range,
+			   struct range *hpa_range)
+{
+	resource_size_t dpa_offset, hpa;
+
+	dpa_offset = dpa_range->start - cxled->dpa_res->start;
+	hpa = cxled->cxld.hpa_range.start + dpa_offset;
+
+	hpa_range->start = hpa - cxlr_dax->hpa_range.start;
+	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
+}
+
+static int cxlr_rm_extent(struct device *dev, void *data)
+{
+	struct region_extent *region_extent = to_region_extent(dev);
+	struct range *region_hpa_range = data;
+
+	if (!region_extent)
+		return 0;
+
+	/*
+	 * Any extent which 'touches' the released range is removed.
+	 */
+	if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) {
+		dev_dbg(dev, "Remove region extent HPA %par\n",
+			&region_extent->hpa_range);
+		region_rm_extent(region_extent);
+	}
+	return 0;
+}
+
+int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
+{
+	u64 start_dpa = le64_to_cpu(extent->start_dpa);
+	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+	struct cxl_endpoint_decoder *cxled;
+	struct range hpa_range, dpa_range;
+	struct cxl_region *cxlr;
+
+	dpa_range = (struct range) {
+		.start = start_dpa,
+		.end = start_dpa + le64_to_cpu(extent->length) - 1,
+	};
+
+	guard(rwsem_read)(&cxl_region_rwsem);
+	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
+	if (!cxlr) {
+		memdev_release_extent(mds, &dpa_range);
+		return -ENXIO;
+	}
+
+	calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
+
+	/* Remove region extents which overlap */
+	return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
+				     cxlr_rm_extent);
+}
+
+static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
+			   struct cxl_endpoint_decoder *cxled,
+			   struct cxled_extent *ed_extent)
+{
+	struct region_extent *region_extent;
+	struct range hpa_range;
+	int rc;
+
+	calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range);
+
+	region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag);
+	if (IS_ERR(region_extent))
+		return PTR_ERR(region_extent);
+
+	rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent, ed_extent,
+		       GFP_KERNEL);
+	if (rc) {
+		free_region_extent(region_extent);
+		return rc;
+	}
+
+	/* device model handles freeing region_extent */
+	return online_region_extent(region_extent);
+}
+
+/* Callers are expected to ensure cxled has been attached to a region */
+int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
+{
+	u64 start_dpa = le64_to_cpu(extent->start_dpa);
+	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+	struct cxl_endpoint_decoder *cxled;
+	struct range ed_range, ext_range;
+	struct cxl_dax_region *cxlr_dax;
+	struct cxled_extent *ed_extent;
+	struct cxl_region *cxlr;
+	struct device *dev;
+
+	ext_range = (struct range) {
+		.start = start_dpa,
+		.end = start_dpa + le64_to_cpu(extent->length) - 1,
+	};
+
+	guard(rwsem_read)(&cxl_region_rwsem);
+	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
+	if (!cxlr)
+		return -ENXIO;
+
+	cxlr_dax = cxled->cxld.region->cxlr_dax;
+	dev = &cxled->cxld.dev;
+	ed_range = (struct range) {
+		.start = cxled->dpa_res->start,
+		.end = cxled->dpa_res->end,
+	};
+
+	dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n",
+		cxled->dpa_res, &ext_range);
+
+	if (!range_contains(&ed_range, &ext_range)) {
+		dev_err_ratelimited(dev,
+				    "DC extent DPA %par (%*phC) is not fully in ED %par\n",
+				    &ext_range.start, CXL_EXTENT_TAG_LEN,
+				    extent->tag, &ed_range);
+		return -ENXIO;
+	}
+
+	if (extents_contain(cxlr_dax, cxled, &ext_range))
+		return 0;
+
+	if (extents_overlap(cxlr_dax, cxled, &ext_range))
+		return -ENXIO;
+
+	ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL);
+	if (!ed_extent)
+		return -ENOMEM;
+
+	ed_extent->cxled = cxled;
+	ed_extent->dpa_range = ext_range;
+	memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN);
+
+	dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range,
+		CXL_EXTENT_TAG_LEN, ed_extent->tag);
+
+	return cxlr_add_extent(cxlr_dax, cxled, ed_extent);
+}
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 01a447aaa1b1..f629ad7488ac 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
 
+static int cxl_validate_extent(struct cxl_memdev_state *mds,
+			       struct cxl_extent *extent)
+{
+	u64 start = le64_to_cpu(extent->start_dpa);
+	u64 length = le64_to_cpu(extent->length);
+	struct device *dev = mds->cxlds.dev;
+
+	struct range ext_range = (struct range){
+		.start = start,
+		.end = start + length - 1,
+	};
+
+	if (le16_to_cpu(extent->shared_extn_seq) != 0) {
+		dev_err_ratelimited(dev,
+				    "DC extent DPA %par (%*phC) can not be shared\n",
+				    &ext_range.start, CXL_EXTENT_TAG_LEN,
+				    extent->tag);
+		return -ENXIO;
+	}
+
+	/* Extents must not cross DC region boundary's */
+	for (int i = 0; i < mds->nr_dc_region; i++) {
+		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
+		struct range region_range = (struct range) {
+			.start = dcr->base,
+			.end = dcr->base + dcr->decode_len - 1,
+		};
+
+		if (range_contains(&region_range, &ext_range)) {
+			dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n",
+				&ext_range, i, start - dcr->base,
+				CXL_EXTENT_TAG_LEN, extent->tag);
+			return 0;
+		}
+	}
+
+	dev_err_ratelimited(dev,
+			    "DC extent DPA %par (%*phC) is not in any DC region\n",
+			    &ext_range, CXL_EXTENT_TAG_LEN, extent->tag);
+	return -ENXIO;
+}
+
 void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 			    enum cxl_event_log_type type,
 			    enum cxl_event_type event_type,
@@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
 	return rc;
 }
 
+static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
+				struct xarray *extent_array, int cnt)
+{
+	struct cxl_mbox_dc_response *p;
+	struct cxl_mbox_cmd mbox_cmd;
+	struct cxl_extent *extent;
+	unsigned long index;
+	u32 pl_index;
+	int rc = 0;
+
+	size_t pl_size = struct_size(p, extent_list, cnt);
+	u32 max_extents = cnt;
+
+	/* May have to use more bit on response. */
+	if (pl_size > mds->payload_size) {
+		max_extents = (mds->payload_size - sizeof(*p)) /
+			      sizeof(struct updated_extent_list);
+		pl_size = struct_size(p, extent_list, max_extents);
+	}
+
+	struct cxl_mbox_dc_response *response __free(kfree) =
+						kzalloc(pl_size, GFP_KERNEL);
+	if (!response)
+		return -ENOMEM;
+
+	pl_index = 0;
+	xa_for_each(extent_array, index, extent) {
+
+		response->extent_list[pl_index].dpa_start = extent->start_dpa;
+		response->extent_list[pl_index].length = extent->length;
+		pl_index++;
+		response->extent_list_size = cpu_to_le32(pl_index);
+
+		if (pl_index == max_extents) {
+			mbox_cmd = (struct cxl_mbox_cmd) {
+				.opcode = opcode,
+				.size_in = struct_size(response, extent_list,
+						       pl_index),
+				.payload_in = response,
+			};
+
+			response->flags = 0;
+			if (pl_index < cnt)
+				response->flags &= CXL_DCD_EVENT_MORE;
+
+			rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+			if (rc)
+				return rc;
+			pl_index = 0;
+		}
+	}
+
+	if (pl_index) {
+		mbox_cmd = (struct cxl_mbox_cmd) {
+			.opcode = opcode,
+			.size_in = struct_size(response, extent_list,
+					       pl_index),
+			.payload_in = response,
+		};
+
+		response->flags = 0;
+		rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+	}
+
+	return rc;
+}
+
+void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
+{
+	struct device *dev = mds->cxlds.dev;
+	struct xarray extent_list;
+
+	struct cxl_extent extent = {
+		.start_dpa = cpu_to_le64(range->start),
+		.length = cpu_to_le64(range_len(range)),
+	};
+
+	dev_dbg(dev, "Release response dpa %par\n", range);
+
+	xa_init(&extent_list);
+	if (xa_insert(&extent_list, 0, &extent, GFP_KERNEL)) {
+		dev_dbg(dev, "Failed to release %par\n", range);
+		goto destroy;
+	}
+
+	if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))
+		dev_dbg(dev, "Failed to release %par\n", range);
+
+destroy:
+	xa_destroy(&extent_list);
+}
+
+static int validate_add_extent(struct cxl_memdev_state *mds,
+			       struct cxl_extent *extent)
+{
+	int rc;
+
+	rc = cxl_validate_extent(mds, extent);
+	if (rc)
+		return rc;
+
+	return cxl_add_extent(mds, extent);
+}
+
+static int cxl_add_pending(struct cxl_memdev_state *mds)
+{
+	struct device *dev = mds->cxlds.dev;
+	struct cxl_extent *extent;
+	unsigned long index;
+	unsigned long cnt = 0;
+	int rc;
+
+	xa_for_each(&mds->pending_extents, index, extent) {
+		if (validate_add_extent(mds, extent)) {
+			dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n",
+				le64_to_cpu(extent->start_dpa),
+				le64_to_cpu(extent->length));
+			xa_erase(&mds->pending_extents, index);
+			kfree(extent);
+			continue;
+		}
+		cnt++;
+	}
+	rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
+				  &mds->pending_extents, cnt);
+	xa_for_each(&mds->pending_extents, index, extent) {
+		xa_erase(&mds->pending_extents, index);
+		kfree(extent);
+	}
+	return rc;
+}
+
+static int handle_add_event(struct cxl_memdev_state *mds,
+			    struct cxl_event_dcd *event)
+{
+	struct cxl_extent *tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
+	struct device *dev = mds->cxlds.dev;
+
+	if (!tmp)
+		return -ENOMEM;
+
+	memcpy(tmp, &event->extent, sizeof(*tmp));
+	if (xa_insert(&mds->pending_extents, (unsigned long)tmp, tmp,
+		      GFP_KERNEL)) {
+		kfree(tmp);
+		return -ENOMEM;
+	}
+
+	if (event->flags & CXL_DCD_EVENT_MORE) {
+		dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
+		return 0;
+	}
+
+	/* extents are removed and free'ed in cxl_add_pending() */
+	return cxl_add_pending(mds);
+}
+
+static char *cxl_dcd_evt_type_str(u8 type)
+{
+	switch (type) {
+	case DCD_ADD_CAPACITY:
+		return "add";
+	case DCD_RELEASE_CAPACITY:
+		return "release";
+	case DCD_FORCED_CAPACITY_RELEASE:
+		return "force release";
+	default:
+		break;
+	}
+
+	return "<unknown>";
+}
+
+static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
+					struct cxl_event_record_raw *raw_rec)
+{
+	struct cxl_event_dcd *event = &raw_rec->event.dcd;
+	struct cxl_extent *extent = &event->extent;
+	struct device *dev = mds->cxlds.dev;
+	uuid_t *id = &raw_rec->id;
+
+	if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID))
+		return -EINVAL;
+
+	dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n",
+		cxl_dcd_evt_type_str(event->event_type),
+		le64_to_cpu(extent->start_dpa), le64_to_cpu(extent->length));
+
+	switch (event->event_type) {
+	case DCD_ADD_CAPACITY:
+		return handle_add_event(mds, event);
+	case DCD_RELEASE_CAPACITY:
+		return cxl_rm_extent(mds, &event->extent);
+	case DCD_FORCED_CAPACITY_RELEASE:
+		dev_err_ratelimited(dev, "Forced release event ignored.\n");
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
 				    enum cxl_event_log_type type)
 {
@@ -1044,9 +1287,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
 		if (!nr_rec)
 			break;
 
-		for (i = 0; i < nr_rec; i++)
+		for (i = 0; i < nr_rec; i++) {
 			__cxl_event_trace_record(cxlmd, type,
 						 &payload->records[i]);
+			if (type == CXL_EVENT_TYPE_DCD) {
+				rc = cxl_handle_dcd_event_records(mds,
+								  &payload->records[i]);
+				if (rc)
+					dev_err_ratelimited(dev, "dcd event failed: %d\n",
+							    rc);
+			}
+		}
 
 		if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
 			trace_cxl_overflow(cxlmd, type, payload);
@@ -1078,6 +1329,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
 {
 	dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status);
 
+	if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD))
+		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
 	if (status & CXLDEV_EVENT_STATUS_FATAL)
 		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_FATAL);
 	if (status & CXLDEV_EVENT_STATUS_FAIL)
@@ -1610,6 +1863,17 @@ int cxl_poison_state_init(struct cxl_memdev_state *mds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_poison_state_init, CXL);
 
+static void clear_pending_extents(void *_mds)
+{
+	struct cxl_memdev_state *mds = _mds;
+	struct cxl_extent *extent;
+	unsigned long index;
+
+	xa_for_each(&mds->pending_extents, index, extent)
+		kfree(extent);
+	xa_destroy(&mds->pending_extents);
+}
+
 struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
 {
 	struct cxl_memdev_state *mds;
@@ -1628,6 +1892,8 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
 	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
 	mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID;
 	mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID;
+	xa_init(&mds->pending_extents);
+	devm_add_action_or_reset(dev, clear_pending_extents, mds);
 
 	return mds;
 }
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 8e0884b52f84..8c9171f914fb 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3037,6 +3037,7 @@ static void cxl_dax_region_release(struct device *dev)
 {
 	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
 
+	ida_destroy(&cxlr_dax->extent_ida);
 	kfree(cxlr_dax);
 }
 
@@ -3090,6 +3091,8 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
 
 	dev = &cxlr_dax->dev;
 	cxlr_dax->cxlr = cxlr;
+	cxlr->cxlr_dax = cxlr_dax;
+	ida_init(&cxlr_dax->extent_ida);
 	device_initialize(dev);
 	lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
 	device_set_pm_not_required(dev);
@@ -3190,7 +3193,10 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
 static void cxlr_dax_unregister(void *_cxlr_dax)
 {
 	struct cxl_dax_region *cxlr_dax = _cxlr_dax;
+	struct cxl_region *cxlr = cxlr_dax->cxlr;
 
+	cxlr->cxlr_dax = NULL;
+	cxlr_dax->cxlr = NULL;
 	device_unregister(&cxlr_dax->dev);
 }
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 16861c867537..c858e3957fd5 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -11,6 +11,7 @@
 #include <linux/log2.h>
 #include <linux/node.h>
 #include <linux/io.h>
+#include <linux/cxl-event.h>
 
 extern const struct nvdimm_security_ops *cxl_security_ops;
 
@@ -169,11 +170,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
 #define CXLDEV_EVENT_STATUS_WARN		BIT(1)
 #define CXLDEV_EVENT_STATUS_FAIL		BIT(2)
 #define CXLDEV_EVENT_STATUS_FATAL		BIT(3)
+#define CXLDEV_EVENT_STATUS_DCD			BIT(4)
 
 #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO |	\
 				 CXLDEV_EVENT_STATUS_WARN |	\
 				 CXLDEV_EVENT_STATUS_FAIL |	\
-				 CXLDEV_EVENT_STATUS_FATAL)
+				 CXLDEV_EVENT_STATUS_FATAL |	\
+				 CXLDEV_EVENT_STATUS_DCD)
 
 /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
 #define CXLDEV_EVENT_INT_MODE_MASK	GENMASK(1, 0)
@@ -444,6 +447,18 @@ enum cxl_decoder_state {
 	CXL_DECODER_STATE_AUTO,
 };
 
+/**
+ * struct cxled_extent - Extent within an endpoint decoder
+ * @cxled: Reference to the endpoint decoder
+ * @dpa_range: DPA range this extent covers within the decoder
+ * @tag: Tag from device for this extent
+ */
+struct cxled_extent {
+	struct cxl_endpoint_decoder *cxled;
+	struct range dpa_range;
+	u8 tag[CXL_EXTENT_TAG_LEN];
+};
+
 /**
  * struct cxl_endpoint_decoder - Endpoint  / SPA to DPA decoder
  * @cxld: base cxl_decoder_object
@@ -569,6 +584,7 @@ struct cxl_region_params {
  * @type: Endpoint decoder target type
  * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
  * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
+ * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
  * @flags: Region state flags
  * @params: active + config params for the region
  * @coord: QoS access coordinates for the region
@@ -582,6 +598,7 @@ struct cxl_region {
 	enum cxl_decoder_type type;
 	struct cxl_nvdimm_bridge *cxl_nvb;
 	struct cxl_pmem_region *cxlr_pmem;
+	struct cxl_dax_region *cxlr_dax;
 	unsigned long flags;
 	struct cxl_region_params params;
 	struct access_coordinate coord[ACCESS_COORDINATE_MAX];
@@ -622,12 +639,45 @@ struct cxl_pmem_region {
 	struct cxl_pmem_region_mapping mapping[];
 };
 
+/* See CXL 3.0 8.2.9.2.1.5 */
+enum dc_event {
+	DCD_ADD_CAPACITY,
+	DCD_RELEASE_CAPACITY,
+	DCD_FORCED_CAPACITY_RELEASE,
+	DCD_REGION_CONFIGURATION_UPDATED,
+};
+
 struct cxl_dax_region {
 	struct device dev;
 	struct cxl_region *cxlr;
 	struct range hpa_range;
+	struct ida extent_ida;
 };
 
+/**
+ * struct region_extent - CXL DAX region extent
+ * @dev: device representing this extent
+ * @cxlr_dax: back reference to parent region device
+ * @hpa_range: HPA range of this extent
+ * @tag: tag of the extent
+ * @decoder_extents: Endpoint decoder extents which make up this region extent
+ */
+struct region_extent {
+	struct device dev;
+	struct cxl_dax_region *cxlr_dax;
+	struct range hpa_range;
+	uuid_t tag;
+	struct xarray decoder_extents;
+};
+
+bool is_region_extent(struct device *dev);
+static inline struct region_extent *to_region_extent(struct device *dev)
+{
+	if (!is_region_extent(dev))
+		return NULL;
+	return container_of(dev, struct region_extent, dev);
+}
+
 /**
  * struct cxl_port - logical collection of upstream port devices and
  *		     downstream port devices to construct a CXL memory
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index d41bec5433db..3a40fe1f0be7 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -497,6 +497,7 @@ struct cxl_dc_region_info {
  * @pmem_perf: performance data entry matched to PMEM partition
  * @nr_dc_region: number of DC regions implemented in the memory device
  * @dc_region: array containing info about the DC regions
+ * @pending_extents: array of extents pending during more bit processing
  * @event: event log driver state
  * @poison: poison driver state info
  * @security: security driver state info
@@ -532,6 +533,7 @@ struct cxl_memdev_state {
 
 	u8 nr_dc_region;
 	struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
+	struct xarray pending_extents;
 
 	struct cxl_event_state event;
 	struct cxl_poison_state poison;
@@ -607,6 +609,21 @@ enum cxl_opcode {
 	UUID_INIT(0x5e1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19,     \
 		  0x40, 0x3d, 0x86)
 
+/*
+ * Add Dynamic Capacity Response
+ * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
+ */
+struct cxl_mbox_dc_response {
+	__le32 extent_list_size;
+	u8 flags;
+	u8 reserved[3];
+	struct updated_extent_list {
+		__le64 dpa_start;
+		__le64 length;
+		u8 reserved[8];
+	} __packed extent_list[];
+} __packed;
+
 struct cxl_mbox_get_supported_logs {
 	__le16 entries;
 	u8 rsvd[6];
@@ -669,6 +686,14 @@ struct cxl_mbox_identify {
 	UUID_INIT(0xfe927475, 0xdd59, 0x4339, 0xa5, 0x86, 0x79, 0xba, 0xb1, \
 		  0x13, 0xb7, 0x74)
 
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.1 section 8.2.9.2.1; Table 8-43
+ */
+#define CXL_EVENT_DC_EVENT_UUID                                             \
+	UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \
+		  0x10, 0x1a, 0x2a)
+
 /*
  * Get Event Records output payload
  * CXL rev 3.0 section 8.2.9.2.2; Table 8-50
@@ -694,6 +719,7 @@ enum cxl_event_log_type {
 	CXL_EVENT_TYPE_WARN,
 	CXL_EVENT_TYPE_FAIL,
 	CXL_EVENT_TYPE_FATAL,
+	CXL_EVENT_TYPE_DCD,
 	CXL_EVENT_TYPE_MAX
 };
 
diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h
index 0bea1afbd747..eeda8059d81a 100644
--- a/include/linux/cxl-event.h
+++ b/include/linux/cxl-event.h
@@ -96,11 +96,43 @@ struct cxl_event_mem_module {
 	u8 reserved[0x3d];
 } __packed;
 
+/*
+ * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
+ */
+#define CXL_EXTENT_TAG_LEN 0x10
+struct cxl_extent {
+	__le64 start_dpa;
+	__le64 length;
+	u8 tag[CXL_EXTENT_TAG_LEN];
+	__le16 shared_extn_seq;
+	u8 reserved[0x6];
+} __packed;
+
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
+ */
+#define CXL_DCD_EVENT_MORE			BIT(0)
+struct cxl_event_dcd {
+	struct cxl_event_record_hdr hdr;
+	u8 event_type;
+	u8 validity_flags;
+	__le16 host_id;
+	u8 region_index;
+	u8 flags;
+	u8 reserved1[0x2];
+	struct cxl_extent extent;
+	u8 reserved2[0x18];
+	__le32 num_avail_extents;
+	__le32 num_avail_tags;
+} __packed;
+
 union cxl_event {
 	struct cxl_event_generic generic;
 	struct cxl_event_gen_media gen_media;
 	struct cxl_event_dram dram;
 	struct cxl_event_mem_module mem_module;
+	struct cxl_event_dcd dcd;
 	/* dram & gen_media event header */
 	struct cxl_event_media_hdr media_hdr;
 } __packed;
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index 030b388800f0..8238588fffdf 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -61,7 +61,8 @@ cxl_core-y += $(CXL_CORE_SRC)/hdm.o
 cxl_core-y += $(CXL_CORE_SRC)/pmu.o
 cxl_core-y += $(CXL_CORE_SRC)/cdat.o
 cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
-cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o
+cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o \
+				 $(CXL_CORE_SRC)/extent.o
 cxl_core-y += config_check.o
 cxl_core-y += cxl_core_test.o
 cxl_core-y += cxl_core_exports.o

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 19/25] cxl/region/extent: Expose region extent information in sysfs
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (17 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents ira.weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-08-19 19:05   ` Dave Jiang
                     ` (2 more replies)
  2024-08-16 14:44 ` [PATCH v3 20/25] dax/bus: Factor out dev dax resize logic Ira Weiny
                   ` (5 subsequent siblings)
  24 siblings, 3 replies; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

From: Navneet Singh <navneet.singh@intel.com>

Extent information can be helpful to the user to coordinate memory usage
with the external orchestrator and FM.

Expose the details of region extents by creating the following
sysfs entries.

        /sys/bus/cxl/devices/dax_regionX/extentX.Y
        /sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
        /sys/bus/cxl/devices/dax_regionX/extentX.Y/length
        /sys/bus/cxl/devices/dax_regionX/extentX.Y/tag

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: split this out]
[Jonathan: add documentation for extent sysfs]
[Jonathan/djbw: s/label/tag]
[Jonathan/djbw: treat tag as uuid]
[djbw: use __ATTRIBUTE_GROUPS]
[djbw: make tag invisible if it is empty]
[djbw/iweiny: use conventional id names for extents; extentX.Y]
---
 Documentation/ABI/testing/sysfs-bus-cxl | 13 ++++++++
 drivers/cxl/core/extent.c               | 58 +++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 3a5ee88e551b..e97e6a73c960 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -599,3 +599,16 @@ Description:
 		See Documentation/ABI/stable/sysfs-devices-node. access0 provides
 		the number to the closest initiator and access1 provides the
 		number to the closest CPU.
+
+What:		/sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
+		/sys/bus/cxl/devices/dax_regionX/extentX.Y/length
+		/sys/bus/cxl/devices/dax_regionX/extentX.Y/tag
+Date:		October, 2024
+KernelVersion:	v6.12
+Contact:	linux-cxl@vger.kernel.org
+Description:
+		(RO) [For Dynamic Capacity regions only]  Extent offset and
+		length within the region.  Users can use the extent information
+		to create DAX devices on specific extents.  This is done by
+		creating and destroying DAX devices in specific sequences and
+		looking at the mappings created.
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 34456594cdc3..d7d526a51e2b 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -6,6 +6,63 @@
 
 #include "core.h"
 
+static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
+			   char *buf)
+{
+	struct region_extent *region_extent = to_region_extent(dev);
+
+	return sysfs_emit(buf, "%#llx\n", region_extent->hpa_range.start);
+}
+static DEVICE_ATTR_RO(offset);
+
+static ssize_t length_show(struct device *dev, struct device_attribute *attr,
+			   char *buf)
+{
+	struct region_extent *region_extent = to_region_extent(dev);
+	u64 length = range_len(&region_extent->hpa_range);
+
+	return sysfs_emit(buf, "%#llx\n", length);
+}
+static DEVICE_ATTR_RO(length);
+
+static ssize_t tag_show(struct device *dev, struct device_attribute *attr,
+			char *buf)
+{
+	struct region_extent *region_extent = to_region_extent(dev);
+
+	return sysfs_emit(buf, "%pUb\n", &region_extent->tag);
+}
+static DEVICE_ATTR_RO(tag);
+
+static struct attribute *region_extent_attrs[] = {
+	&dev_attr_offset.attr,
+	&dev_attr_length.attr,
+	&dev_attr_tag.attr,
+	NULL,
+};
+
+static uuid_t empty_tag = { 0 };
+
+static umode_t region_extent_visible(struct kobject *kobj,
+				     struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct region_extent *region_extent = to_region_extent(dev);
+
+	if (a == &dev_attr_tag.attr &&
+	    uuid_equal(&region_extent->tag, &empty_tag))
+		return 0;
+
+	return a->mode;
+}
+
+static const struct attribute_group region_extent_attribute_group = {
+	.attrs = region_extent_attrs,
+	.is_visible = region_extent_visible,
+};
+
+__ATTRIBUTE_GROUPS(region_extent_attribute);
+
 static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
 				 struct cxled_extent *ed_extent)
 {
@@ -44,6 +101,7 @@ static void region_extent_release(struct device *dev)
 static const struct device_type region_extent_type = {
 	.name = "extent",
 	.release = region_extent_release,
+	.groups = region_extent_attribute_groups,
 };
 
 bool is_region_extent(struct device *dev)

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 20/25] dax/bus: Factor out dev dax resize logic
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (18 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 19/25] cxl/region/extent: Expose region extent information in sysfs ira.weiny
@ 2024-08-16 14:44 ` Ira Weiny
  2024-08-19 22:35   ` Dave Jiang
  2024-08-27 13:26   ` Jonathan Cameron
  2024-08-16 14:44 ` [PATCH v3 21/25] dax/region: Create resources on sparse DAX regions ira.weiny
                   ` (4 subsequent siblings)
  24 siblings, 2 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

Dynamic Capacity regions must limit dev dax resources to those areas
which have extents backing real memory.  Such DAX regions are dubbed
'sparse' regions.  In order to manage where memory is available four
alternatives were considered:

1) Create a single region resource child on region creation which
   reserves the entire region.  Then as extents are added punch holes in
   this reservation.  This requires new resource manipulation to punch
   the holes and still requires an additional iteration over the extent
   areas which may already have existing dev dax resources used.

2) Maintain an ordered xarray of extents which can be queried while
   processing the resize logic.  The issue is that existing region->res
   children may artificially limit the allocation size sent to
   alloc_dev_dax_range().  IE the resource children can't be directly
   used in the resize logic to find where space in the region is.  This
   also poses a problem of managing the available size in 2 places.

3) Maintain a separate resource tree with extents.  This option is the
   same as 2) but with the different data structure.  Most ideally there
   should be a unified representation of the resource tree not two places
   to look for space.

4) Create region resource children for each extent.  Manage the dax dev
   resize logic in the same way as before but use a region child
   (extent) resource as the parents to find space within each extent.

Option 4 can leverage the existing resize algorithm to find space within
the extents.  It manages the available space in a singular resource tree
which is less complicated for finding space.

In preparation for this change, factor out the dev_dax_resize logic.
For static regions use dax_region->res as the parent to find space for
the dax ranges.  Future patches will use the same algorithm with
individual extent resources as the parent.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: Rebase on new DAX region locking]
[iweiny: Reword commit message]
[iweiny: Drop reviews]
---
 drivers/dax/bus.c | 129 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 79 insertions(+), 50 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index d8cb5195a227..975860371d9f 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -844,11 +844,9 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
 	return 0;
 }
 
-static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
-		resource_size_t size)
+static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
+			       u64 start, resource_size_t size)
 {
-	struct dax_region *dax_region = dev_dax->region;
-	struct resource *res = &dax_region->res;
 	struct device *dev = &dev_dax->dev;
 	struct dev_dax_range *ranges;
 	unsigned long pgoff = 0;
@@ -866,14 +864,14 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
 		return 0;
 	}
 
-	alloc = __request_region(res, start, size, dev_name(dev), 0);
+	alloc = __request_region(parent, start, size, dev_name(dev), 0);
 	if (!alloc)
 		return -ENOMEM;
 
 	ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
 			* (dev_dax->nr_range + 1), GFP_KERNEL);
 	if (!ranges) {
-		__release_region(res, alloc->start, resource_size(alloc));
+		__release_region(parent, alloc->start, resource_size(alloc));
 		return -ENOMEM;
 	}
 
@@ -1026,50 +1024,45 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
 	return true;
 }
 
-static ssize_t dev_dax_resize(struct dax_region *dax_region,
-		struct dev_dax *dev_dax, resource_size_t size)
+/**
+ * dev_dax_resize_static - Expand the device into the unused portion of the
+ * region. This may involve adjusting the end of an existing resource, or
+ * allocating a new resource.
+ *
+ * @parent: parent resource to allocate this range in
+ * @dev_dax: DAX device to be expanded
+ * @to_alloc: amount of space to alloc; must be <= space available in @parent
+ *
+ * Return the amount of space allocated or -ERRNO on failure
+ */
+static ssize_t dev_dax_resize_static(struct resource *parent,
+				     struct dev_dax *dev_dax,
+				     resource_size_t to_alloc)
 {
-	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
-	resource_size_t dev_size = dev_dax_size(dev_dax);
-	struct resource *region_res = &dax_region->res;
-	struct device *dev = &dev_dax->dev;
 	struct resource *res, *first;
-	resource_size_t alloc = 0;
 	int rc;
 
-	if (dev->driver)
-		return -EBUSY;
-	if (size == dev_size)
-		return 0;
-	if (size > dev_size && size - dev_size > avail)
-		return -ENOSPC;
-	if (size < dev_size)
-		return dev_dax_shrink(dev_dax, size);
-
-	to_alloc = size - dev_size;
-	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
-			"resize of %pa misaligned\n", &to_alloc))
-		return -ENXIO;
-
-	/*
-	 * Expand the device into the unused portion of the region. This
-	 * may involve adjusting the end of an existing resource, or
-	 * allocating a new resource.
-	 */
-retry:
-	first = region_res->child;
-	if (!first)
-		return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
+	first = parent->child;
+	if (!first) {
+		rc = alloc_dev_dax_range(parent, dev_dax,
+					   parent->start, to_alloc);
+		if (rc)
+			return rc;
+		return to_alloc;
+	}
 
-	rc = -ENOSPC;
 	for (res = first; res; res = res->sibling) {
 		struct resource *next = res->sibling;
+		resource_size_t alloc;
 
 		/* space at the beginning of the region */
-		if (res == first && res->start > dax_region->res.start) {
-			alloc = min(res->start - dax_region->res.start, to_alloc);
-			rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc);
-			break;
+		if (res == first && res->start > parent->start) {
+			alloc = min(res->start - parent->start, to_alloc);
+			rc = alloc_dev_dax_range(parent, dev_dax,
+						 parent->start, alloc);
+			if (rc)
+				return rc;
+			return alloc;
 		}
 
 		alloc = 0;
@@ -1078,21 +1071,55 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
 			alloc = min(next->start - (res->end + 1), to_alloc);
 
 		/* space at the end of the region */
-		if (!alloc && !next && res->end < region_res->end)
-			alloc = min(region_res->end - res->end, to_alloc);
+		if (!alloc && !next && res->end < parent->end)
+			alloc = min(parent->end - res->end, to_alloc);
 
 		if (!alloc)
 			continue;
 
 		if (adjust_ok(dev_dax, res)) {
 			rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
-			break;
+			if (rc)
+				return rc;
+			return alloc;
 		}
-		rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc);
-		break;
+		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
+		if (rc)
+			return rc;
+		return alloc;
 	}
-	if (rc)
-		return rc;
+
+	/* available was already calculated and should never be an issue */
+	dev_WARN_ONCE(&dev_dax->dev, 1, "space not found?");
+	return 0;
+}
+
+static ssize_t dev_dax_resize(struct dax_region *dax_region,
+		struct dev_dax *dev_dax, resource_size_t size)
+{
+	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
+	resource_size_t dev_size = dev_dax_size(dev_dax);
+	struct device *dev = &dev_dax->dev;
+	resource_size_t alloc = 0;
+
+	if (dev->driver)
+		return -EBUSY;
+	if (size == dev_size)
+		return 0;
+	if (size > dev_size && size - dev_size > avail)
+		return -ENOSPC;
+	if (size < dev_size)
+		return dev_dax_shrink(dev_dax, size);
+
+	to_alloc = size - dev_size;
+	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
+			"resize of %pa misaligned\n", &to_alloc))
+		return -ENXIO;
+
+retry:
+	alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+	if (alloc <= 0)
+		return alloc;
 	to_alloc -= alloc;
 	if (to_alloc)
 		goto retry;
@@ -1198,7 +1225,8 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
 
 	to_alloc = range_len(&r);
 	if (alloc_is_aligned(dev_dax, to_alloc))
-		rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
+		rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
+					 to_alloc);
 	up_write(&dax_dev_rwsem);
 	up_write(&dax_region_rwsem);
 
@@ -1466,7 +1494,8 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 	device_initialize(dev);
 	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
 
-	rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size);
+	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
+				 data->size);
 	if (rc)
 		goto err_range;
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 21/25] dax/region: Create resources on sparse DAX regions
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (19 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 20/25] dax/bus: Factor out dev dax resize logic Ira Weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-08-18 11:38   ` Markus Elfring
                     ` (2 more replies)
  2024-08-16 14:44 ` [PATCH v3 22/25] cxl/region: Read existing extents on region creation ira.weiny
                   ` (3 subsequent siblings)
  24 siblings, 3 replies; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

From: Navneet Singh <navneet.singh@intel.com>

DAX regions which map dynamic capacity partitions require that memory be
allowed to come and go.  Recall sparse regions were created for this
purpose.  Now that extents can be realized within DAX regions the DAX
region driver can start tracking sub-resource information.

The tight relationship between DAX region operations and extent
operations require memory changes to be controlled synchronously with
the user of the region.  Synchronize through the dax_region_rwsem and by
having the region driver drive both the region device as well as the
extent sub-devices.

Recall requests to remove extents can happen at any time and that a host
is not obligated to release the memory until it is not being used.  If
an extent is not used allow a release response.

The DAX layer has no need for the details of the CXL memory extent
devices.  Expose extents to the DAX layer as device children of the DAX
region device.  A single callback from the driver aids the DAX layer to
determine if the child device is an extent.  The DAX layer also
registers a devres function to automatically clean up when the device is
removed from the region.

There is a race between extents being surfaced and the dax_cxl driver
being loaded.  The driver must therefore scan for any existing extents
while still under the device lock.

Respond to extent notifications.  Manage the DAX region resource tree
based on the extents lifetime.  Return the status of remove
notifications to lower layers such that it can manage the hardware
appropriately.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: patch reorder]
[iweiny: move hunks from other patches to clarify code changes and
         add/release flows WRT dax regions]
[iweiny: use %par]
[iweiny: clean up variable names]
[iweiny: Simplify sparse_ops]
[Fan: avoid open coding range_len()]
[djbw: s/reg_ext/region_extent]
---
 drivers/cxl/core/extent.c |  76 +++++++++++++--
 drivers/cxl/cxl.h         |   6 ++
 drivers/dax/bus.c         | 243 +++++++++++++++++++++++++++++++++++++++++-----
 drivers/dax/bus.h         |   3 +-
 drivers/dax/cxl.c         |  63 +++++++++++-
 drivers/dax/dax-private.h |  34 +++++++
 drivers/dax/hmem/hmem.c   |   2 +-
 drivers/dax/pmem.c        |   2 +-
 8 files changed, 391 insertions(+), 38 deletions(-)

diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index d7d526a51e2b..103b0bec3a4a 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -271,20 +271,67 @@ static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
 	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
 }
 
+static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
+			      struct region_extent *region_extent)
+{
+	struct cxl_dax_region *cxlr_dax;
+	struct device *dev;
+	int rc = 0;
+
+	cxlr_dax = cxlr->cxlr_dax;
+	dev = &cxlr_dax->dev;
+	dev_dbg(dev, "Trying notify: type %d HPA %par\n",
+		event, &region_extent->hpa_range);
+
+	/*
+	 * NOTE the lack of a driver indicates a notification has failed.  No
+	 * user space coordiantion was possible.
+	 */
+	device_lock(dev);
+	if (dev->driver) {
+		struct cxl_driver *driver = to_cxl_drv(dev->driver);
+		struct cxl_notify_data notify_data = (struct cxl_notify_data) {
+			.event = event,
+			.region_extent = region_extent,
+		};
+
+		if (driver->notify) {
+			dev_dbg(dev, "Notify: type %d HPA %par\n",
+				event, &region_extent->hpa_range);
+			rc = driver->notify(dev, &notify_data);
+		}
+	}
+	device_unlock(dev);
+	return rc;
+}
+
+struct rm_data {
+	struct cxl_region *cxlr;
+	struct range *range;
+};
+
 static int cxlr_rm_extent(struct device *dev, void *data)
 {
 	struct region_extent *region_extent = to_region_extent(dev);
-	struct range *region_hpa_range = data;
+	struct rm_data *rm_data = data;
+	int rc;
 
 	if (!region_extent)
 		return 0;
 
 	/*
-	 * Any extent which 'touches' the released range is removed.
+	 * Any extent which 'touches' the released range is attempted to be
+	 * removed.
 	 */
-	if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) {
+	if (range_overlaps(rm_data->range, &region_extent->hpa_range)) {
+		struct cxl_region *cxlr = rm_data->cxlr;
+
 		dev_dbg(dev, "Remove region extent HPA %par\n",
 			&region_extent->hpa_range);
+		rc = cxlr_notify_extent(cxlr, DCD_RELEASE_CAPACITY, region_extent);
+		if (rc == -EBUSY)
+			return 0;
+		/* Extent not in use or error, remove it */
 		region_rm_extent(region_extent);
 	}
 	return 0;
@@ -312,8 +359,13 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
 
 	calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
 
+	struct rm_data rm_data = {
+		.cxlr = cxlr,
+		.range = &hpa_range,
+	};
+
 	/* Remove region extents which overlap */
-	return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
+	return device_for_each_child(&cxlr->cxlr_dax->dev, &rm_data,
 				     cxlr_rm_extent);
 }
 
@@ -338,8 +390,20 @@ static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
 		return rc;
 	}
 
-	/* device model handles freeing region_extent */
-	return online_region_extent(region_extent);
+	rc = online_region_extent(region_extent);
+	/* device model handled freeing region_extent */
+	if (rc)
+		return rc;
+
+	rc = cxlr_notify_extent(cxlr_dax->cxlr, DCD_ADD_CAPACITY, region_extent);
+	/*
+	 * The region device was breifly live but DAX layer ensures it was not
+	 * used
+	 */
+	if (rc)
+		region_rm_extent(region_extent);
+
+	return rc;
 }
 
 /* Callers are expected to ensure cxled has been attached to a region */
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index c858e3957fd5..9abbfc68c6ad 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -916,10 +916,16 @@ bool is_cxl_region(struct device *dev);
 
 extern struct bus_type cxl_bus_type;
 
+struct cxl_notify_data {
+	enum dc_event event;
+	struct region_extent *region_extent;
+};
+
 struct cxl_driver {
 	const char *name;
 	int (*probe)(struct device *dev);
 	void (*remove)(struct device *dev);
+	int (*notify)(struct device *dev, struct cxl_notify_data *notify_data);
 	struct device_driver drv;
 	int id;
 };
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 975860371d9f..f14b0cfa7edd 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -183,6 +183,83 @@ static bool is_sparse(struct dax_region *dax_region)
 	return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
 }
 
+static void __dax_release_resource(struct dax_resource *dax_resource)
+{
+	struct dax_region *dax_region = dax_resource->region;
+
+	lockdep_assert_held_write(&dax_region_rwsem);
+	dev_dbg(dax_region->dev, "Extent release resource %pr\n",
+		dax_resource->res);
+	if (dax_resource->res)
+		__release_region(&dax_region->res, dax_resource->res->start,
+				 resource_size(dax_resource->res));
+	dax_resource->res = NULL;
+}
+
+static void dax_release_resource(void *res)
+{
+	struct dax_resource *dax_resource = res;
+
+	guard(rwsem_write)(&dax_region_rwsem);
+	__dax_release_resource(dax_resource);
+	kfree(dax_resource);
+}
+
+int dax_region_add_resource(struct dax_region *dax_region,
+			    struct device *device,
+			    resource_size_t start, resource_size_t length)
+{
+	struct resource *new_resource;
+	int rc;
+
+	struct dax_resource *dax_resource __free(kfree) =
+				kzalloc(sizeof(*dax_resource), GFP_KERNEL);
+	if (!dax_resource)
+		return -ENOMEM;
+
+	guard(rwsem_write)(&dax_region_rwsem);
+
+	dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
+	new_resource = __request_region(&dax_region->res, start, length, "extent", 0);
+	if (!new_resource) {
+		dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
+			&start, &length);
+		return -ENOSPC;
+	}
+
+	dev_dbg(dax_region->dev, "add resource %pr\n", new_resource);
+	dax_resource->region = dax_region;
+	dax_resource->res = new_resource;
+	dev_set_drvdata(device, dax_resource);
+	rc = devm_add_action_or_reset(device, dax_release_resource,
+				      no_free_ptr(dax_resource));
+	/*  On error; ensure driver data is cleared under semaphore */
+	if (rc)
+		dev_set_drvdata(device, NULL);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(dax_region_add_resource);
+
+int dax_region_rm_resource(struct dax_region *dax_region,
+			   struct device *dev)
+{
+	struct dax_resource *dax_resource;
+
+	guard(rwsem_write)(&dax_region_rwsem);
+
+	dax_resource = dev_get_drvdata(dev);
+	if (!dax_resource)
+		return 0;
+
+	if (dax_resource->use_cnt)
+		return -EBUSY;
+
+	/* avoid races with users trying to use the extent */
+	__dax_release_resource(dax_resource);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_region_rm_resource);
+
 bool static_dev_dax(struct dev_dax *dev_dax)
 {
 	return is_static(dev_dax->region);
@@ -296,19 +373,44 @@ static ssize_t region_align_show(struct device *dev,
 static struct device_attribute dev_attr_region_align =
 		__ATTR(align, 0400, region_align_show, NULL);
 
+#define for_each_child_resource(extent, res) \
+	for (res = (extent)->child; res; res = res->sibling)
+
+resource_size_t
+dax_avail_size(struct resource *dax_resource)
+{
+	resource_size_t rc;
+	struct resource *used_res;
+
+	rc = resource_size(dax_resource);
+	for_each_child_resource(dax_resource, used_res)
+		rc -= resource_size(used_res);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(dax_avail_size);
+
 #define for_each_dax_region_resource(dax_region, res) \
 	for (res = (dax_region)->res.child; res; res = res->sibling)
 
 static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
 {
-	resource_size_t size = resource_size(&dax_region->res);
+	resource_size_t size;
 	struct resource *res;
 
 	lockdep_assert_held(&dax_region_rwsem);
 
-	if (is_sparse(dax_region))
-		return 0;
+	if (is_sparse(dax_region)) {
+		/*
+		 * Children of a sparse region represent available space not
+		 * used space.
+		 */
+		size = 0;
+		for_each_dax_region_resource(dax_region, res)
+			size += dax_avail_size(res);
+		return size;
+	}
 
+	size = resource_size(&dax_region->res);
 	for_each_dax_region_resource(dax_region, res)
 		size -= resource_size(res);
 	return size;
@@ -449,15 +551,26 @@ EXPORT_SYMBOL_GPL(kill_dev_dax);
 static void trim_dev_dax_range(struct dev_dax *dev_dax)
 {
 	int i = dev_dax->nr_range - 1;
-	struct range *range = &dev_dax->ranges[i].range;
+	struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+	struct range *range = &dev_range->range;
 	struct dax_region *dax_region = dev_dax->region;
+	struct resource *res = &dax_region->res;
 
 	lockdep_assert_held_write(&dax_region_rwsem);
 	dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
 		(unsigned long long)range->start,
 		(unsigned long long)range->end);
 
-	__release_region(&dax_region->res, range->start, range_len(range));
+	if (dev_range->dax_resource) {
+		res = dev_range->dax_resource->res;
+		dev_dbg(&dev_dax->dev, "Trim sparse extent %pr\n", res);
+	}
+
+	__release_region(res, range->start, range_len(range));
+
+	if (dev_range->dax_resource)
+		dev_range->dax_resource->use_cnt--;
+
 	if (--dev_dax->nr_range == 0) {
 		kfree(dev_dax->ranges);
 		dev_dax->ranges = NULL;
@@ -640,7 +753,7 @@ static void dax_region_unregister(void *region)
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 		struct range *range, int target_node, unsigned int align,
-		unsigned long flags)
+		unsigned long flags, struct dax_sparse_ops *sparse_ops)
 {
 	struct dax_region *dax_region;
 
@@ -658,12 +771,16 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 			|| !IS_ALIGNED(range_len(range), align))
 		return NULL;
 
+	if (!sparse_ops && (flags & IORESOURCE_DAX_SPARSE_CAP))
+		return NULL;
+
 	dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
 	if (!dax_region)
 		return NULL;
 
 	dev_set_drvdata(parent, dax_region);
 	kref_init(&dax_region->kref);
+	dax_region->sparse_ops = sparse_ops;
 	dax_region->id = region_id;
 	dax_region->align = align;
 	dax_region->dev = parent;
@@ -845,7 +962,8 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
 }
 
 static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
-			       u64 start, resource_size_t size)
+			       u64 start, resource_size_t size,
+			       struct dax_resource *dax_resource)
 {
 	struct device *dev = &dev_dax->dev;
 	struct dev_dax_range *ranges;
@@ -884,6 +1002,7 @@ static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
 			.start = alloc->start,
 			.end = alloc->end,
 		},
+		.dax_resource = dax_resource,
 	};
 
 	dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
@@ -966,7 +1085,8 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
 	int i;
 
 	for (i = dev_dax->nr_range - 1; i >= 0; i--) {
-		struct range *range = &dev_dax->ranges[i].range;
+		struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+		struct range *range = &dev_range->range;
 		struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
 		struct resource *adjust = NULL, *res;
 		resource_size_t shrink;
@@ -982,12 +1102,21 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
 			continue;
 		}
 
-		for_each_dax_region_resource(dax_region, res)
-			if (strcmp(res->name, dev_name(dev)) == 0
-					&& res->start == range->start) {
-				adjust = res;
-				break;
-			}
+		if (dev_range->dax_resource) {
+			for_each_child_resource(dev_range->dax_resource->res, res)
+				if (strcmp(res->name, dev_name(dev)) == 0
+						&& res->start == range->start) {
+					adjust = res;
+					break;
+				}
+		} else {
+			for_each_dax_region_resource(dax_region, res)
+				if (strcmp(res->name, dev_name(dev)) == 0
+						&& res->start == range->start) {
+					adjust = res;
+					break;
+				}
+		}
 
 		if (dev_WARN_ONCE(dev, !adjust || i != dev_dax->nr_range - 1,
 					"failed to find matching resource\n"))
@@ -1025,19 +1154,21 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
 }
 
 /**
- * dev_dax_resize_static - Expand the device into the unused portion of the
- * region. This may involve adjusting the end of an existing resource, or
- * allocating a new resource.
+ * __dev_dax_resize - Expand the device into the unused portion of the region.
+ * This may involve adjusting the end of an existing resource, or allocating a
+ * new resource.
  *
  * @parent: parent resource to allocate this range in
  * @dev_dax: DAX device to be expanded
  * @to_alloc: amount of space to alloc; must be <= space available in @parent
+ * @dax_resource: if sparse; the parent resource
  *
  * Return the amount of space allocated or -ERRNO on failure
  */
-static ssize_t dev_dax_resize_static(struct resource *parent,
-				     struct dev_dax *dev_dax,
-				     resource_size_t to_alloc)
+static ssize_t __dev_dax_resize(struct resource *parent,
+				struct dev_dax *dev_dax,
+				resource_size_t to_alloc,
+				struct dax_resource *dax_resource)
 {
 	struct resource *res, *first;
 	int rc;
@@ -1045,7 +1176,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
 	first = parent->child;
 	if (!first) {
 		rc = alloc_dev_dax_range(parent, dev_dax,
-					   parent->start, to_alloc);
+					   parent->start, to_alloc,
+					   dax_resource);
 		if (rc)
 			return rc;
 		return to_alloc;
@@ -1059,7 +1191,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
 		if (res == first && res->start > parent->start) {
 			alloc = min(res->start - parent->start, to_alloc);
 			rc = alloc_dev_dax_range(parent, dev_dax,
-						 parent->start, alloc);
+						 parent->start, alloc,
+						 dax_resource);
 			if (rc)
 				return rc;
 			return alloc;
@@ -1083,7 +1216,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
 				return rc;
 			return alloc;
 		}
-		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
+		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc,
+					 dax_resource);
 		if (rc)
 			return rc;
 		return alloc;
@@ -1094,6 +1228,54 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
 	return 0;
 }
 
+static ssize_t dev_dax_resize_static(struct dax_region *dax_region,
+				     struct dev_dax *dev_dax,
+				     resource_size_t to_alloc)
+{
+	return __dev_dax_resize(&dax_region->res, dev_dax, to_alloc, NULL);
+}
+
+static int find_free_extent(struct device *dev, void *data)
+{
+	struct dax_region *dax_region = data;
+	struct dax_resource *dax_resource;
+
+	if (!dax_region->sparse_ops->is_extent(dev))
+		return 0;
+
+	dax_resource = dev_get_drvdata(dev);
+	if (!dax_resource || !dax_avail_size(dax_resource->res))
+		return 0;
+	return 1;
+}
+
+static ssize_t dev_dax_resize_sparse(struct dax_region *dax_region,
+				     struct dev_dax *dev_dax,
+				     resource_size_t to_alloc)
+{
+	struct dax_resource *dax_resource;
+	resource_size_t available_size;
+	struct device *extent_dev;
+	ssize_t alloc;
+
+	extent_dev = device_find_child(dax_region->dev, dax_region,
+				       find_free_extent);
+	if (!extent_dev)
+		return 0;
+
+	dax_resource = dev_get_drvdata(extent_dev);
+	if (!dax_resource)
+		return 0;
+
+	available_size = dax_avail_size(dax_resource->res);
+	to_alloc = min(available_size, to_alloc);
+	alloc = __dev_dax_resize(dax_resource->res, dev_dax, to_alloc, dax_resource);
+	if (alloc > 0)
+		dax_resource->use_cnt++;
+	put_device(extent_dev);
+	return alloc;
+}
+
 static ssize_t dev_dax_resize(struct dax_region *dax_region,
 		struct dev_dax *dev_dax, resource_size_t size)
 {
@@ -1117,7 +1299,10 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
 		return -ENXIO;
 
 retry:
-	alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+	if (is_sparse(dax_region))
+		alloc = dev_dax_resize_sparse(dax_region, dev_dax, to_alloc);
+	else
+		alloc = dev_dax_resize_static(dax_region, dev_dax, to_alloc);
 	if (alloc <= 0)
 		return alloc;
 	to_alloc -= alloc;
@@ -1226,7 +1411,7 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
 	to_alloc = range_len(&r);
 	if (alloc_is_aligned(dev_dax, to_alloc))
 		rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
-					 to_alloc);
+					 to_alloc, NULL);
 	up_write(&dax_dev_rwsem);
 	up_write(&dax_region_rwsem);
 
@@ -1494,8 +1679,14 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 	device_initialize(dev);
 	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
 
+	if (is_sparse(dax_region) && data->size) {
+		dev_err(parent, "Sparse DAX region devices are created initially with 0 size");
+		rc = -EINVAL;
+		goto err_id;
+	}
+
 	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
-				 data->size);
+				 data->size, NULL);
 	if (rc)
 		goto err_range;
 
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 783bfeef42cc..ae5029ea6047 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -9,6 +9,7 @@ struct dev_dax;
 struct resource;
 struct dax_device;
 struct dax_region;
+struct dax_sparse_ops;
 
 /* dax bus specific ioresource flags */
 #define IORESOURCE_DAX_STATIC BIT(0)
@@ -17,7 +18,7 @@ struct dax_region;
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 		struct range *range, int target_node, unsigned int align,
-		unsigned long flags);
+		unsigned long flags, struct dax_sparse_ops *sparse_ops);
 
 struct dev_dax_data {
 	struct dax_region *dax_region;
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 367e86b1c22a..bf3b82b0120d 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -5,6 +5,60 @@
 
 #include "../cxl/cxl.h"
 #include "bus.h"
+#include "dax-private.h"
+
+static int __cxl_dax_add_resource(struct dax_region *dax_region,
+				  struct region_extent *region_extent)
+{
+	resource_size_t start, length;
+	struct device *dev;
+
+	dev = &region_extent->dev;
+	start = dax_region->res.start + region_extent->hpa_range.start;
+	length = range_len(&region_extent->hpa_range);
+	return dax_region_add_resource(dax_region, dev, start, length);
+}
+
+static int cxl_dax_add_resource(struct device *dev, void *data)
+{
+	struct dax_region *dax_region = data;
+	struct region_extent *region_extent;
+
+	region_extent = to_region_extent(dev);
+	if (!region_extent)
+		return 0;
+
+	dev_dbg(dax_region->dev, "Adding resource HPA %par\n",
+		&region_extent->hpa_range);
+
+	return __cxl_dax_add_resource(dax_region, region_extent);
+}
+
+static int cxl_dax_region_notify(struct device *dev,
+				 struct cxl_notify_data *notify_data)
+{
+	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+	struct dax_region *dax_region = dev_get_drvdata(dev);
+	struct region_extent *region_extent = notify_data->region_extent;
+
+	switch (notify_data->event) {
+	case DCD_ADD_CAPACITY:
+		return __cxl_dax_add_resource(dax_region, region_extent);
+	case DCD_RELEASE_CAPACITY:
+		return dax_region_rm_resource(dax_region, &region_extent->dev);
+	case DCD_FORCED_CAPACITY_RELEASE:
+	default:
+		dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
+			notify_data->event);
+		break;
+	}
+
+	return -ENXIO;
+}
+
+struct dax_sparse_ops sparse_ops = {
+	.is_extent = is_region_extent,
+};
 
 static int cxl_dax_region_probe(struct device *dev)
 {
@@ -24,14 +78,16 @@ static int cxl_dax_region_probe(struct device *dev)
 		flags |= IORESOURCE_DAX_SPARSE_CAP;
 
 	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
-				      PMD_SIZE, flags);
+				      PMD_SIZE, flags, &sparse_ops);
 	if (!dax_region)
 		return -ENOMEM;
 
-	if (cxlr->mode == CXL_REGION_DC)
+	if (cxlr->mode == CXL_REGION_DC) {
+		device_for_each_child(&cxlr_dax->dev, dax_region,
+				      cxl_dax_add_resource);
 		/* Add empty seed dax device */
 		dev_size = 0;
-	else
+	} else
 		dev_size = range_len(&cxlr_dax->hpa_range);
 
 	data = (struct dev_dax_data) {
@@ -47,6 +103,7 @@ static int cxl_dax_region_probe(struct device *dev)
 static struct cxl_driver cxl_dax_region_driver = {
 	.name = "cxl_dax_region",
 	.probe = cxl_dax_region_probe,
+	.notify = cxl_dax_region_notify,
 	.id = CXL_DEVICE_DAX_REGION,
 	.drv = {
 		.suppress_bind_attrs = true,
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index ccde98c3d4e2..9e9f98c85620 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -16,6 +16,36 @@ struct inode *dax_inode(struct dax_device *dax_dev);
 int dax_bus_init(void);
 void dax_bus_exit(void);
 
+/**
+ * struct dax_resource - For sparse regions; an active resource
+ * @region: dax_region this resources is in
+ * @res: resource
+ * @use_cnt: count the number of uses of this resource
+ *
+ * Changes to the dax_reigon and the dax_resources within it are protected by
+ * dax_region_rwsem
+ */
+struct dax_resource {
+	struct dax_region *region;
+	struct resource *res;
+	unsigned int use_cnt;
+};
+int dax_region_add_resource(struct dax_region *dax_region, struct device *dev,
+			    resource_size_t start, resource_size_t length);
+int dax_region_rm_resource(struct dax_region *dax_region,
+			   struct device *dev);
+resource_size_t dax_avail_size(struct resource *dax_resource);
+
+typedef int (*match_cb)(struct device *dev, resource_size_t *size_avail);
+
+/**
+ * struct dax_sparse_ops - Operations for sparse regions
+ * @is_extent: return if the device is an extent
+ */
+struct dax_sparse_ops {
+	bool (*is_extent)(struct device *dev);
+};
+
 /**
  * struct dax_region - mapping infrastructure for dax devices
  * @id: kernel-wide unique region for a memory range
@@ -27,6 +57,7 @@ void dax_bus_exit(void);
  * @res: resource tree to track instance allocations
  * @seed: allow userspace to find the first unbound seed device
  * @youngest: allow userspace to find the most recently created device
+ * @sparse_ops: operations required for sparse regions
  */
 struct dax_region {
 	int id;
@@ -38,6 +69,7 @@ struct dax_region {
 	struct resource res;
 	struct device *seed;
 	struct device *youngest;
+	struct dax_sparse_ops *sparse_ops;
 };
 
 struct dax_mapping {
@@ -62,6 +94,7 @@ struct dax_mapping {
  * @pgoff: page offset
  * @range: resource-span
  * @mapping: device to assist in interrogating the range layout
+ * @dax_resource: if not NULL; dax sparse resource containing this range
  */
 struct dev_dax {
 	struct dax_region *region;
@@ -79,6 +112,7 @@ struct dev_dax {
 		unsigned long pgoff;
 		struct range range;
 		struct dax_mapping *mapping;
+		struct dax_resource *dax_resource;
 	} *ranges;
 };
 
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 5e7c53f18491..0eea65052874 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -28,7 +28,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
 
 	mri = dev->platform_data;
 	dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
-				      mri->target_node, PMD_SIZE, flags);
+				      mri->target_node, PMD_SIZE, flags, NULL);
 	if (!dax_region)
 		return -ENOMEM;
 
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index c8ebf4e281f2..f927e855f240 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -54,7 +54,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
 	range.start += offset;
 	dax_region = alloc_dax_region(dev, region_id, &range,
 			nd_region->target_node, le32_to_cpu(pfn_sb->align),
-			IORESOURCE_DAX_STATIC);
+			IORESOURCE_DAX_STATIC, NULL);
 	if (!dax_region)
 		return ERR_PTR(-ENOMEM);
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 22/25] cxl/region: Read existing extents on region creation
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (20 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 21/25] dax/region: Create resources on sparse DAX regions ira.weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-08-20  0:06   ` Dave Jiang
                     ` (2 more replies)
  2024-08-16 14:44 ` [PATCH v3 23/25] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
                   ` (2 subsequent siblings)
  24 siblings, 3 replies; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

From: Navneet Singh <navneet.singh@intel.com>

Dynamic capacity device extents may be left in an accepted state on a
device due to an unexpected host crash.  In this case it is expected
that the creation of a new region on top of a DC partition can read
those extents and surface them for continued use.

Once all endpoint decoders are part of a region and the region is being
realized a read of the devices extent list can reveal these previously
accepted extents.

CXL r3.1 specifies the mailbox call Get Dynamic Capacity Extent List for
this purpose.  The call returns all the extents for all dynamic capacity
partitions.  If the fabric manager is adding extents to any DCD
partition, the extent list for the recovered region may change.  In this
case the query must retry.  Upon retry the query could encounter extents
which were accepted on a previous list query.  Adding such extents is
ignored without error because they are entirely within a previous
accepted extent.

The scan for existing extents races with the dax_cxl driver.  This is
synchronized through the region device lock.  Extents which are found
after the driver has loaded will surface through the normal notification
path while extents seen prior to the driver are read during driver load.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: Leverage the new add path from the event processing code such
	 that the adding and surfacing of extents flows through the same
	 code path for both event processing and existing extents.
	 While this does validate existing extents again on start up
	 this is an error recovery case / new boot scenario and should
	 not cause any major issues while making the code more
	 straight forward and maintainable.]

[iweiny: use %par]
[iweiny: rebase]
[iweiny: Move this patch later in the series such that the realization
         of extents can go through the same path as an add event]
[Fan: Issue a retry if the gen number changes]
[djiang: s/uint64_t/u64/]
[djiang: update function names]
[Jørgen/djbw: read the generation and total count on first iteration of
              the Get Extent List call]
[djbw: s/cxl_mbox_get_dc_extent_in/cxl_mbox_get_extent_in/]
[djbw: s/cxl_mbox_get_dc_extent_out/cxl_mbox_get_extent_out/]
[djbw/iweiny: s/cxl_read_dc_extents/cxl_read_extent_list]
---
 drivers/cxl/core/core.h   |   2 +
 drivers/cxl/core/mbox.c   | 100 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/cxl/core/region.c |  12 ++++++
 drivers/cxl/cxlmem.h      |  21 ++++++++++
 4 files changed, 135 insertions(+)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 8dfc97b2e0a4..9e54064a6f48 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -21,6 +21,8 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
 	return container_of(cxlds, struct cxl_memdev_state, cxlds);
 }
 
+void cxl_read_extent_list(struct cxl_endpoint_decoder *cxled);
+
 #ifdef CONFIG_CXL_REGION
 extern struct device_attribute dev_attr_create_pmem_region;
 extern struct device_attribute dev_attr_create_ram_region;
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index f629ad7488ac..d43ac8eabf56 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1670,6 +1670,106 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
 
+/* Return -EAGAIN if the extent list changes while reading */
+static int __cxl_read_extent_list(struct cxl_endpoint_decoder *cxled)
+{
+	u32 current_index, total_read, total_expected, initial_gen_num;
+	struct cxl_memdev_state *mds = cxled_to_mds(cxled);
+	struct device *dev = mds->cxlds.dev;
+	struct cxl_mbox_cmd mbox_cmd;
+	u32 max_extent_count;
+	bool first = true;
+
+	struct cxl_mbox_get_extent_out *extents __free(kfree) =
+				kvmalloc(mds->payload_size, GFP_KERNEL);
+	if (!extents)
+		return -ENOMEM;
+
+	total_read = 0;
+	current_index = 0;
+	total_expected = 0;
+	max_extent_count = (mds->payload_size - sizeof(*extents)) /
+				sizeof(struct cxl_extent);
+	do {
+		struct cxl_mbox_get_extent_in get_extent;
+		u32 nr_returned, current_total, current_gen_num;
+		int rc;
+
+		get_extent = (struct cxl_mbox_get_extent_in) {
+			.extent_cnt = max(max_extent_count,
+					  total_expected - current_index),
+			.start_extent_index = cpu_to_le32(current_index),
+		};
+
+		mbox_cmd = (struct cxl_mbox_cmd) {
+			.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
+			.payload_in = &get_extent,
+			.size_in = sizeof(get_extent),
+			.size_out = mds->payload_size,
+			.payload_out = extents,
+			.min_out = 1,
+		};
+
+		rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+		if (rc < 0)
+			return rc;
+
+		/* Save initial data */
+		if (first) {
+			total_expected = le32_to_cpu(extents->total_extent_count);
+			initial_gen_num = le32_to_cpu(extents->generation_num);
+			first = false;
+		}
+
+		nr_returned = le32_to_cpu(extents->returned_extent_count);
+		total_read += nr_returned;
+		current_total = le32_to_cpu(extents->total_extent_count);
+		current_gen_num = le32_to_cpu(extents->generation_num);
+
+		dev_dbg(dev, "Got extent list %d-%d of %d generation Num:%d\n",
+			current_index, total_read - 1, current_total, current_gen_num);
+
+		if (current_gen_num != initial_gen_num || total_expected != current_total) {
+			dev_dbg(dev, "Extent list change detected; gen %u != %u : cnt %u != %u\n",
+				current_gen_num, initial_gen_num,
+				total_expected, current_total);
+			return -EAGAIN;
+		}
+
+		for (int i = 0; i < nr_returned ; i++) {
+			struct cxl_extent *extent = &extents->extent[i];
+
+			dev_dbg(dev, "Processing extent %d/%d\n",
+				current_index + i, total_expected);
+
+			rc = validate_add_extent(mds, extent);
+			if (rc)
+				continue;
+		}
+
+		current_index += nr_returned;
+	} while (total_expected > total_read);
+
+	return 0;
+}
+
+/**
+ * cxl_read_extent_list() - Read existing extents
+ * @cxled: Endpoint decoder which is part of a region
+ *
+ * Issue the Get Dynamic Capacity Extent List command to the device
+ * and add existing extents if found.
+ */
+void cxl_read_extent_list(struct cxl_endpoint_decoder *cxled)
+{
+	int retry = 10;
+	int rc;
+
+	do {
+		rc = __cxl_read_extent_list(cxled);
+	} while (rc == -EAGAIN && retry--);
+}
+
 static int add_dpa_res(struct device *dev, struct resource *parent,
 		       struct resource *res, resource_size_t start,
 		       resource_size_t size, const char *type)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 8c9171f914fb..885fb3004784 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3190,6 +3190,15 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
 	return rc;
 }
 
+static void cxlr_add_existing_extents(struct cxl_region *cxlr)
+{
+	struct cxl_region_params *p = &cxlr->params;
+	int i;
+
+	for (i = 0; i < p->nr_targets; i++)
+		cxl_read_extent_list(p->targets[i]);
+}
+
 static void cxlr_dax_unregister(void *_cxlr_dax)
 {
 	struct cxl_dax_region *cxlr_dax = _cxlr_dax;
@@ -3227,6 +3236,9 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
 	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
 		dev_name(dev));
 
+	if (cxlr->mode == CXL_REGION_DC)
+		cxlr_add_existing_extents(cxlr);
+
 	return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
 					cxlr_dax);
 err:
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 3a40fe1f0be7..11c03637488d 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -624,6 +624,27 @@ struct cxl_mbox_dc_response {
 	} __packed extent_list[];
 } __packed;
 
+/*
+ * Get Dynamic Capacity Extent List; Input Payload
+ * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
+ */
+struct cxl_mbox_get_extent_in {
+	__le32 extent_cnt;
+	__le32 start_extent_index;
+} __packed;
+
+/*
+ * Get Dynamic Capacity Extent List; Output Payload
+ * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
+ */
+struct cxl_mbox_get_extent_out {
+	__le32 returned_extent_count;
+	__le32 total_extent_count;
+	__le32 generation_num;
+	u8 rsvd[4];
+	struct cxl_extent extent[];
+} __packed;
+
 struct cxl_mbox_get_supported_logs {
 	__le16 entries;
 	u8 rsvd[6];

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 23/25] cxl/mem: Trace Dynamic capacity Event Record
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (21 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 22/25] cxl/region: Read existing extents on region creation ira.weiny
@ 2024-08-16 14:44 ` ira.weiny
  2024-08-20 22:54   ` Dave Jiang
                     ` (2 more replies)
  2024-08-16 14:44 ` [PATCH v3 24/25] tools/testing/cxl: Make event logs dynamic Ira Weiny
  2024-08-16 14:44 ` [PATCH v3 25/25] tools/testing/cxl: Add DC Regions to mock mem data Ira Weiny
  24 siblings, 3 replies; 120+ messages in thread
From: ira.weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

From: Navneet Singh <navneet.singh@intel.com>

CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Records.
User space can use trace events for debugging of DC capacity changes.

Add DC trace points to the trace log.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[Alison: Update commit message]
---
 drivers/cxl/core/mbox.c  |  4 +++
 drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 69 insertions(+)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index d43ac8eabf56..8202fc6c111d 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -977,6 +977,10 @@ static void __cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 		ev_type = CXL_CPER_EVENT_DRAM;
 	else if (uuid_equal(uuid, &CXL_EVENT_MEM_MODULE_UUID))
 		ev_type = CXL_CPER_EVENT_MEM_MODULE;
+	else if (uuid_equal(uuid, &CXL_EVENT_DC_EVENT_UUID)) {
+		trace_cxl_dynamic_capacity(cxlmd, type, &record->event.dcd);
+		return;
+	}
 
 	cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
 }
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index 9167cfba7f59..a3a5269311ee 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -731,6 +731,71 @@ TRACE_EVENT(cxl_poison,
 	)
 );
 
+/*
+ * DYNAMIC CAPACITY Event Record - DER
+ *
+ * CXL rev 3.0 section 8.2.9.2.1.5 Table 8-47
+ */
+
+#define CXL_DC_ADD_CAPACITY			0x00
+#define CXL_DC_REL_CAPACITY			0x01
+#define CXL_DC_FORCED_REL_CAPACITY		0x02
+#define CXL_DC_REG_CONF_UPDATED			0x03
+#define show_dc_evt_type(type)	__print_symbolic(type,		\
+	{ CXL_DC_ADD_CAPACITY,	"Add capacity"},		\
+	{ CXL_DC_REL_CAPACITY,	"Release capacity"},		\
+	{ CXL_DC_FORCED_REL_CAPACITY,	"Forced capacity release"},	\
+	{ CXL_DC_REG_CONF_UPDATED,	"Region Configuration Updated"	} \
+)
+
+TRACE_EVENT(cxl_dynamic_capacity,
+
+	TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
+		 struct cxl_event_dcd *rec),
+
+	TP_ARGS(cxlmd, log, rec),
+
+	TP_STRUCT__entry(
+		CXL_EVT_TP_entry
+
+		/* Dynamic capacity Event */
+		__field(u8, event_type)
+		__field(u16, hostid)
+		__field(u8, region_id)
+		__field(u64, dpa_start)
+		__field(u64, length)
+		__array(u8, tag, CXL_EXTENT_TAG_LEN)
+		__field(u16, sh_extent_seq)
+	),
+
+	TP_fast_assign(
+		CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
+
+		/* Dynamic_capacity Event */
+		__entry->event_type = rec->event_type;
+
+		/* DCD event record data */
+		__entry->hostid = le16_to_cpu(rec->host_id);
+		__entry->region_id = rec->region_index;
+		__entry->dpa_start = le64_to_cpu(rec->extent.start_dpa);
+		__entry->length = le64_to_cpu(rec->extent.length);
+		memcpy(__entry->tag, &rec->extent.tag, CXL_EXTENT_TAG_LEN);
+		__entry->sh_extent_seq = le16_to_cpu(rec->extent.shared_extn_seq);
+	),
+
+	CXL_EVT_TP_printk("event_type='%s' host_id='%d' region_id='%d' " \
+		"starting_dpa=%llx length=%llx tag=%s " \
+		"shared_extent_sequence=%d",
+		show_dc_evt_type(__entry->event_type),
+		__entry->hostid,
+		__entry->region_id,
+		__entry->dpa_start,
+		__entry->length,
+		__print_hex(__entry->tag, CXL_EXTENT_TAG_LEN),
+		__entry->sh_extent_seq
+	)
+);
+
 #endif /* _CXL_EVENTS_H */
 
 #define TRACE_INCLUDE_FILE trace

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 24/25] tools/testing/cxl: Make event logs dynamic
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (22 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 23/25] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
@ 2024-08-16 14:44 ` Ira Weiny
  2024-08-20 23:30   ` Dave Jiang
  2024-08-27 14:32   ` Jonathan Cameron
  2024-08-16 14:44 ` [PATCH v3 25/25] tools/testing/cxl: Add DC Regions to mock mem data Ira Weiny
  24 siblings, 2 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

The test event logs were created as static arrays as an easy way to mock
events.  Dynamic Capacity Device (DCD) test support requires events be
generated dynamically when extents are created or destroyed.

Modify the event log storage to be dynamically allocated.  Reuse the
static event data to create the dynamic events in the new logs without
inventing complex event injection for the previous tests.  Simplify the
processing of the logs by using the event log array index as the handle.
Add a lock to manage concurrency required when user space is allowed to
control DCD extents

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes:
[iweiny: rebase]
---
 tools/testing/cxl/test/mem.c | 278 ++++++++++++++++++++++++++-----------------
 1 file changed, 171 insertions(+), 107 deletions(-)

diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index 129f179b0ac5..674fc7f086cd 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -125,18 +125,27 @@ static struct {
 
 #define PASS_TRY_LIMIT 3
 
-#define CXL_TEST_EVENT_CNT_MAX 15
+#define CXL_TEST_EVENT_CNT_MAX 17
 
 /* Set a number of events to return at a time for simulation.  */
 #define CXL_TEST_EVENT_RET_MAX 4
 
+/*
+ * @next_handle: next handle (index) to be stored to
+ * @cur_handle: current handle (index) to be returned to the user on get_event
+ * @nr_events: total events in this log
+ * @nr_overflow: number of events added past the log size
+ * @lock: protect these state variables
+ * @events: array of pending events to be returned.
+ */
 struct mock_event_log {
-	u16 clear_idx;
-	u16 cur_idx;
+	u16 next_handle;
+	u16 cur_handle;
 	u16 nr_events;
 	u16 nr_overflow;
-	u16 overflow_reset;
-	struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX];
+	rwlock_t lock;
+	/* 1 extra slot to accommodate that handles can't be 0 */
+	struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX + 1];
 };
 
 struct mock_event_store {
@@ -171,56 +180,68 @@ static struct mock_event_log *event_find_log(struct device *dev, int log_type)
 	return &mdata->mes.mock_logs[log_type];
 }
 
-static struct cxl_event_record_raw *event_get_current(struct mock_event_log *log)
-{
-	return log->events[log->cur_idx];
-}
-
-static void event_reset_log(struct mock_event_log *log)
-{
-	log->cur_idx = 0;
-	log->clear_idx = 0;
-	log->nr_overflow = log->overflow_reset;
-}
-
 /* Handle can never be 0 use 1 based indexing for handle */
-static u16 event_get_clear_handle(struct mock_event_log *log)
+static void event_inc_handle(u16 *handle)
 {
-	return log->clear_idx + 1;
+	*handle = (*handle + 1) % CXL_TEST_EVENT_CNT_MAX;
+	if (!*handle)
+		*handle = *handle + 1;
 }
 
-/* Handle can never be 0 use 1 based indexing for handle */
-static __le16 event_get_cur_event_handle(struct mock_event_log *log)
-{
-	u16 cur_handle = log->cur_idx + 1;
-
-	return cpu_to_le16(cur_handle);
-}
-
-static bool event_log_empty(struct mock_event_log *log)
-{
-	return log->cur_idx == log->nr_events;
-}
-
-static void mes_add_event(struct mock_event_store *mes,
+/* Add the event or free it on 'overflow' */
+static void mes_add_event(struct cxl_mockmem_data *mdata,
 			  enum cxl_event_log_type log_type,
 			  struct cxl_event_record_raw *event)
 {
+	struct device *dev = mdata->mds->cxlds.dev;
 	struct mock_event_log *log;
+	u16 handle;
 
 	if (WARN_ON(log_type >= CXL_EVENT_TYPE_MAX))
 		return;
 
-	log = &mes->mock_logs[log_type];
+	log = &mdata->mes.mock_logs[log_type];
 
-	if ((log->nr_events + 1) > CXL_TEST_EVENT_CNT_MAX) {
+	write_lock(&log->lock);
+
+	handle = log->next_handle;
+	if ((handle + 1) == log->cur_handle) {
 		log->nr_overflow++;
-		log->overflow_reset = log->nr_overflow;
-		return;
+		dev_dbg(dev, "Overflowing %d\n", log_type);
+		devm_kfree(dev, event);
+		goto unlock;
 	}
 
-	log->events[log->nr_events] = event;
+	dev_dbg(dev, "Log %d; handle %u\n", log_type, handle);
+	event->event.generic.hdr.handle = cpu_to_le16(handle);
+	log->events[handle] = event;
+	event_inc_handle(&log->next_handle);
 	log->nr_events++;
+
+unlock:
+	write_unlock(&log->lock);
+}
+
+static void mes_del_event(struct device *dev,
+			  struct mock_event_log *log,
+			  u16 handle)
+{
+	struct cxl_event_record_raw *cur;
+
+	lockdep_assert(lockdep_is_held(&log->lock));
+
+	dev_dbg(dev, "Clearing event %u; cur %u\n", handle, log->cur_handle);
+	cur = log->events[handle];
+	if (!cur) {
+		dev_err(dev, "Mock event index %u empty? nr_events %u",
+			handle, log->nr_events);
+		return;
+	}
+	log->events[handle] = NULL;
+
+	event_inc_handle(&log->cur_handle);
+	log->nr_events--;
+	devm_kfree(dev, cur);
 }
 
 /*
@@ -233,8 +254,8 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 {
 	struct cxl_get_event_payload *pl;
 	struct mock_event_log *log;
-	u16 nr_overflow;
 	u8 log_type;
+	u16 handle;
 	int i;
 
 	if (cmd->size_in != sizeof(log_type))
@@ -254,29 +275,39 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 	memset(cmd->payload_out, 0, struct_size(pl, records, 0));
 
 	log = event_find_log(dev, log_type);
-	if (!log || event_log_empty(log))
+	if (!log)
 		return 0;
 
 	pl = cmd->payload_out;
 
-	for (i = 0; i < ret_limit && !event_log_empty(log); i++) {
-		memcpy(&pl->records[i], event_get_current(log),
-		       sizeof(pl->records[i]));
-		pl->records[i].event.generic.hdr.handle =
-				event_get_cur_event_handle(log);
-		log->cur_idx++;
+	read_lock(&log->lock);
+
+	handle = log->cur_handle;
+	dev_dbg(dev, "Get log %d handle %u next %u\n",
+		log_type, handle, log->next_handle);
+	for (i = 0;
+	     i < ret_limit && handle != log->next_handle;
+	     i++, event_inc_handle(&handle)) {
+		struct cxl_event_record_raw *cur;
+
+		cur = log->events[handle];
+		dev_dbg(dev, "Sending event log %d handle %d idx %u\n",
+			log_type, le16_to_cpu(cur->event.generic.hdr.handle),
+			handle);
+		memcpy(&pl->records[i], cur, sizeof(pl->records[i]));
+		pl->records[i].event.generic.hdr.handle = cpu_to_le16(handle);
 	}
 
 	cmd->size_out = struct_size(pl, records, i);
 	pl->record_count = cpu_to_le16(i);
-	if (!event_log_empty(log))
+	if (log->nr_events > i)
 		pl->flags |= CXL_GET_EVENT_FLAG_MORE_RECORDS;
 
 	if (log->nr_overflow) {
 		u64 ns;
 
 		pl->flags |= CXL_GET_EVENT_FLAG_OVERFLOW;
-		pl->overflow_err_count = cpu_to_le16(nr_overflow);
+		pl->overflow_err_count = cpu_to_le16(log->nr_overflow);
 		ns = ktime_get_real_ns();
 		ns -= 5000000000; /* 5s ago */
 		pl->first_overflow_timestamp = cpu_to_le64(ns);
@@ -285,16 +316,17 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 		pl->last_overflow_timestamp = cpu_to_le64(ns);
 	}
 
+	read_unlock(&log->lock);
 	return 0;
 }
 
 static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 {
 	struct cxl_mbox_clear_event_payload *pl = cmd->payload_in;
-	struct mock_event_log *log;
 	u8 log_type = pl->event_log;
+	struct mock_event_log *log;
+	int nr, rc = 0;
 	u16 handle;
-	int nr;
 
 	if (log_type >= CXL_EVENT_TYPE_MAX)
 		return -EINVAL;
@@ -303,24 +335,23 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 	if (!log)
 		return 0; /* No mock data in this log */
 
-	/*
-	 * This check is technically not invalid per the specification AFAICS.
-	 * (The host could 'guess' handles and clear them in order).
-	 * However, this is not good behavior for the host so test it.
-	 */
-	if (log->clear_idx + pl->nr_recs > log->cur_idx) {
-		dev_err(dev,
-			"Attempting to clear more events than returned!\n");
-		return -EINVAL;
-	}
+	write_lock(&log->lock);
 
 	/* Check handle order prior to clearing events */
-	for (nr = 0, handle = event_get_clear_handle(log);
-	     nr < pl->nr_recs;
-	     nr++, handle++) {
+	handle = log->cur_handle;
+	for (nr = 0;
+	     nr < pl->nr_recs && handle != log->next_handle;
+	     nr++, event_inc_handle(&handle)) {
+
+		dev_dbg(dev, "Checking clear of %d handle %u plhandle %u\n",
+			log_type, handle,
+			le16_to_cpu(pl->handles[nr]));
+
 		if (handle != le16_to_cpu(pl->handles[nr])) {
-			dev_err(dev, "Clearing events out of order\n");
-			return -EINVAL;
+			dev_err(dev, "Clearing events out of order %u %u\n",
+				handle, le16_to_cpu(pl->handles[nr]));
+			rc = -EINVAL;
+			goto unlock;
 		}
 	}
 
@@ -328,25 +359,12 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
 		log->nr_overflow = 0;
 
 	/* Clear events */
-	log->clear_idx += pl->nr_recs;
-	return 0;
-}
-
-static void cxl_mock_event_trigger(struct device *dev)
-{
-	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
-	struct mock_event_store *mes = &mdata->mes;
-	int i;
-
-	for (i = CXL_EVENT_TYPE_INFO; i < CXL_EVENT_TYPE_MAX; i++) {
-		struct mock_event_log *log;
-
-		log = event_find_log(dev, i);
-		if (log)
-			event_reset_log(log);
-	}
+	for (nr = 0; nr < pl->nr_recs; nr++)
+		mes_del_event(dev, log, le16_to_cpu(pl->handles[nr]));
 
-	cxl_mem_get_event_records(mdata->mds, mes->ev_status);
+unlock:
+	write_unlock(&log->lock);
+	return rc;
 }
 
 struct cxl_event_record_raw maint_needed = {
@@ -475,8 +493,27 @@ static int mock_set_timestamp(struct cxl_dev_state *cxlds,
 	return 0;
 }
 
-static void cxl_mock_add_event_logs(struct mock_event_store *mes)
+/* Create a dynamically allocated event out of a statically defined event. */
+static void add_event_from_static(struct cxl_mockmem_data *mdata,
+				  enum cxl_event_log_type log_type,
+				  struct cxl_event_record_raw *raw)
+{
+	struct device *dev = mdata->mds->cxlds.dev;
+	struct cxl_event_record_raw *rec;
+
+	rec = devm_kmemdup(dev, raw, sizeof(*rec), GFP_KERNEL);
+	if (!rec) {
+		dev_err(dev, "Failed to alloc event for log\n");
+		return;
+	}
+	mes_add_event(mdata, log_type, rec);
+}
+
+static void cxl_mock_add_event_logs(struct cxl_mockmem_data *mdata)
 {
+	struct mock_event_store *mes = &mdata->mes;
+	struct device *dev = mdata->mds->cxlds.dev;
+
 	put_unaligned_le16(CXL_GMER_VALID_CHANNEL | CXL_GMER_VALID_RANK,
 			   &gen_media.rec.media_hdr.validity_flags);
 
@@ -484,43 +521,60 @@ static void cxl_mock_add_event_logs(struct mock_event_store *mes)
 			   CXL_DER_VALID_BANK | CXL_DER_VALID_COLUMN,
 			   &dram.rec.media_hdr.validity_flags);
 
-	mes_add_event(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
-	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+	dev_dbg(dev, "Generating fake event logs %d\n",
+		CXL_EVENT_TYPE_INFO);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_INFO, &maint_needed);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
 		      (struct cxl_event_record_raw *)&gen_media);
-	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+	add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
 		      (struct cxl_event_record_raw *)&mem_module);
 	mes->ev_status |= CXLDEV_EVENT_STATUS_INFO;
 
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+	dev_dbg(dev, "Generating fake event logs %d\n",
+		CXL_EVENT_TYPE_FAIL);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &maint_needed);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
+		      (struct cxl_event_record_raw *)&mem_module);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
 		      (struct cxl_event_record_raw *)&dram);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
 		      (struct cxl_event_record_raw *)&gen_media);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
 		      (struct cxl_event_record_raw *)&mem_module);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
 		      (struct cxl_event_record_raw *)&dram);
 	/* Overflow this log */
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
 	mes->ev_status |= CXLDEV_EVENT_STATUS_FAIL;
 
-	mes_add_event(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
-	mes_add_event(mes, CXL_EVENT_TYPE_FATAL,
+	dev_dbg(dev, "Generating fake event logs %d\n",
+		CXL_EVENT_TYPE_FATAL);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL, &hardware_replace);
+	add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL,
 		      (struct cxl_event_record_raw *)&dram);
 	mes->ev_status |= CXLDEV_EVENT_STATUS_FATAL;
 }
 
+static void cxl_mock_event_trigger(struct device *dev)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct mock_event_store *mes = &mdata->mes;
+
+	cxl_mock_add_event_logs(mdata);
+	cxl_mem_get_event_records(mdata->mds, mes->ev_status);
+}
+
 static int mock_gsl(struct cxl_mbox_cmd *cmd)
 {
 	if (cmd->size_out < sizeof(mock_gsl_payload))
@@ -1453,6 +1507,14 @@ static ssize_t event_trigger_store(struct device *dev,
 }
 static DEVICE_ATTR_WO(event_trigger);
 
+static void init_event_log(struct mock_event_log *log)
+{
+	rwlock_init(&log->lock);
+	/* Handle can never be 0 use 1 based indexing for handle */
+	log->cur_handle = 1;
+	log->next_handle = 1;
+}
+
 static int cxl_mock_mem_probe(struct platform_device *pdev)
 {
 	struct device *dev = &pdev->dev;
@@ -1519,7 +1581,9 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
 	if (rc)
 		return rc;
 
-	cxl_mock_add_event_logs(&mdata->mes);
+	for (int i = 0; i < CXL_EVENT_TYPE_MAX; i++)
+		init_event_log(&mdata->mes.mock_logs[i]);
+	cxl_mock_add_event_logs(mdata);
 
 	cxlmd = devm_cxl_add_memdev(&pdev->dev, cxlds);
 	if (IS_ERR(cxlmd))

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 25/25] tools/testing/cxl: Add DC Regions to mock mem data
  2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
                   ` (23 preceding siblings ...)
  2024-08-16 14:44 ` [PATCH v3 24/25] tools/testing/cxl: Make event logs dynamic Ira Weiny
@ 2024-08-16 14:44 ` Ira Weiny
  2024-08-27 14:39   ` Jonathan Cameron
  24 siblings, 1 reply; 120+ messages in thread
From: Ira Weiny @ 2024-08-16 14:44 UTC (permalink / raw)
  To: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	Ira Weiny, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

cxl_test provides a good way to ensure quick smoke and regression
testing.  The complexity of Dynamic Capacity (DC) extent processing as
well as the complexity of the new sparse DAX regions can mostly be
tested through cxl_test.  This includes management of sparse regions and
DAX devices on those regions; the management of extent device lifetimes;
and the processing of DCD events.

The only missing functionality from this test is actual interrupt
processing.

Mock memory devices can easily mock DC information and manage fake
extent data.

Define mock_dc_region information within the mock memory data.  Add
sysfs entries on the mock device to inject and delete extents.

The inject format is <start>:<length>:<tag>:<more_flag>
The delete format is <start>:<length>

Directly call the event irq callback to simulate irqs to process the
test extents.

Add DC mailbox commands to the CEL and implement those commands.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: add more bit]
[iweiny: merge the 2 test patches together]
---
 tools/testing/cxl/test/mem.c | 703 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 702 insertions(+), 1 deletion(-)

diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index 674fc7f086cd..1a388d0ef052 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -19,6 +19,7 @@
 #define FW_SLOTS 3
 #define DEV_SIZE SZ_2G
 #define EFFECT(x) (1U << x)
+#define BASE_DYNAMIC_CAP_DPA DEV_SIZE
 
 #define MOCK_INJECT_DEV_MAX 8
 #define MOCK_INJECT_TEST_MAX 128
@@ -96,6 +97,22 @@ static struct cxl_cel_entry mock_cel[] = {
 				      EFFECT(SECURITY_CHANGE_IMMEDIATE) |
 				      EFFECT(BACKGROUND_OP)),
 	},
+	{
+		.opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_CONFIG),
+		.effect = CXL_CMD_EFFECT_NONE,
+	},
+	{
+		.opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_EXTENT_LIST),
+		.effect = CXL_CMD_EFFECT_NONE,
+	},
+	{
+		.opcode = cpu_to_le16(CXL_MBOX_OP_ADD_DC_RESPONSE),
+		.effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+	},
+	{
+		.opcode = cpu_to_le16(CXL_MBOX_OP_RELEASE_DC),
+		.effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+	},
 };
 
 /* See CXL 2.0 Table 181 Get Health Info Output Payload */
@@ -153,6 +170,7 @@ struct mock_event_store {
 	u32 ev_status;
 };
 
+#define NUM_MOCK_DC_REGIONS 2
 struct cxl_mockmem_data {
 	void *lsa;
 	void *fw;
@@ -169,6 +187,11 @@ struct cxl_mockmem_data {
 	u8 event_buf[SZ_4K];
 	u64 timestamp;
 	unsigned long sanitize_timeout;
+	struct cxl_dc_region_config dc_regions[NUM_MOCK_DC_REGIONS];
+	u32 dc_ext_generation;
+	struct mutex ext_lock;
+	struct xarray dc_extents;
+	struct xarray dc_accepted_exts;
 };
 
 static struct mock_event_log *event_find_log(struct device *dev, int log_type)
@@ -575,6 +598,237 @@ static void cxl_mock_event_trigger(struct device *dev)
 	cxl_mem_get_event_records(mdata->mds, mes->ev_status);
 }
 
+struct cxl_extent_data {
+	u64 dpa_start;
+	u64 length;
+	u8 tag[CXL_EXTENT_TAG_LEN];
+	bool shared;
+};
+
+static int __devm_add_extent(struct device *dev, struct xarray *array,
+			     u64 start, u64 length, const char *tag,
+			     bool shared)
+{
+	struct cxl_extent_data *extent;
+
+	extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
+	if (!extent)
+		return -ENOMEM;
+
+	extent->dpa_start = start;
+	extent->length = length;
+	memcpy(extent->tag, tag, min(sizeof(extent->tag), strlen(tag)));
+	extent->shared = shared;
+
+	if (xa_insert(array, start, extent, GFP_KERNEL)) {
+		devm_kfree(dev, extent);
+		dev_err(dev, "Failed xarry insert %#llx\n", start);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int devm_add_extent(struct device *dev, u64 start, u64 length,
+			   const char *tag, bool shared)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+
+	guard(mutex)(&mdata->ext_lock);
+	return __devm_add_extent(dev, &mdata->dc_extents, start, length, tag,
+				 shared);
+}
+
+/* It is known that ext and the new range are not equal */
+static struct cxl_extent_data *
+split_ext(struct device *dev, struct xarray *array,
+	  struct cxl_extent_data *ext, u64 start, u64 length)
+{
+	u64 new_start, new_length;
+
+	if (ext->dpa_start == start) {
+		new_start = start + length;
+		new_length = (ext->dpa_start + ext->length) - new_start;
+
+		if (__devm_add_extent(dev, array, new_start, new_length,
+				      ext->tag, false))
+			return NULL;
+
+		ext = xa_erase(array, ext->dpa_start);
+		if (__devm_add_extent(dev, array, start, length, ext->tag,
+				      false))
+			return NULL;
+
+		return xa_load(array, start);
+	}
+
+	/* ext->dpa_start != start */
+
+	if (__devm_add_extent(dev, array, start, length, ext->tag, false))
+		return NULL;
+
+	new_start = ext->dpa_start;
+	new_length = start - ext->dpa_start;
+
+	ext = xa_erase(array, ext->dpa_start);
+	if (__devm_add_extent(dev, array, new_start, new_length, ext->tag,
+			      false))
+		return NULL;
+
+	return xa_load(array, start);
+}
+
+/*
+ * Do not handle extents which are not inside a single extent sent to
+ * the host.
+ */
+static struct cxl_extent_data *
+find_create_ext(struct device *dev, struct xarray *array, u64 start, u64 length)
+{
+	struct cxl_extent_data *ext;
+	unsigned long index;
+
+	xa_for_each(array, index, ext) {
+		u64 end = start + length;
+
+		/* start < [ext) <= start */
+		if (start < ext->dpa_start ||
+		    (ext->dpa_start + ext->length) <= start)
+			continue;
+
+		if (end <= ext->dpa_start ||
+		    (ext->dpa_start + ext->length) < end) {
+			dev_err(dev, "Invalid range %#llx-%#llx\n", start,
+				end);
+			return NULL;
+		}
+
+		break;
+	}
+
+	if (!ext)
+		return NULL;
+
+	if (start == ext->dpa_start && length == ext->length)
+		return ext;
+
+	return split_ext(dev, array, ext, start, length);
+}
+
+static int dc_accept_extent(struct device *dev, u64 start, u64 length)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_extent_data *ext;
+
+	dev_dbg(dev, "Host accepting extent %#llx\n", start);
+	mdata->dc_ext_generation++;
+
+	guard(mutex)(&mdata->ext_lock);
+	ext = find_create_ext(dev, &mdata->dc_extents, start, length);
+	if (!ext) {
+		dev_err(dev, "Extent %#llx-%#llx not found\n",
+			start, start + length);
+		return -ENOMEM;
+	}
+	ext = xa_erase(&mdata->dc_extents, ext->dpa_start);
+	return xa_insert(&mdata->dc_accepted_exts, start, ext, GFP_KERNEL);
+}
+
+static void release_dc_ext(void *md)
+{
+	struct cxl_mockmem_data *mdata = md;
+
+	xa_destroy(&mdata->dc_extents);
+	xa_destroy(&mdata->dc_accepted_exts);
+}
+
+/* Pretend to have some previous accepted extents */
+struct pre_ext_info {
+	u64 offset;
+	u64 length;
+} pre_ext_info[] = {
+	{
+		.offset = SZ_128M,
+		.length = SZ_64M,
+	},
+	{
+		.offset = SZ_256M,
+		.length = SZ_64M,
+	},
+};
+
+static int inject_prev_extents(struct device *dev, u64 base_dpa)
+{
+	int rc;
+
+	dev_dbg(dev, "Adding %ld pre-extents for testing\n",
+		ARRAY_SIZE(pre_ext_info));
+
+	for (int i = 0; i < ARRAY_SIZE(pre_ext_info); i++) {
+		u64 ext_dpa = base_dpa + pre_ext_info[i].offset;
+		u64 ext_len = pre_ext_info[i].length;
+
+		dev_dbg(dev, "Adding pre-extent DPA:%#llx LEN:%#llx\n",
+			ext_dpa, ext_len);
+
+		rc = devm_add_extent(dev, ext_dpa, ext_len, "CXL-TEST", false);
+		if (rc) {
+			dev_err(dev, "Failed to add pre-extent DPA:%#llx LEN:%#llx; %d\n",
+				ext_dpa, ext_len, rc);
+			return rc;
+		}
+
+		rc = dc_accept_extent(dev, ext_dpa, ext_len);
+		if (rc)
+			return rc;
+	}
+	return 0;
+}
+
+static int cxl_mock_dc_region_setup(struct device *dev)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	u64 base_dpa = BASE_DYNAMIC_CAP_DPA;
+	u32 dsmad_handle = 0xFADE;
+	u64 decode_length = SZ_512M;
+	u64 block_size = SZ_512;
+	u64 length = SZ_512M;
+	int rc;
+
+	mutex_init(&mdata->ext_lock);
+	xa_init(&mdata->dc_extents);
+	xa_init(&mdata->dc_accepted_exts);
+
+	rc = devm_add_action_or_reset(dev, release_dc_ext, mdata);
+	if (rc)
+		return rc;
+
+	for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
+		struct cxl_dc_region_config *conf = &mdata->dc_regions[i];
+
+		dev_dbg(dev, "Creating DC region DC%d DPA:%#llx LEN:%#llx\n",
+			i, base_dpa, length);
+
+		conf->region_base = cpu_to_le64(base_dpa);
+		conf->region_decode_length = cpu_to_le64(decode_length /
+						CXL_CAPACITY_MULTIPLIER);
+		conf->region_length = cpu_to_le64(length);
+		conf->region_block_size = cpu_to_le64(block_size);
+		conf->region_dsmad_handle = cpu_to_le32(dsmad_handle);
+		dsmad_handle++;
+
+		rc = inject_prev_extents(dev, base_dpa);
+		if (rc) {
+			dev_err(dev, "Failed to add pre-extents for DC%d\n", i);
+			return rc;
+		}
+
+		base_dpa += decode_length;
+	}
+
+	return 0;
+}
+
 static int mock_gsl(struct cxl_mbox_cmd *cmd)
 {
 	if (cmd->size_out < sizeof(mock_gsl_payload))
@@ -1387,6 +1641,177 @@ static int mock_activate_fw(struct cxl_mockmem_data *mdata,
 	return -EINVAL;
 }
 
+static int mock_get_dc_config(struct device *dev,
+			      struct cxl_mbox_cmd *cmd)
+{
+	struct cxl_mbox_get_dc_config_in *dc_config = cmd->payload_in;
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	u8 region_requested, region_start_idx, region_ret_cnt;
+	struct cxl_mbox_get_dc_config_out *resp;
+	int i;
+
+	region_requested = dc_config->region_count;
+	if (region_requested > NUM_MOCK_DC_REGIONS)
+		region_requested = NUM_MOCK_DC_REGIONS;
+
+	if (cmd->size_out < struct_size(resp, region, region_requested))
+		return -EINVAL;
+
+	memset(cmd->payload_out, 0, cmd->size_out);
+	resp = cmd->payload_out;
+
+	region_start_idx = dc_config->start_region_index;
+	region_ret_cnt = 0;
+	for (i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
+		if (i >= region_start_idx) {
+			memcpy(&resp->region[region_ret_cnt],
+				&mdata->dc_regions[i],
+				sizeof(resp->region[region_ret_cnt]));
+			region_ret_cnt++;
+		}
+	}
+	resp->avail_region_count = NUM_MOCK_DC_REGIONS;
+	resp->regions_returned = i;
+
+	dev_dbg(dev, "Returning %d dc regions\n", region_ret_cnt);
+	return 0;
+}
+
+static int mock_get_dc_extent_list(struct device *dev,
+				   struct cxl_mbox_cmd *cmd)
+{
+	struct cxl_mbox_get_extent_out *resp = cmd->payload_out;
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_mbox_get_extent_in *get = cmd->payload_in;
+	u32 total_avail = 0, total_ret = 0;
+	struct cxl_extent_data *ext;
+	u32 ext_count, start_idx;
+	unsigned long i;
+
+	ext_count = le32_to_cpu(get->extent_cnt);
+	start_idx = le32_to_cpu(get->start_extent_index);
+
+	memset(resp, 0, sizeof(*resp));
+
+	guard(mutex)(&mdata->ext_lock);
+	/*
+	 * Total available needs to be calculated and returned regardless of
+	 * how many can actually be returned.
+	 */
+	xa_for_each(&mdata->dc_accepted_exts, i, ext)
+		total_avail++;
+
+	if (start_idx > total_avail)
+		return -EINVAL;
+
+	xa_for_each(&mdata->dc_accepted_exts, i, ext) {
+		if (total_ret >= ext_count)
+			break;
+
+		if (total_ret >= start_idx) {
+			resp->extent[total_ret].start_dpa =
+						cpu_to_le64(ext->dpa_start);
+			resp->extent[total_ret].length =
+						cpu_to_le64(ext->length);
+			memcpy(&resp->extent[total_ret].tag, ext->tag,
+					sizeof(resp->extent[total_ret]));
+			total_ret++;
+		}
+	}
+
+	resp->returned_extent_count = cpu_to_le32(total_ret);
+	resp->total_extent_count = cpu_to_le32(total_avail);
+	resp->generation_num = cpu_to_le32(mdata->dc_ext_generation);
+
+	dev_dbg(dev, "Returning %d extents of %d total\n",
+		total_ret, total_avail);
+
+	return 0;
+}
+
+static int mock_add_dc_response(struct device *dev,
+				struct cxl_mbox_cmd *cmd)
+{
+	struct cxl_mbox_dc_response *req = cmd->payload_in;
+	u32 list_size = le32_to_cpu(req->extent_list_size);
+
+	for (int i = 0; i < list_size; i++) {
+		u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+		u64 length = le64_to_cpu(req->extent_list[i].length);
+		int rc;
+
+		rc = dc_accept_extent(dev, start, length);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+static void dc_delete_extent(struct device *dev, unsigned long long start,
+			     unsigned long long length)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	unsigned long long end = start + length;
+	struct cxl_extent_data *ext;
+	unsigned long index;
+
+	dev_dbg(dev, "Deleting extent at %#llx len:%#llx\n", start, length);
+
+	guard(mutex)(&mdata->ext_lock);
+	xa_for_each(&mdata->dc_extents, index, ext) {
+		u64 extent_end = ext->dpa_start + ext->length;
+
+		/*
+		 * Any extent which 'touches' the released delete range will be
+		 * removed.
+		 */
+		if ((start <= ext->dpa_start && ext->dpa_start < end) ||
+		    (start <= extent_end && extent_end < end)) {
+			xa_erase(&mdata->dc_extents, ext->dpa_start);
+		}
+	}
+
+	/*
+	 * If the extent was accepted let it be for the host to drop
+	 * later.
+	 */
+}
+
+static int release_accepted_extent(struct device *dev, u64 start, u64 length)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_extent_data *ext;
+
+	guard(mutex)(&mdata->ext_lock);
+	ext = find_create_ext(dev, &mdata->dc_accepted_exts, start, length);
+	if (!ext) {
+		dev_err(dev, "Extent %#llx not in accepted state\n", start);
+		return -EINVAL;
+	}
+	xa_erase(&mdata->dc_accepted_exts, ext->dpa_start);
+	mdata->dc_ext_generation++;
+
+	return 0;
+}
+
+static int mock_dc_release(struct device *dev,
+			   struct cxl_mbox_cmd *cmd)
+{
+	struct cxl_mbox_dc_response *req = cmd->payload_in;
+	u32 list_size = le32_to_cpu(req->extent_list_size);
+
+	for (int i = 0; i < list_size; i++) {
+		u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+		u64 length = le64_to_cpu(req->extent_list[i].length);
+
+		dev_dbg(dev, "Extent %#llx released by host\n", start);
+		release_accepted_extent(dev, start, length);
+	}
+
+	return 0;
+}
+
 static int cxl_mock_mbox_send(struct cxl_memdev_state *mds,
 			      struct cxl_mbox_cmd *cmd)
 {
@@ -1471,6 +1896,18 @@ static int cxl_mock_mbox_send(struct cxl_memdev_state *mds,
 	case CXL_MBOX_OP_ACTIVATE_FW:
 		rc = mock_activate_fw(mdata, cmd);
 		break;
+	case CXL_MBOX_OP_GET_DC_CONFIG:
+		rc = mock_get_dc_config(dev, cmd);
+		break;
+	case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+		rc = mock_get_dc_extent_list(dev, cmd);
+		break;
+	case CXL_MBOX_OP_ADD_DC_RESPONSE:
+		rc = mock_add_dc_response(dev, cmd);
+		break;
+	case CXL_MBOX_OP_RELEASE_DC:
+		rc = mock_dc_release(dev, cmd);
+		break;
 	default:
 		break;
 	}
@@ -1515,6 +1952,14 @@ static void init_event_log(struct mock_event_log *log)
 	log->next_handle = 1;
 }
 
+static void cxl_mock_mem_remove(struct platform_device *pdev)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(&pdev->dev);
+	struct cxl_memdev_state *mds = mdata->mds;
+
+	dev_dbg(mds->cxlds.dev, "Removing extents\n");
+}
+
 static int cxl_mock_mem_probe(struct platform_device *pdev)
 {
 	struct device *dev = &pdev->dev;
@@ -1529,6 +1974,10 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
 		return -ENOMEM;
 	dev_set_drvdata(dev, mdata);
 
+	rc = cxl_mock_dc_region_setup(dev);
+	if (rc)
+		return rc;
+
 	mdata->lsa = vmalloc(LSA_SIZE);
 	if (!mdata->lsa)
 		return -ENOMEM;
@@ -1577,6 +2026,10 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
 	if (rc)
 		return rc;
 
+	rc = cxl_dev_dynamic_capacity_identify(mds);
+	if (rc)
+		return rc;
+
 	rc = cxl_mem_create_range_info(mds);
 	if (rc)
 		return rc;
@@ -1689,14 +2142,261 @@ static ssize_t sanitize_timeout_store(struct device *dev,
 
 	return count;
 }
-
 static DEVICE_ATTR_RW(sanitize_timeout);
 
+/* Return if the proposed extent would break the test code */
+static bool new_extent_valid(struct device *dev, size_t new_start,
+			     size_t new_len)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	struct cxl_extent_data *extent;
+	size_t new_end, i;
+
+	if (!new_len)
+		return false;
+
+	new_end = new_start + new_len;
+
+	dev_dbg(dev, "New extent %zx-%zx\n", new_start, new_end);
+
+	guard(mutex)(&mdata->ext_lock);
+	dev_dbg(dev, "Checking extents starts...\n");
+	xa_for_each(&mdata->dc_extents, i, extent) {
+		if (extent->dpa_start == new_start)
+			return false;
+	}
+
+	dev_dbg(dev, "Checking accepted extents starts...\n");
+	xa_for_each(&mdata->dc_accepted_exts, i, extent) {
+		if (extent->dpa_start == new_start)
+			return false;
+	}
+
+	return true;
+}
+
+struct cxl_test_dcd {
+	uuid_t id;
+	struct cxl_event_dcd rec;
+} __packed;
+
+struct cxl_test_dcd dcd_event_rec_template = {
+	.id = CXL_EVENT_DC_EVENT_UUID,
+	.rec = {
+		.hdr = {
+			.length = sizeof(struct cxl_test_dcd),
+		},
+	},
+};
+
+static int log_dc_event(struct cxl_mockmem_data *mdata, enum dc_event type,
+			u64 start, u64 length, const char *tag_str, bool more)
+{
+	struct device *dev = mdata->mds->cxlds.dev;
+	struct cxl_test_dcd *dcd_event;
+
+	dev_dbg(dev, "mock device log event %d\n", type);
+
+	dcd_event = devm_kmemdup(dev, &dcd_event_rec_template,
+				     sizeof(*dcd_event), GFP_KERNEL);
+	if (!dcd_event)
+		return -ENOMEM;
+
+	dcd_event->rec.flags = 0;
+	if (more)
+		dcd_event->rec.flags |= CXL_DCD_EVENT_MORE;
+	dcd_event->rec.event_type = type;
+	dcd_event->rec.extent.start_dpa = cpu_to_le64(start);
+	dcd_event->rec.extent.length = cpu_to_le64(length);
+	memcpy(dcd_event->rec.extent.tag, tag_str,
+	       min(sizeof(dcd_event->rec.extent.tag),
+		   strlen(tag_str)));
+
+	mes_add_event(mdata, CXL_EVENT_TYPE_DCD,
+		      (struct cxl_event_record_raw *)dcd_event);
+
+	/* Fake the irq */
+	cxl_mem_get_event_records(mdata->mds, CXLDEV_EVENT_STATUS_DCD);
+
+	return 0;
+}
+
+/*
+ * Format <start>:<length>:<tag>
+ *
+ * start and length must be a multiple of the configured region block size.
+ * Tag can be any string up to 16 bytes.
+ *
+ * Extents must be exclusive of other extents
+ */
+static ssize_t __dc_inject_extent_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t count,
+					bool shared)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	unsigned long long start, length, more;
+	char *len_str, *tag_str, *more_str;
+	size_t buf_len = count;
+	int rc;
+
+	char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+	if (!start_str)
+		return -ENOMEM;
+
+	len_str = strnchr(start_str, buf_len, ':');
+	if (!len_str) {
+		dev_err(dev, "Extent failed to find len_str: %s\n", start_str);
+		return -EINVAL;
+	}
+
+	*len_str = '\0';
+	len_str += 1;
+	buf_len -= strlen(start_str);
+
+	tag_str = strnchr(len_str, buf_len, ':');
+	if (!tag_str) {
+		dev_err(dev, "Extent failed to find tag_str: %s\n", len_str);
+		return -EINVAL;
+	}
+	*tag_str = '\0';
+	tag_str += 1;
+
+	more_str = strnchr(tag_str, buf_len, ':');
+	if (!more_str) {
+		dev_err(dev, "Extent failed to find more_str: %s\n", tag_str);
+		return -EINVAL;
+	}
+	*more_str = '\0';
+	more_str += 1;
+
+	if (kstrtoull(start_str, 0, &start)) {
+		dev_err(dev, "Extent failed to parse start: %s\n", start_str);
+		return -EINVAL;
+	}
+
+	if (kstrtoull(len_str, 0, &length)) {
+		dev_err(dev, "Extent failed to parse length: %s\n", len_str);
+		return -EINVAL;
+	}
+
+	if (kstrtoull(more_str, 0, &more)) {
+		dev_err(dev, "Extent failed to parse more: %s\n", more_str);
+		return -EINVAL;
+	}
+
+	if (!new_extent_valid(dev, start, length))
+		return -EINVAL;
+
+	rc = devm_add_extent(dev, start, length, tag_str, shared);
+	if (rc) {
+		dev_err(dev, "Failed to add extent DPA:%#llx LEN:%#llx; %d\n",
+			start, length, rc);
+		return rc;
+	}
+
+	rc = log_dc_event(mdata, DCD_ADD_CAPACITY, start, length, tag_str, more);
+	if (rc) {
+		dev_err(dev, "Failed to add event %d\n", rc);
+		return rc;
+	}
+
+	return count;
+}
+
+static ssize_t dc_inject_extent_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t count)
+{
+	return __dc_inject_extent_store(dev, attr, buf, count, false);
+}
+static DEVICE_ATTR_WO(dc_inject_extent);
+
+static ssize_t dc_inject_shared_extent_store(struct device *dev,
+					     struct device_attribute *attr,
+					     const char *buf, size_t count)
+{
+	return __dc_inject_extent_store(dev, attr, buf, count, true);
+}
+static DEVICE_ATTR_WO(dc_inject_shared_extent);
+
+static ssize_t __dc_del_extent_store(struct device *dev,
+				     struct device_attribute *attr,
+				     const char *buf, size_t count,
+				     enum dc_event type)
+{
+	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+	unsigned long long start, length;
+	char *len_str;
+	int rc;
+
+	char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+	if (!start_str)
+		return -ENOMEM;
+
+	len_str = strnchr(start_str, count, ':');
+	if (!len_str) {
+		dev_err(dev, "Failed to find len_str: %s\n", start_str);
+		return -EINVAL;
+	}
+	*len_str = '\0';
+	len_str += 1;
+
+	if (kstrtoull(start_str, 0, &start)) {
+		dev_err(dev, "Failed to parse start: %s\n", start_str);
+		return -EINVAL;
+	}
+
+	if (kstrtoull(len_str, 0, &length)) {
+		dev_err(dev, "Failed to parse length: %s\n", len_str);
+		return -EINVAL;
+	}
+
+	dc_delete_extent(dev, start, length);
+
+	if (type == DCD_FORCED_CAPACITY_RELEASE)
+		dev_dbg(dev, "Forcing delete of extent %#llx len:%#llx\n",
+			start, length);
+
+	rc = log_dc_event(mdata, type, start, length, "", false);
+	if (rc) {
+		dev_err(dev, "Failed to add event %d\n", rc);
+		return rc;
+	}
+
+	return count;
+}
+
+/*
+ * Format <start>:<length>
+ */
+static ssize_t dc_del_extent_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t count)
+{
+	return __dc_del_extent_store(dev, attr, buf, count,
+				     DCD_RELEASE_CAPACITY);
+}
+static DEVICE_ATTR_WO(dc_del_extent);
+
+static ssize_t dc_force_del_extent_store(struct device *dev,
+					 struct device_attribute *attr,
+					 const char *buf, size_t count)
+{
+	return __dc_del_extent_store(dev, attr, buf, count,
+				     DCD_FORCED_CAPACITY_RELEASE);
+}
+static DEVICE_ATTR_WO(dc_force_del_extent);
+
 static struct attribute *cxl_mock_mem_attrs[] = {
 	&dev_attr_security_lock.attr,
 	&dev_attr_event_trigger.attr,
 	&dev_attr_fw_buf_checksum.attr,
 	&dev_attr_sanitize_timeout.attr,
+	&dev_attr_dc_inject_extent.attr,
+	&dev_attr_dc_inject_shared_extent.attr,
+	&dev_attr_dc_del_extent.attr,
+	&dev_attr_dc_force_del_extent.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(cxl_mock_mem);
@@ -1710,6 +2410,7 @@ MODULE_DEVICE_TABLE(platform, cxl_mock_mem_ids);
 
 static struct platform_driver cxl_mock_mem_driver = {
 	.probe = cxl_mock_mem_probe,
+	.remove_new = cxl_mock_mem_remove,
 	.id_table = cxl_mock_mem_ids,
 	.driver = {
 		.name = KBUILD_MODNAME,

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 03/25] dax: Document dax dev range tuple
  2024-08-16 14:44 ` [PATCH v3 03/25] dax: Document dax dev range tuple Ira Weiny
@ 2024-08-16 20:58   ` Dave Jiang
  2024-08-23 15:29   ` Jonathan Cameron
  1 sibling, 0 replies; 120+ messages in thread
From: Dave Jiang @ 2024-08-16 20:58 UTC (permalink / raw)
  To: Ira Weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, Ira Weiny wrote:
> The device DAX structure is being enhanced to track additional DCD
> information.
> 
> The current range tuple was not fully documented.  Document it prior to
> adding information for DC.
> 
> Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> ---
> Changes:
> [iweiny: move to start of series]
> ---
>  drivers/dax/dax-private.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index 446617b73aea..ccde98c3d4e2 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -58,7 +58,10 @@ struct dax_mapping {
>   * @dev - device core
>   * @pgmap - pgmap for memmap setup / lifetime (driver owned)
>   * @nr_range: size of @ranges
> - * @ranges: resource-span + pgoff tuples for the instance
> + * @ranges: range tuples of memory used
> + * @pgoff: page offset
> + * @range: resource-span
> + * @mapping: device to assist in interrogating the range layout
>   */
>  struct dev_dax {
>  	struct dax_region *region;
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 06/25] cxl/mem: Read dynamic capacity configuration from the device
  2024-08-16 14:44 ` [PATCH v3 06/25] cxl/mem: Read dynamic capacity configuration from the device ira.weiny
@ 2024-08-16 21:45   ` Dave Jiang
  2024-08-20 17:01     ` Fan Ni
  2024-08-23 15:45   ` Jonathan Cameron
  1 sibling, 1 reply; 120+ messages in thread
From: Dave Jiang @ 2024-08-16 21:45 UTC (permalink / raw)
  To: ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm, Li, Ming



On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Devices which optionally support Dynamic Capacity (DC) are configured
> via mailbox commands.  CXL 3.1 requires the host to issue the Get DC
> Configuration command in order to properly configure DCDs.  Without the
> Get DC Configuration command DCD can't be supported.
> 
> Implement the DC mailbox commands as specified in CXL 3.1 section
> 8.2.9.9.9 (opcodes 48XXh) to read and store the DCD configuration
> information.  Disable DCD if DCD is not supported.  Leverage the Get DC
> Configuration command supported bit to indicate if DCD support.
> 
> Linux has no use for the trailing fields of the Get Dynamic Capacity
> Configuration Output Payload (Total number of supported extents, number
> of available extents, total number of supported tags, and number of
> available tags).  Avoid defining those fields to use the more useful
> dynamic C array.
> 
> Cc: "Li, Ming" <ming4.li@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes:
> [Li, Ming: Fix bug in total_bytes calculation]
> [iweiny: update commit message]
> [Jonathan: fix formatting]
> [Jonathan: Define block line size]
> [Jonathan/Fan: use regions returned field instead of macro in get config]
> [Jørgen: Rename memdev state range variables]
> [Jonathan: adjust use of rc in cxl_dev_dynamic_capacity_identify()]
> [Jonathan: white space cleanup]
> [fan: make a comment about the trailing configuration output fields]
> ---
>  drivers/cxl/core/mbox.c | 171 +++++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/cxlmem.h    |  64 +++++++++++++++++-
>  drivers/cxl/pci.c       |   4 ++
>  3 files changed, 237 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 8eb196858abe..68c26c4be91a 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1157,7 +1157,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
>  	if (rc < 0)
>  		return rc;
>  
> -	mds->total_bytes =
> +	mds->static_bytes =
>  		le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
>  	mds->volatile_only_bytes =
>  		le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> @@ -1264,6 +1264,159 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
>  	return rc;
>  }
>  
> +static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, u8 index,
> +				   struct cxl_dc_region_config *region_config)
> +{
> +	struct cxl_dc_region_info *dcr = &mds->dc_region[index];
> +	struct device *dev = mds->cxlds.dev;
> +
> +	dcr->base = le64_to_cpu(region_config->region_base);
> +	dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
> +	dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> +	dcr->len = le64_to_cpu(region_config->region_length);
> +	dcr->blk_size = le64_to_cpu(region_config->region_block_size);
> +	dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
> +	dcr->flags = region_config->flags;
> +	snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
> +
> +	/* Check regions are in increasing DPA order */
> +	if (index > 0) {
> +		struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
> +
> +		if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
> +			dev_err(dev,
> +				"DPA ordering violation for DC region %d and %d\n",
> +				index - 1, index);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	if (!IS_ALIGNED(dcr->base, SZ_256M) ||
> +	    !IS_ALIGNED(dcr->base, dcr->blk_size)) {
> +		dev_err(dev, "DC region %d invalid base %#llx blk size %#llx\n",
> +			index, dcr->base, dcr->blk_size);
> +		return -EINVAL;
> +	}
> +
> +	if (dcr->decode_len == 0 || dcr->len == 0 || dcr->decode_len < dcr->len ||
> +	    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> +		dev_err(dev, "DC region %d invalid length; decode %#llx len %#llx blk size %#llx\n",
> +			index, dcr->decode_len, dcr->len, dcr->blk_size);
> +		return -EINVAL;
> +	}
> +
> +	if (dcr->blk_size == 0 || dcr->blk_size % CXL_DCD_BLOCK_LINE_SIZE ||
> +	    !is_power_of_2(dcr->blk_size)) {
> +		dev_err(dev, "DC region %d invalid block size; %#llx\n",
> +			index, dcr->blk_size);
> +		return -EINVAL;
> +	}
> +
> +	dev_dbg(dev,
> +		"DC region %s base %#llx length %#llx block size %#llx\n",
> +		dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> +
> +	return 0;
> +}
> +
> +/* Returns the number of regions in dc_resp or -ERRNO */
> +static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
> +			     struct cxl_mbox_get_dc_config_out *dc_resp,
> +			     size_t dc_resp_size)
> +{
> +	struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
> +		.region_count = CXL_MAX_DC_REGION,
> +		.start_region_index = start_region,
> +	};
> +	struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> +		.payload_in = &get_dc,
> +		.size_in = sizeof(get_dc),
> +		.size_out = dc_resp_size,
> +		.payload_out = dc_resp,
> +		.min_out = 1,
> +	};
> +	struct device *dev = mds->cxlds.dev;
> +	int rc;
> +
> +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	if (rc < 0)
> +		return rc;
> +
> +	dev_dbg(dev, "Read %d/%d DC regions\n",
> +		dc_resp->regions_returned, dc_resp->avail_region_count);
> +	return dc_resp->regions_returned;
> +}
> +
> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + *					 information from the device.
> + * @mds: The memory device state
> + *
> + * Read Dynamic Capacity information from the device and populate the state
> + * structures for later use.
> + *
> + * Return: 0 if identify was executed successfully, -ERRNO on error.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> +	size_t dc_resp_size = mds->payload_size;
> +	struct device *dev = mds->cxlds.dev;
> +	u8 start_region, i;
> +
> +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> +		snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> +
> +	if (!cxl_dcd_supported(mds)) {
> +		dev_dbg(dev, "DCD not supported\n");
> +		return 0;
> +	}

This should happen before you pre-format the name string? I would assume that if DCD is not supported then the dcd name sysfs attribs would be not be visible?

> +
> +	struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
> +					kvmalloc(dc_resp_size, GFP_KERNEL);
> +	if (!dc_resp)
> +		return -ENOMEM;
> +
> +	start_region = 0;
> +	do {
> +		int rc, j;
> +
> +		rc = cxl_get_dc_config(mds, start_region, dc_resp, dc_resp_size);
> +		if (rc < 0) {
> +			dev_dbg(dev, "Failed to get DC config: %d\n", rc);
> +			return rc;
> +		}
> +
> +		mds->nr_dc_region += rc;
> +
> +		if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> +			dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> +				mds->nr_dc_region);
> +			return -EINVAL;
> +		}
> +
> +		for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {

This should be 'j < mds->nr_dc_region'? Otherwise if your start region say is '3' and you have '2' DC regions, you never enter the loop. Or does that not happen? I also wonder if you need to check if 'start_region + mds->nr_dc_region > CXL_MAX_DC_REGION'.

> +			rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> +			if (rc) {
> +				dev_dbg(dev, "Failed to save region info: %d\n", rc);
> +				return rc;
> +			}
> +		}
> +
> +		start_region = mds->nr_dc_region;
> +
> +	} while (mds->nr_dc_region < dc_resp->avail_region_count);
> +
> +	mds->dynamic_bytes =
> +		mds->dc_region[mds->nr_dc_region - 1].base +
> +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> +		mds->dc_region[0].base;
> +	dev_dbg(dev, "Total dynamic range: %#llx\n", mds->dynamic_bytes);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> +
>  static int add_dpa_res(struct device *dev, struct resource *parent,
>  		       struct resource *res, resource_size_t start,
>  		       resource_size_t size, const char *type)
> @@ -1294,8 +1447,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  {
>  	struct cxl_dev_state *cxlds = &mds->cxlds;
>  	struct device *dev = cxlds->dev;
> +	size_t untenanted_mem;
>  	int rc;
>  
> +	mds->total_bytes = mds->static_bytes;
> +	if (mds->nr_dc_region) {
> +		untenanted_mem = mds->dc_region[0].base - mds->static_bytes;
> +		mds->total_bytes += untenanted_mem + mds->dynamic_bytes;
> +	}
> +
>  	if (!cxlds->media_ready) {
>  		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
>  		cxlds->ram_res = DEFINE_RES_MEM(0, 0);
> @@ -1305,6 +1465,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  
>  	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
>  
> +	for (int i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> +				 dcr->base, dcr->decode_len, dcr->name);
> +		if (rc)
> +			return rc;
> +	}
> +
>  	if (mds->partition_align_bytes == 0) {
>  		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
>  				 mds->volatile_only_bytes, "ram");
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index f2f8b567e0e7..b4eb8164d05d 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -402,6 +402,7 @@ enum cxl_devtype {
>  	CXL_DEVTYPE_CLASSMEM,
>  };
>  
> +#define CXL_MAX_DC_REGION 8
>  /**
>   * struct cxl_dpa_perf - DPA performance property entry
>   * @dpa_range: range for DPA address
> @@ -431,6 +432,8 @@ struct cxl_dpa_perf {
>   * @dpa_res: Overall DPA resource tree for the device
>   * @pmem_res: Active Persistent memory capacity configuration
>   * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + *          region
>   * @serial: PCIe Device Serial Number
>   * @type: Generic Memory Class device or Vendor Specific Memory device
>   */
> @@ -445,10 +448,22 @@ struct cxl_dev_state {
>  	struct resource dpa_res;
>  	struct resource pmem_res;
>  	struct resource ram_res;
> +	struct resource dc_res[CXL_MAX_DC_REGION];
>  	u64 serial;
>  	enum cxl_devtype type;
>  };
>  
> +#define CXL_DC_REGION_STRLEN > +struct cxl_dc_region_info {
> +	u64 base;
> +	u64 decode_len;
> +	u64 len;
> +	u64 blk_size;
> +	u32 dsmad_handle;
> +	u8 flags;
> +	u8 name[CXL_DC_REGION_STRLEN];
> +};

Does this need kdoc comments?


> +
>  /**
>   * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
>   *
> @@ -466,7 +481,9 @@ struct cxl_dev_state {
>   * @dcd_cmds: List of DCD commands implemented by memory device
>   * @enabled_cmds: Hardware commands found enabled in CEL.
>   * @exclusive_cmds: Commands that are kernel-internal only
> - * @total_bytes: sum of all possible capacities
> + * @total_bytes: length of all possible capacities
> + * @static_bytes: length of possible static RAM and PMEM partitions
> + * @dynamic_bytes: length of possible DC partitions (DC Regions)

Did this get added to the wrong struct comment header? 'cxl_dev_state' instead of 'cxl_memdev_state'?
>   * @volatile_only_bytes: hard volatile capacity
>   * @persistent_only_bytes: hard persistent capacity
>   * @partition_align_bytes: alignment size for partition-able capacity
> @@ -476,6 +493,8 @@ struct cxl_dev_state {
>   * @next_persistent_bytes: persistent capacity change pending device reset
>   * @ram_perf: performance data entry matched to RAM partition
>   * @pmem_perf: performance data entry matched to PMEM partition
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
Did this get added to the wrong struct comment header? 'cxl_dev_state' instead of 'cxl_memdev_state'?

DJ

>   * @event: event log driver state
>   * @poison: poison driver state info
>   * @security: security driver state info
> @@ -496,6 +515,8 @@ struct cxl_memdev_state {
>  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
>  	u64 total_bytes;
> +	u64 static_bytes;
> +	u64 dynamic_bytes;
>  	u64 volatile_only_bytes;
>  	u64 persistent_only_bytes;
>  	u64 partition_align_bytes;
> @@ -507,6 +528,9 @@ struct cxl_memdev_state {
>  	struct cxl_dpa_perf ram_perf;
>  	struct cxl_dpa_perf pmem_perf;
>  
> +	u8 nr_dc_region;
> +	struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> +
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
>  	struct cxl_security_state security;
> @@ -709,6 +733,32 @@ struct cxl_mbox_set_partition_info {
>  
>  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
>  
> +/* See CXL 3.1 Table 8-163 get dynamic capacity config Input Payload */
> +struct cxl_mbox_get_dc_config_in {
> +	u8 region_count;
> +	u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.1 Table 8-164 get dynamic capacity config Output Payload */
> +struct cxl_mbox_get_dc_config_out {
> +	u8 avail_region_count;
> +	u8 regions_returned;
> +	u8 rsvd[6];
> +	/* See CXL 3.1 Table 8-165 */
> +	struct cxl_dc_region_config {
> +		__le64 region_base;
> +		__le64 region_decode_length;
> +		__le64 region_length;
> +		__le64 region_block_size;
> +		__le32 region_dsmad_handle;
> +		u8 flags;
> +		u8 rsvd[3];
> +	} __packed region[];
> +	/* Trailing fields unused */
> +} __packed;
> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> +#define CXL_DCD_BLOCK_LINE_SIZE 0x40
> +
>  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>  struct cxl_mbox_set_timestamp_in {
>  	__le64 timestamp;
> @@ -832,6 +882,7 @@ enum {
>  int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
>  			  struct cxl_mbox_cmd *cmd);
>  int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
>  int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>  int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>  int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> @@ -845,6 +896,17 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  			    enum cxl_event_log_type type,
>  			    enum cxl_event_type event_type,
>  			    const uuid_t *uuid, union cxl_event *evt);
> +
> +static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> +{
> +	return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +}
> +
> +static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
> +{
> +	clear_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +}
> +
>  int cxl_set_timestamp(struct cxl_memdev_state *mds);
>  int cxl_poison_state_init(struct cxl_memdev_state *mds);
>  int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 3a60cd66263e..f7f03599bc83 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -874,6 +874,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> +	rc = cxl_dev_dynamic_capacity_identify(mds);
> +	if (rc)
> +		cxl_disable_dcd(mds);
> +
>  	rc = cxl_mem_create_range_info(mds);
>  	if (rc)
>  		return rc;
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 07/25] cxl/core: Separate region mode from decoder mode
  2024-08-16 14:44 ` [PATCH v3 07/25] cxl/core: Separate region mode from decoder mode ira.weiny
@ 2024-08-16 22:11   ` Dave Jiang
  2024-08-23 15:47   ` Jonathan Cameron
  2024-09-03  6:56   ` Li, Ming4
  2 siblings, 0 replies; 120+ messages in thread
From: Dave Jiang @ 2024-08-16 22:11 UTC (permalink / raw)
  To: ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Until now region modes and decoder modes were equivalent in that both
> modes were either PMEM or RAM.  The addition of Dynamic
> Capacity partitions defines up to 8 DC partitions per device.
> 
> The region mode is thus no longer equivalent to the endpoint decoder
> mode.  IOW the endpoint decoders may have modes of DC0-DC7 while the
> region mode is simply DC.
> 
> Define a new region mode enumeration which applies to regions separate
> from the decoder mode.  Adjust the code to process these modes
> independently.
> 
> There is no equal to decoder mode dead in region modes.  Avoid
> constructing regions with decoders which have been flagged as dead.
> 
> Suggested-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> 
> ---
> Changes:
> [iweiny: rebase]
> [Jonathan: remove dead code]
> [Jonathan: clarify commit message]
> ---
>  drivers/cxl/core/region.c | 75 ++++++++++++++++++++++++++++++++++-------------
>  drivers/cxl/cxl.h         | 26 ++++++++++++++--
>  2 files changed, 79 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 971a314b6b0e..796e5a791e44 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -144,7 +144,7 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
>  	rc = down_read_interruptible(&cxl_region_rwsem);
>  	if (rc)
>  		return rc;
> -	if (cxlr->mode != CXL_DECODER_PMEM)
> +	if (cxlr->mode != CXL_REGION_PMEM)
>  		rc = sysfs_emit(buf, "\n");
>  	else
>  		rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
> @@ -457,7 +457,7 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
>  	 * Support tooling that expects to find a 'uuid' attribute for all
>  	 * regions regardless of mode.
>  	 */
> -	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
> +	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_REGION_PMEM)
>  		return 0444;
>  	return a->mode;
>  }
> @@ -620,7 +620,7 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
>  {
>  	struct cxl_region *cxlr = to_cxl_region(dev);
>  
> -	return sysfs_emit(buf, "%s\n", cxl_decoder_mode_name(cxlr->mode));
> +	return sysfs_emit(buf, "%s\n", cxl_region_mode_name(cxlr->mode));
>  }
>  static DEVICE_ATTR_RO(mode);
>  
> @@ -646,7 +646,7 @@ static int alloc_hpa(struct cxl_region *cxlr, resource_size_t size)
>  
>  	/* ways, granularity and uuid (if PMEM) need to be set before HPA */
>  	if (!p->interleave_ways || !p->interleave_granularity ||
> -	    (cxlr->mode == CXL_DECODER_PMEM && uuid_is_null(&p->uuid)))
> +	    (cxlr->mode == CXL_REGION_PMEM && uuid_is_null(&p->uuid)))
>  		return -ENXIO;
>  
>  	div64_u64_rem(size, (u64)SZ_256M * p->interleave_ways, &remainder);
> @@ -1863,6 +1863,17 @@ static int cxl_region_sort_targets(struct cxl_region *cxlr)
>  	return rc;
>  }
>  
> +static bool cxl_modes_compatible(enum cxl_region_mode rmode,
> +				 enum cxl_decoder_mode dmode)
> +{
> +	if (rmode == CXL_REGION_RAM && dmode == CXL_DECODER_RAM)
> +		return true;
> +	if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
> +		return true;
> +
> +	return false;
> +}
> +
>  static int cxl_region_attach(struct cxl_region *cxlr,
>  			     struct cxl_endpoint_decoder *cxled, int pos)
>  {
> @@ -1882,9 +1893,11 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  		return rc;
>  	}
>  
> -	if (cxled->mode != cxlr->mode) {
> -		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
> -			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
> +	if (!cxl_modes_compatible(cxlr->mode, cxled->mode)) {
> +		dev_dbg(&cxlr->dev, "%s region mode: %s mismatch decoder: %s\n",
> +			dev_name(&cxled->cxld.dev),
> +			cxl_region_mode_name(cxlr->mode),
> +			cxl_decoder_mode_name(cxled->mode));
>  		return -EINVAL;
>  	}
>  
> @@ -2447,7 +2460,7 @@ static int cxl_region_calculate_adistance(struct notifier_block *nb,
>   * devm_cxl_add_region - Adds a region to a decoder
>   * @cxlrd: root decoder
>   * @id: memregion id to create, or memregion_free() on failure
> - * @mode: mode for the endpoint decoders of this region
> + * @mode: mode of this region
>   * @type: select whether this is an expander or accelerator (type-2 or type-3)
>   *
>   * This is the second step of region initialization. Regions exist within an
> @@ -2458,7 +2471,7 @@ static int cxl_region_calculate_adistance(struct notifier_block *nb,
>   */
>  static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>  					      int id,
> -					      enum cxl_decoder_mode mode,
> +					      enum cxl_region_mode mode,
>  					      enum cxl_decoder_type type)
>  {
>  	struct cxl_port *port = to_cxl_port(cxlrd->cxlsd.cxld.dev.parent);
> @@ -2512,16 +2525,17 @@ static ssize_t create_ram_region_show(struct device *dev,
>  }
>  
>  static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
> -					  enum cxl_decoder_mode mode, int id)
> +					  enum cxl_region_mode mode, int id)
>  {
>  	int rc;
>  
>  	switch (mode) {
> -	case CXL_DECODER_RAM:
> -	case CXL_DECODER_PMEM:
> +	case CXL_REGION_RAM:
> +	case CXL_REGION_PMEM:
>  		break;
>  	default:
> -		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> +		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
> +			cxl_region_mode_name(mode));
>  		return ERR_PTR(-EINVAL);
>  	}
>  
> @@ -2549,7 +2563,7 @@ static ssize_t create_pmem_region_store(struct device *dev,
>  	if (rc != 1)
>  		return -EINVAL;
>  
> -	cxlr = __create_region(cxlrd, CXL_DECODER_PMEM, id);
> +	cxlr = __create_region(cxlrd, CXL_REGION_PMEM, id);
>  	if (IS_ERR(cxlr))
>  		return PTR_ERR(cxlr);
>  
> @@ -2569,7 +2583,7 @@ static ssize_t create_ram_region_store(struct device *dev,
>  	if (rc != 1)
>  		return -EINVAL;
>  
> -	cxlr = __create_region(cxlrd, CXL_DECODER_RAM, id);
> +	cxlr = __create_region(cxlrd, CXL_REGION_RAM, id);
>  	if (IS_ERR(cxlr))
>  		return PTR_ERR(cxlr);
>  
> @@ -3215,6 +3229,22 @@ static int match_region_by_range(struct device *dev, void *data)
>  	return rc;
>  }
>  
> +static enum cxl_region_mode
> +cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
> +{
> +	switch (mode) {
> +	case CXL_DECODER_NONE:
> +		return CXL_REGION_NONE;
> +	case CXL_DECODER_RAM:
> +		return CXL_REGION_RAM;
> +	case CXL_DECODER_PMEM:
> +		return CXL_REGION_PMEM;
> +	case CXL_DECODER_MIXED:
> +	default:
> +		return CXL_REGION_MIXED;
> +	}
> +}
> +
>  /* Establish an empty region covering the given HPA range */
>  static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>  					   struct cxl_endpoint_decoder *cxled)
> @@ -3223,12 +3253,17 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>  	struct cxl_port *port = cxlrd_to_port(cxlrd);
>  	struct range *hpa = &cxled->cxld.hpa_range;
>  	struct cxl_region_params *p;
> +	enum cxl_region_mode mode;
>  	struct cxl_region *cxlr;
>  	struct resource *res;
>  	int rc;
>  
> +	if (cxled->mode == CXL_DECODER_DEAD)
> +		return ERR_PTR(-EINVAL);
> +
> +	mode = cxl_decoder_to_region_mode(cxled->mode);
>  	do {
> -		cxlr = __create_region(cxlrd, cxled->mode,
> +		cxlr = __create_region(cxlrd, mode,
>  				       atomic_read(&cxlrd->region_id));
>  	} while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
>  
> @@ -3431,9 +3466,9 @@ static int cxl_region_probe(struct device *dev)
>  		return rc;
>  
>  	switch (cxlr->mode) {
> -	case CXL_DECODER_PMEM:
> +	case CXL_REGION_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
> -	case CXL_DECODER_RAM:
> +	case CXL_REGION_RAM:
>  		/*
>  		 * The region can not be manged by CXL if any portion of
>  		 * it is already online as 'System RAM'
> @@ -3445,8 +3480,8 @@ static int cxl_region_probe(struct device *dev)
>  			return 0;
>  		return devm_cxl_add_dax_region(cxlr);
>  	default:
> -		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
> -			cxlr->mode);
> +		dev_dbg(&cxlr->dev, "unsupported region mode: %s\n",
> +			cxl_region_mode_name(cxlr->mode));
>  		return -ENXIO;
>  	}
>  }
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 9afb407d438f..f766b2a8bf53 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -388,6 +388,27 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>  
> +enum cxl_region_mode {
> +	CXL_REGION_NONE,
> +	CXL_REGION_RAM,
> +	CXL_REGION_PMEM,
> +	CXL_REGION_MIXED,
> +};
> +
> +static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
> +{
> +	static const char * const names[] = {
> +		[CXL_REGION_NONE] = "none",
> +		[CXL_REGION_RAM] = "ram",
> +		[CXL_REGION_PMEM] = "pmem",
> +		[CXL_REGION_MIXED] = "mixed",
> +	};
> +
> +	if (mode >= CXL_REGION_NONE && mode <= CXL_REGION_MIXED)
> +		return names[mode];
> +	return "mixed";
> +}
> +
>  /*
>   * Track whether this decoder is reserved for region autodiscovery, or
>   * free for userspace provisioning.
> @@ -515,7 +536,8 @@ struct cxl_region_params {
>   * struct cxl_region - CXL region
>   * @dev: This region's device
>   * @id: This region's id. Id is globally unique across all regions
> - * @mode: Endpoint decoder allocation / access mode
> + * @mode: Region mode which defines which endpoint decoder modes the region is
> + *        compatible with
>   * @type: Endpoint decoder target type
>   * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
>   * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> @@ -528,7 +550,7 @@ struct cxl_region_params {
>  struct cxl_region {
>  	struct device dev;
>  	int id;
> -	enum cxl_decoder_mode mode;
> +	enum cxl_region_mode mode;
>  	enum cxl_decoder_type type;
>  	struct cxl_nvdimm_bridge *cxl_nvb;
>  	struct cxl_pmem_region *cxlr_pmem;
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 08/25] cxl/region: Add dynamic capacity decoder and region modes
  2024-08-16 14:44 ` [PATCH v3 08/25] cxl/region: Add dynamic capacity decoder and region modes ira.weiny
@ 2024-08-16 22:14   ` Dave Jiang
  2024-09-03  6:57   ` Li, Ming4
  1 sibling, 0 replies; 120+ messages in thread
From: Dave Jiang @ 2024-08-16 22:14 UTC (permalink / raw)
  To: ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> One or more decoders each pointing to a Dynamic Capacity (DC) partition
> form a CXL software region.  The region mode reflects composition of
> that entire software region.  Decoder mode reflects a specific DC
> partition.  DC partitions are also known as DC regions per CXL
> specification r3.1.
> 
> Define the new modes and helper functions required to make the
> association between these new modes.
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> ---
> Changes:
> [iweiny: keep tags on simple patch]
> [Fan: s/partitions/partition/]
> [djiang: New wording for the commit message]
> [iweiny: reword commit message more]
> ---
>  drivers/cxl/core/region.c |  4 ++++
>  drivers/cxl/cxl.h         | 23 +++++++++++++++++++++++
>  2 files changed, 27 insertions(+)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 796e5a791e44..650fe33f2ed4 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1870,6 +1870,8 @@ static bool cxl_modes_compatible(enum cxl_region_mode rmode,
>  		return true;
>  	if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
>  		return true;
> +	if (rmode == CXL_REGION_DC && cxl_decoder_mode_is_dc(dmode))
> +		return true;
>  
>  	return false;
>  }
> @@ -3239,6 +3241,8 @@ cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
>  		return CXL_REGION_RAM;
>  	case CXL_DECODER_PMEM:
>  		return CXL_REGION_PMEM;
> +	case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
> +		return CXL_REGION_DC;
>  	case CXL_DECODER_MIXED:
>  	default:
>  		return CXL_REGION_MIXED;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index f766b2a8bf53..d2674ab46f35 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -370,6 +370,14 @@ enum cxl_decoder_mode {
>  	CXL_DECODER_NONE,
>  	CXL_DECODER_RAM,
>  	CXL_DECODER_PMEM,
> +	CXL_DECODER_DC0,
> +	CXL_DECODER_DC1,
> +	CXL_DECODER_DC2,
> +	CXL_DECODER_DC3,
> +	CXL_DECODER_DC4,
> +	CXL_DECODER_DC5,
> +	CXL_DECODER_DC6,
> +	CXL_DECODER_DC7,
>  	CXL_DECODER_MIXED,
>  	CXL_DECODER_DEAD,
>  };
> @@ -380,6 +388,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  		[CXL_DECODER_NONE] = "none",
>  		[CXL_DECODER_RAM] = "ram",
>  		[CXL_DECODER_PMEM] = "pmem",
> +		[CXL_DECODER_DC0] = "dc0",
> +		[CXL_DECODER_DC1] = "dc1",
> +		[CXL_DECODER_DC2] = "dc2",
> +		[CXL_DECODER_DC3] = "dc3",
> +		[CXL_DECODER_DC4] = "dc4",
> +		[CXL_DECODER_DC5] = "dc5",
> +		[CXL_DECODER_DC6] = "dc6",
> +		[CXL_DECODER_DC7] = "dc7",
>  		[CXL_DECODER_MIXED] = "mixed",
>  	};
>  
> @@ -388,10 +404,16 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  	return "mixed";
>  }
>  
> +static inline bool cxl_decoder_mode_is_dc(enum cxl_decoder_mode mode)
> +{
> +	return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> +}
> +
>  enum cxl_region_mode {
>  	CXL_REGION_NONE,
>  	CXL_REGION_RAM,
>  	CXL_REGION_PMEM,
> +	CXL_REGION_DC,
>  	CXL_REGION_MIXED,
>  };
>  
> @@ -401,6 +423,7 @@ static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
>  		[CXL_REGION_NONE] = "none",
>  		[CXL_REGION_RAM] = "ram",
>  		[CXL_REGION_PMEM] = "pmem",
> +		[CXL_REGION_DC] = "dc",
>  		[CXL_REGION_MIXED] = "mixed",
>  	};
>  
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 09/25] cxl/hdm: Add dynamic capacity size support to endpoint decoders
  2024-08-16 14:44 ` [PATCH v3 09/25] cxl/hdm: Add dynamic capacity size support to endpoint decoders ira.weiny
@ 2024-08-16 23:08   ` Dave Jiang
  2024-08-23  2:26     ` Ira Weiny
  2024-08-23 16:09   ` Jonathan Cameron
  1 sibling, 1 reply; 120+ messages in thread
From: Dave Jiang @ 2024-08-16 23:08 UTC (permalink / raw)
  To: ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> To support Dynamic Capacity Devices (DCD) endpoint decoders will need to
> map DC partitions (regions).  In addition to assigning the size of the
> DC partition, the decoder must assign any skip value from the previous
> decoder.  This must be done within a contiguous DPA space.
> 
> Two complications arise with Dynamic Capacity regions which did not
> exist with Ram and PMEM partitions.  First, gaps in the DPA space can
> exist between and around the DC partitions.  Second, the Linux resource
> tree does not allow a resource to be marked across existing nodes within
> a tree.
> 
> For clarity, below is an example of an 60GB device with 10GB of RAM,
> 10GB of PMEM and 10GB for each of 2 DC partitions.  The desired CXL
> mapping is 5GB of RAM, 5GB of PMEM, and 5GB of DC1.
> 
>      DPA RANGE
>      (dpa_res)
> 0GB        10GB       20GB       30GB       40GB       50GB       60GB
> |----------|----------|----------|----------|----------|----------|
> 
> RAM         PMEM                  DC0                   DC1
>  (ram_res)  (pmem_res)            (dc_res[0])           (dc_res[1])
> |----------|----------|   <gap>  |----------|   <gap>  |----------|
> 
>  RAM        PMEM                                        DC1
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|
> 0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB       60GB
> 
> The previous skip resource between RAM and PMEM was always a child of
> the RAM resource and fit nicely [see (S) below].  Because of this
> simplicity this skip resource reference was not stored in any CXL state.
> On release the skip range could be calculated based on the endpoint
> decoders stored values.
> 
> Now when DC1 is being mapped 4 skip resources must be created as
> children.  One for the PMEM resource (A), two of the parent DPA resource
> (B,D), and one more child of the DC0 resource (C).
> 
> 0GB        10GB       20GB       30GB       40GB       50GB       60GB
> |----------|----------|----------|----------|----------|----------|
>                            |                     |
> |----------|----------|    |     |----------|    |     |----------|
>         |          |       |          |          |
>        (S)        (A)     (B)        (C)        (D)
> 	v          v       v          v          v
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|
>        skip       skip  skip        skip      skip
> 
> Expand the calculation of DPA free space and enhance the logic to
> support this more complex skipping.  To track the potential of multiple
> skip resources an xarray is attached to the endpoint decoder.  The
> existing algorithm between RAM and PMEM is consolidated within the new
> one to streamline the code even though the result is the storage of a
> single skip resource in the xarray.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes:
> [Jonathan: Use an example only mapping 1/2 of DC1]
> [iweiny: Update cover letter]
> [iweiny: Fix 0day bugs
> 	https://lore.kernel.org/all/202408090138.RB41yBE8-lkp@intel.com/
> [djbw/Jonathan: allow more than 1 region per DC partition]
> ---
>  drivers/cxl/core/hdm.c  | 196 ++++++++++++++++++++++++++++++++++++++++++++----
>  drivers/cxl/core/port.c |   2 +
>  drivers/cxl/cxl.h       |   2 +
>  3 files changed, 184 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 3df10517a327..b4a517c6d283 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -223,6 +223,25 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, CXL);
>  
> +static void cxl_skip_release(struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
> +	struct cxl_port *port = cxled_to_port(cxled);
> +	struct device *dev = &port->dev;
> +	unsigned long index;
> +	void *entry;
> +
> +	xa_for_each(&cxled->skip_res, index, entry) {
> +		struct resource *res = entry;
> +
> +		dev_dbg(dev, "decoder%d.%d: releasing skipped space; %pr\n",
> +			port->id, cxled->cxld.id, res);
> +		__release_region(&cxlds->dpa_res, res->start,
> +				 resource_size(res));
> +		xa_erase(&cxled->skip_res, index);
> +	}
> +}
> +
>  /*
>   * Must be called in a context that synchronizes against this decoder's
>   * port ->remove() callback (like an endpoint decoder sysfs attribute)
> @@ -233,15 +252,11 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	struct cxl_port *port = cxled_to_port(cxled);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct resource *res = cxled->dpa_res;
> -	resource_size_t skip_start;
>  
>  	lockdep_assert_held_write(&cxl_dpa_rwsem);
>  
> -	/* save @skip_start, before @res is released */
> -	skip_start = res->start - cxled->skip;
>  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> -	if (cxled->skip)
> -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> +	cxl_skip_release(cxled);
>  	cxled->skip = 0;
>  	cxled->dpa_res = NULL;
>  	put_device(&cxled->cxld.dev);
> @@ -268,6 +283,105 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	__cxl_dpa_release(cxled);
>  }
>  
> +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> +{
> +	return mode - CXL_DECODER_DC0;
> +}
> +
> +static int cxl_request_skip(struct cxl_endpoint_decoder *cxled,
> +			    resource_size_t skip_base, resource_size_t skip_len)
> +{
> +	struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
> +	const char *name = dev_name(&cxled->cxld.dev);
> +	struct cxl_port *port = cxled_to_port(cxled);
> +	struct resource *dpa_res = &cxlds->dpa_res;
> +	struct device *dev = &port->dev;
> +	struct resource *res;
> +	int rc;
> +
> +	res = __request_region(dpa_res, skip_base, skip_len, name, 0);
> +	if (!res)
> +		return -EBUSY;
> +
> +	rc = xa_insert(&cxled->skip_res, skip_base, res, GFP_KERNEL);

Maybe rename skip_res to skip_xa, given most of the vars in CXL with _res are 'struct resource' to avoid confusion. See 'dpa_res' above.

> +	if (rc) {
> +		__release_region(dpa_res, skip_base, skip_len);
> +		return rc;
> +	}
> +
> +	dev_dbg(dev, "decoder%d.%d: skipped space; %pr\n",
> +		port->id, cxled->cxld.id, res);
> +	return 0;
> +}
> +
> +static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
> +				resource_size_t base, resource_size_t skipped)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_port *port = cxled_to_port(cxled);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	resource_size_t skip_base = base - skipped;
> +	struct device *dev = &port->dev;
> +	resource_size_t skip_len = 0;
> +	int rc, index;
> +
> +	if (resource_size(&cxlds->ram_res) && skip_base <= cxlds->ram_res.end) {
> +		skip_len = cxlds->ram_res.end - skip_base + 1;
> +		rc = cxl_request_skip(cxled, skip_base, skip_len);
> +		if (rc)
> +			return rc;
> +		skip_base += skip_len;
> +	}
> +
> +	if (skip_base == base) {
> +		dev_dbg(dev, "skip done ram!\n");
> +		return 0;
> +	}
> +
> +	if (resource_size(&cxlds->pmem_res) &&
> +	    skip_base <= cxlds->pmem_res.end) {
> +		skip_len = cxlds->pmem_res.end - skip_base + 1;
> +		rc = cxl_request_skip(cxled, skip_base, skip_len);
> +		if (rc)
> +			return rc;
> +		skip_base += skip_len;
> +	}

Does 'skip_base == base' need to be checked here again before going to DCD?

DJ

> +
> +	index = dc_mode_to_region_index(cxled->mode);
> +	for (int i = 0; i <= index; i++) {
> +		struct resource *dcr = &cxlds->dc_res[i];
> +
> +		if (skip_base < dcr->start) {
> +			skip_len = dcr->start - skip_base;
> +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> +			if (rc)
> +				return rc;
> +			skip_base += skip_len;
> +		}
> +
> +		if (skip_base == base) {
> +			dev_dbg(dev, "skip done DC region %d!\n", i);
> +			break;
> +		}
> +
> +		if (resource_size(dcr) && skip_base <= dcr->end) {
> +			if (skip_base > base) {
> +				dev_err(dev, "Skip error DC region %d; skip_base %pa; base %pa\n",
> +					i, &skip_base, &base);
> +				return -ENXIO;
> +			}
> +
> +			skip_len = dcr->end - skip_base + 1;
> +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> +			if (rc)
> +				return rc;
> +			skip_base += skip_len;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  			     resource_size_t base, resource_size_t len,
>  			     resource_size_t skipped)
> @@ -305,13 +419,12 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	}
>  
>  	if (skipped) {
> -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> -				       dev_name(&cxled->cxld.dev), 0);
> -		if (!res) {
> -			dev_dbg(dev,
> -				"decoder%d.%d: failed to reserve skipped space\n",
> -				port->id, cxled->cxld.id);
> -			return -EBUSY;
> +		int rc = cxl_reserve_dpa_skip(cxled, base, skipped);
> +
> +		if (rc) {
> +			dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space; %pa - %pa\n",
> +				port->id, cxled->cxld.id, &base, &skipped);
> +			return rc;
>  		}
>  	}
>  	res = __request_region(&cxlds->dpa_res, base, len,
> @@ -319,14 +432,20 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	if (!res) {
>  		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
>  			port->id, cxled->cxld.id);
> -		if (skipped)
> -			__release_region(&cxlds->dpa_res, base - skipped,
> -					 skipped);
> +		cxl_skip_release(cxled);
>  		return -EBUSY;
>  	}
>  	cxled->dpa_res = res;
>  	cxled->skip = skipped;
>  
> +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> +		int index = dc_mode_to_region_index(mode);
> +
> +		if (resource_contains(&cxlds->dc_res[index], res)) {
> +			cxled->mode = mode;
> +			goto success;
> +		}
> +	}
>  	if (resource_contains(&cxlds->pmem_res, res))
>  		cxled->mode = CXL_DECODER_PMEM;
>  	else if (resource_contains(&cxlds->ram_res, res))
> @@ -337,6 +456,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  		cxled->mode = CXL_DECODER_MIXED;
>  	}
>  
> +success:
> +	dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id, cxled->cxld.id,
> +		cxled->dpa_res, cxled->mode);
>  	port->hdm_end++;
>  	get_device(&cxled->cxld.dev);
>  	return 0;
> @@ -466,8 +588,8 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  
>  int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  {
> -	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
>  	resource_size_t free_ram_start, free_pmem_start;
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
>  	struct cxl_port *port = cxled_to_port(cxled);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct device *dev = &cxled->cxld.dev;
> @@ -524,12 +646,54 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  		else
>  			skip_end = start - 1;
>  		skip = skip_end - skip_start + 1;
> +	} else if (cxl_decoder_mode_is_dc(cxled->mode)) {
> +		int dc_index = dc_mode_to_region_index(cxled->mode);
> +
> +		for (p = cxlds->dc_res[dc_index].child, last = NULL; p; p = p->sibling)
> +			last = p;
> +
> +		if (last) {
> +			/*
> +			 * Some capacity in this DC partition is already allocated,
> +			 * that allocation already handled the skip.
> +			 */
> +			start = last->end + 1;
> +			skip = 0;
> +		} else {
> +			/* Calculate skip */
> +			resource_size_t skip_start, skip_end;
> +
> +			start = cxlds->dc_res[dc_index].start;
> +
> +			if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> +				skip_start = free_ram_start;
> +			else
> +				skip_start = free_pmem_start;
> +			/*
> +			 * If any dc region is already mapped, then that allocation
> +			 * already handled the RAM and PMEM skip.  Check for DC region
> +			 * skip.
> +			 */
> +			for (int i = dc_index - 1; i >= 0 ; i--) {
> +				if (cxlds->dc_res[i].child) {
> +					skip_start = cxlds->dc_res[i].child->end + 1;
> +					break;
> +				}
> +			}
> +
> +			skip_end = start - 1;
> +			skip = skip_end - skip_start + 1;
> +		}
> +		avail = cxlds->dc_res[dc_index].end - start + 1;
>  	} else {
>  		dev_dbg(dev, "mode not set\n");
>  		rc = -EINVAL;
>  		goto out;
>  	}
>  
> +	dev_dbg(dev, "DPA Allocation start: %pa len: %#llx Skip: %pa\n",
> +		&start, size, &skip);
> +
>  	if (size > avail) {
>  		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
>  			cxl_decoder_mode_name(cxled->mode), &avail);
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 1d5007e3795a..8054cbaac9f6 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -419,6 +419,7 @@ static void cxl_endpoint_decoder_release(struct device *dev)
>  	struct cxl_endpoint_decoder *cxled = to_cxl_endpoint_decoder(dev);
>  
>  	__cxl_decoder_release(&cxled->cxld);
> +	xa_destroy(&cxled->skip_res);
>  	kfree(cxled);
>  }
>  
> @@ -1899,6 +1900,7 @@ struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
>  		return ERR_PTR(-ENOMEM);
>  
>  	cxled->pos = -1;
> +	xa_init(&cxled->skip_res);
>  	cxld = &cxled->cxld;
>  	rc = cxl_decoder_init(port, cxld);
>  	if (rc)	 {
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index d2674ab46f35..53b666ef4097 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -446,6 +446,7 @@ enum cxl_decoder_state {
>   * @cxld: base cxl_decoder_object
>   * @dpa_res: actively claimed DPA span of this decoder
>   * @skip: offset into @dpa_res where @cxld.hpa_range maps
> + * @skip_res: array of skipped resources from the previous decoder end
>   * @mode: which memory type / access-mode-partition this decoder targets
>   * @state: autodiscovery state
>   * @pos: interleave position in @cxld.region
> @@ -454,6 +455,7 @@ struct cxl_endpoint_decoder {
>  	struct cxl_decoder cxld;
>  	struct resource *dpa_res;
>  	resource_size_t skip;
> +	struct xarray skip_res;
>  	enum cxl_decoder_mode mode;
>  	enum cxl_decoder_state state;
>  	int pos;
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/25] cxl/port: Add endpoint decoder DC mode support to sysfs
  2024-08-16 14:44 ` [PATCH v3 10/25] cxl/port: Add endpoint decoder DC mode support to sysfs ira.weiny
@ 2024-08-16 23:17   ` Dave Jiang
  2024-08-23 16:12   ` Jonathan Cameron
  1 sibling, 0 replies; 120+ messages in thread
From: Dave Jiang @ 2024-08-16 23:17 UTC (permalink / raw)
  To: ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Endpoint decoder mode is used to represent the partition the decoder
> points to such as ram or pmem.
> 
> Expand the mode to allow a decoder to point to a specific DC partition
> (Region).
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> ---
> Changes:
> [Fan: change mode range logic]
> [Fan: use !resource_size()]
> [djiang: use the static mode name string array in mode_store()]
> [Jonathan: remove rc check from mode to region index]
> [Jonathan: clarify decoder mode 'mixed']
> [djbw: drop cleanup patch and just follow the convention in cxl_dpa_set_mode()]
> [fan: make dcd resource size check similar to other partitions]
> [djbw, jonathan, fan: remove mode range check from dc_mode_to_region_index]
> [iweiny: push sysfs versions to 6.12]
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl | 21 ++++++++++----------
>  drivers/cxl/core/hdm.c                  | 10 ++++++++++
>  drivers/cxl/core/port.c                 | 10 +++++-----
>  drivers/cxl/cxl.h                       | 35 ++++++++++++++++++---------------
>  4 files changed, 45 insertions(+), 31 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 3f5627a1210a..957717264709 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -316,23 +316,24 @@ Description:
>  
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/mode
> -Date:		May, 2022
> -KernelVersion:	v6.0
> +Date:		May, 2022, October 2024
> +KernelVersion:	v6.0, v6.12 (dcY)
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
>  		translates from a host physical address range, to a device local
>  		address range. Device-local address ranges are further split
> -		into a 'ram' (volatile memory) range and 'pmem' (persistent
> -		memory) range. The 'mode' attribute emits one of 'ram', 'pmem',
> -		'mixed', or 'none'. The 'mixed' indication is for error cases
> -		when a decoder straddles the volatile/persistent partition
> -		boundary, and 'none' indicates the decoder is not actively
> -		decoding, or no DPA allocation policy has been set.
> +		into a 'ram' (volatile memory) range, 'pmem' (persistent
> +		memory) range, or Dynamic Capacity (DC) range. The 'mode'
> +		attribute emits one of 'ram', 'pmem', 'dcY', 'mixed', or
> +		'none'. The 'mixed' indication is for error cases when a
> +		decoder straddles partition boundaries, and 'none' indicates
> +		the decoder is not actively decoding, or no DPA allocation
> +		policy has been set.
>  
>  		'mode' can be written, when the decoder is in the 'disabled'
> -		state, with either 'ram' or 'pmem' to set the boundaries for the
> -		next allocation.
> +		state, with 'ram', 'pmem', or 'dcY' to set the boundaries for
> +		the next allocation.
>  
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/dpa_resource
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index b4a517c6d283..ceca0b3d3e5c 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -551,6 +551,7 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  	switch (mode) {
>  	case CXL_DECODER_RAM:
>  	case CXL_DECODER_PMEM:
> +	case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
>  		break;
>  	default:
>  		dev_dbg(dev, "unsupported mode: %d\n", mode);
> @@ -578,6 +579,15 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  		goto out;
>  	}
>  
> +	if (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7) {
> +		rc = dc_mode_to_region_index(mode);
> +		if (!resource_size(&cxlds->dc_res[rc])) {
> +			dev_dbg(dev, "no available dynamic capacity\n");
> +			rc = -ENXIO;
> +			goto out;
> +		}
> +	}
> +
>  	cxled->mode = mode;
>  	rc = 0;
>  out:
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 8054cbaac9f6..222aa0aeeef7 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -205,11 +205,11 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
>  	enum cxl_decoder_mode mode;
>  	ssize_t rc;
>  
> -	if (sysfs_streq(buf, "pmem"))
> -		mode = CXL_DECODER_PMEM;
> -	else if (sysfs_streq(buf, "ram"))
> -		mode = CXL_DECODER_RAM;
> -	else
> +	for (mode = CXL_DECODER_RAM; mode < CXL_DECODER_MIXED; mode++)
> +		if (sysfs_streq(buf, cxl_decoder_mode_names[mode]))
> +			break;
> +
> +	if (mode >= CXL_DECODER_MIXED)
>  		return -EINVAL;
>  
>  	rc = cxl_dpa_set_mode(cxled, mode);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 53b666ef4097..16861c867537 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -365,6 +365,9 @@ struct cxl_decoder {
>  /*
>   * CXL_DECODER_DEAD prevents endpoints from being reattached to regions
>   * while cxld_unregister() is running
> + *
> + * NOTE: CXL_DECODER_RAM must be second and CXL_DECODER_MIXED must be last.
> + *	 See mode_store()
>   */
>  enum cxl_decoder_mode {
>  	CXL_DECODER_NONE,
> @@ -382,25 +385,25 @@ enum cxl_decoder_mode {
>  	CXL_DECODER_DEAD,
>  };
>  
> +static const char * const cxl_decoder_mode_names[] = {
> +	[CXL_DECODER_NONE] = "none",
> +	[CXL_DECODER_RAM] = "ram",
> +	[CXL_DECODER_PMEM] = "pmem",
> +	[CXL_DECODER_DC0] = "dc0",
> +	[CXL_DECODER_DC1] = "dc1",
> +	[CXL_DECODER_DC2] = "dc2",
> +	[CXL_DECODER_DC3] = "dc3",
> +	[CXL_DECODER_DC4] = "dc4",
> +	[CXL_DECODER_DC5] = "dc5",
> +	[CXL_DECODER_DC6] = "dc6",
> +	[CXL_DECODER_DC7] = "dc7",
> +	[CXL_DECODER_MIXED] = "mixed",
> +};
> +
>  static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
>  {
> -	static const char * const names[] = {
> -		[CXL_DECODER_NONE] = "none",
> -		[CXL_DECODER_RAM] = "ram",
> -		[CXL_DECODER_PMEM] = "pmem",
> -		[CXL_DECODER_DC0] = "dc0",
> -		[CXL_DECODER_DC1] = "dc1",
> -		[CXL_DECODER_DC2] = "dc2",
> -		[CXL_DECODER_DC3] = "dc3",
> -		[CXL_DECODER_DC4] = "dc4",
> -		[CXL_DECODER_DC5] = "dc5",
> -		[CXL_DECODER_DC6] = "dc6",
> -		[CXL_DECODER_DC7] = "dc7",
> -		[CXL_DECODER_MIXED] = "mixed",
> -	};
> -
>  	if (mode >= CXL_DECODER_NONE && mode <= CXL_DECODER_MIXED)
> -		return names[mode];
> +		return cxl_decoder_mode_names[mode];
>  	return "mixed";
>  }
>  
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 11/25] cxl/mem: Expose DCD partition capabilities in sysfs
  2024-08-16 14:44 ` [PATCH v3 11/25] cxl/mem: Expose DCD partition capabilities in sysfs ira.weiny
@ 2024-08-16 23:42   ` Dave Jiang
  2024-08-23  2:28     ` Ira Weiny
  0 siblings, 1 reply; 120+ messages in thread
From: Dave Jiang @ 2024-08-16 23:42 UTC (permalink / raw)
  To: ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> To properly configure CXL regions on Dynamic Capacity Devices (DCD),
> user space will need to know the details of the DC partitions available.
> 
> Expose dynamic capacity capabilities through sysfs.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes:
> [iweiny: remove review tags]
> [Davidlohr/Fan/Jonathan: omit 'dc' attribute directory if device is not DC]
> [Jonathan: update documentation for dc visibility]
> [Jonathan: Add a comment to DC region X attributes to ensure visibility checks work]
> [iweiny: push sysfs version to 6.12]
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl | 12 ++++
>  drivers/cxl/core/memdev.c               | 97 +++++++++++++++++++++++++++++++++
>  2 files changed, 109 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 957717264709..6227ae0ab3fc 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -54,6 +54,18 @@ Description:
>  		identically named field in the Identify Memory Device Output
>  		Payload in the CXL-2.0 specification.
>  
> +What:		/sys/bus/cxl/devices/memX/dc/region_count
> +		/sys/bus/cxl/devices/memX/dc/regionY_size

Just make it into 2 separate entries?

DJ
> +Date:		August, 2024
> +KernelVersion:	v6.12
> +Contact:	linux-cxl@vger.kernel.org
> +Description:
> +		(RO) Dynamic Capacity (DC) region information.  The dc
> +		directory is only visible on devices which support Dynamic
> +		Capacity.
> +		The region_count is the number of Dynamic Capacity (DC)
> +		partitions (regions) supported on the device.
> +		regionY_size is the size of each of those partitions.
>  
>  What:		/sys/bus/cxl/devices/memX/pmem/qos_class
>  Date:		May, 2023
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 0277726afd04..7da1f0f5711a 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -101,6 +101,18 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
>  static struct device_attribute dev_attr_pmem_size =
>  	__ATTR(size, 0444, pmem_size_show, NULL);
>  
> +static ssize_t region_count_show(struct device *dev, struct device_attribute *attr,
> +				 char *buf)
> +{
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +	return sysfs_emit(buf, "%d\n", mds->nr_dc_region);
> +}
> +
> +static struct device_attribute dev_attr_region_count =
> +	__ATTR(region_count, 0444, region_count_show, NULL);
> +
>  static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -448,6 +460,90 @@ static struct attribute *cxl_memdev_security_attributes[] = {
>  	NULL,
>  };
>  
> +static ssize_t show_size_regionN(struct cxl_memdev *cxlmd, char *buf, int pos)
> +{
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +	return sysfs_emit(buf, "%#llx\n", mds->dc_region[pos].decode_len);
> +}
> +
> +#define REGION_SIZE_ATTR_RO(n)						\
> +static ssize_t region##n##_size_show(struct device *dev,		\
> +				     struct device_attribute *attr,	\
> +				     char *buf)				\
> +{									\
> +	return show_size_regionN(to_cxl_memdev(dev), buf, (n));		\
> +}									\
> +static DEVICE_ATTR_RO(region##n##_size)
> +REGION_SIZE_ATTR_RO(0);
> +REGION_SIZE_ATTR_RO(1);
> +REGION_SIZE_ATTR_RO(2);
> +REGION_SIZE_ATTR_RO(3);
> +REGION_SIZE_ATTR_RO(4);
> +REGION_SIZE_ATTR_RO(5);
> +REGION_SIZE_ATTR_RO(6);
> +REGION_SIZE_ATTR_RO(7);
> +
> +/*
> + * RegionX attributes must be listed in order and first in this array to
> + * support the visbility checks.
> + */
> +static struct attribute *cxl_memdev_dc_attributes[] = {
> +	&dev_attr_region0_size.attr,
> +	&dev_attr_region1_size.attr,
> +	&dev_attr_region2_size.attr,
> +	&dev_attr_region3_size.attr,
> +	&dev_attr_region4_size.attr,
> +	&dev_attr_region5_size.attr,
> +	&dev_attr_region6_size.attr,
> +	&dev_attr_region7_size.attr,
> +	&dev_attr_region_count.attr,
> +	NULL,
> +};
> +
> +static umode_t cxl_memdev_dc_attr_visible(struct kobject *kobj, struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +	/* Not a memory device */
> +	if (!mds)
> +		return 0;
> +
> +	if (a == &dev_attr_region_count.attr)
> +		return a->mode;
> +
> +	/*
> +	 * Show only the regions supported, regionX attributes are first in the
> +	 * list
> +	 */
> +	if (n < mds->nr_dc_region)
> +		return a->mode;
> +
> +	return 0;
> +}
> +
> +static bool cxl_memdev_dc_group_visible(struct kobject *kobj)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> +	/* No DC regions */
> +	if (!mds || mds->nr_dc_region == 0)
> +		return false;
> +	return true;
> +}
> +
> +DEFINE_SYSFS_GROUP_VISIBLE(cxl_memdev_dc);
> +
> +static struct attribute_group cxl_memdev_dc_group = {
> +	.name = "dc",
> +	.attrs = cxl_memdev_dc_attributes,
> +	.is_visible = SYSFS_GROUP_VISIBLE(cxl_memdev_dc),
> +};
> +
>  static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
>  				  int n)
>  {
> @@ -528,6 +624,7 @@ static const struct attribute_group *cxl_memdev_attribute_groups[] = {
>  	&cxl_memdev_ram_attribute_group,
>  	&cxl_memdev_pmem_attribute_group,
>  	&cxl_memdev_security_attribute_group,
> +	&cxl_memdev_dc_group,
>  	NULL,
>  };
>  
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 12/25] cxl/region: Refactor common create region code
  2024-08-16 14:44 ` [PATCH v3 12/25] cxl/region: Refactor common create region code Ira Weiny
@ 2024-08-16 23:43   ` Dave Jiang
  2024-08-22 18:51   ` Fan Ni
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 120+ messages in thread
From: Dave Jiang @ 2024-08-16 23:43 UTC (permalink / raw)
  To: Ira Weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, Ira Weiny wrote:
> create_pmem_region_store() and create_ram_region_store() are identical
> with the exception of the region mode.  With the addition of DC region
> mode this would end up being 3 copies of the same code.
> 
> Refactor create_pmem_region_store() and create_ram_region_store() to use
> a single common function to be used in subsequent DC code.
> 
> Suggested-by: Fan Ni <fan.ni@samsung.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>  drivers/cxl/core/region.c | 28 +++++++++++-----------------
>  1 file changed, 11 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 650fe33f2ed4..f85b26b39b2f 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2553,9 +2553,8 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
>  	return devm_cxl_add_region(cxlrd, id, mode, CXL_DECODER_HOSTONLYMEM);
>  }
>  
> -static ssize_t create_pmem_region_store(struct device *dev,
> -					struct device_attribute *attr,
> -					const char *buf, size_t len)
> +static ssize_t create_region_store(struct device *dev, const char *buf,
> +				   size_t len, enum cxl_region_mode mode)
>  {
>  	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
>  	struct cxl_region *cxlr;
> @@ -2565,31 +2564,26 @@ static ssize_t create_pmem_region_store(struct device *dev,
>  	if (rc != 1)
>  		return -EINVAL;
>  
> -	cxlr = __create_region(cxlrd, CXL_REGION_PMEM, id);
> +	cxlr = __create_region(cxlrd, mode, id);
>  	if (IS_ERR(cxlr))
>  		return PTR_ERR(cxlr);
>  
>  	return len;
>  }
> +
> +static ssize_t create_pmem_region_store(struct device *dev,
> +					struct device_attribute *attr,
> +					const char *buf, size_t len)
> +{
> +	return create_region_store(dev, buf, len, CXL_REGION_PMEM);
> +}
>  DEVICE_ATTR_RW(create_pmem_region);
>  
>  static ssize_t create_ram_region_store(struct device *dev,
>  				       struct device_attribute *attr,
>  				       const char *buf, size_t len)
>  {
> -	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
> -	struct cxl_region *cxlr;
> -	int rc, id;
> -
> -	rc = sscanf(buf, "region%d\n", &id);
> -	if (rc != 1)
> -		return -EINVAL;
> -
> -	cxlr = __create_region(cxlrd, CXL_REGION_RAM, id);
> -	if (IS_ERR(cxlr))
> -		return PTR_ERR(cxlr);
> -
> -	return len;
> +	return create_region_store(dev, buf, len, CXL_REGION_RAM);
>  }
>  DEVICE_ATTR_RW(create_ram_region);
>  
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 13/25] cxl/region: Add sparse DAX region support
  2024-08-16 14:44 ` [PATCH v3 13/25] cxl/region: Add sparse DAX region support ira.weiny
@ 2024-08-16 23:51   ` Dave Jiang
  2024-08-22 18:50   ` Fan Ni
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 120+ messages in thread
From: Dave Jiang @ 2024-08-16 23:51 UTC (permalink / raw)
  To: ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Dynamic Capacity CXL regions must allow memory to be added or removed
> dynamically.  In addition to the quantity of memory available the
> location of the memory within a DC partition is dynamic based on the
> extents offered by a device.  CXL DAX regions must accommodate the
> sparseness of this memory in the management of DAX regions and devices.
> 
> Introduce the concept of a sparse DAX region.  Add a create_dc_region()
> sysfs entry to create such regions.  Special case DC capable regions to
> create a 0 sized seed DAX device to maintain compatibility which
> requires a default DAX device to hold a region reference.
> 
> Indicate 0 byte available capacity until such time that capacity is
> added.
> 
> Sparse regions complicate the range mapping of dax devices.  There is no
> known use case for range mapping on sparse regions.  Avoid the
> complication by preventing range mapping of dax devices on sparse
> regions.
> 
> Interleaving is deferred for now.  Add checks.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> 
> ---
> Changes:
> [Fan: use single function for dc region store]
> [djiang: avoid setting dev_size twice]
> [djbw: Check DCD support and interleave restriction on region creation]
> [iweiny: squash patch : dax/region: Prevent range mapping allocation on sparse regions]
> [iwieny: remove reviews]
> [iweiny: rebase to master]
> [iweiny: push sysfs version to 6.12]
> [iweiny: make cxled_to_mds inline]
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl | 22 ++++++++--------
>  drivers/cxl/core/core.h                 | 12 +++++++++
>  drivers/cxl/core/port.c                 |  1 +
>  drivers/cxl/core/region.c               | 46 +++++++++++++++++++++++++++++++--
>  drivers/dax/bus.c                       | 10 +++++++
>  drivers/dax/bus.h                       |  1 +
>  drivers/dax/cxl.c                       | 16 ++++++++++--
>  7 files changed, 93 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 6227ae0ab3fc..3a5ee88e551b 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -406,20 +406,20 @@ Description:
>  		interleave_granularity).
>  
>  
> -What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> -Date:		May, 2022, January, 2023
> -KernelVersion:	v6.0 (pmem), v6.3 (ram)
> +What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dc}_region
> +Date:		May, 2022, January, 2023, August 2024
> +KernelVersion:	v6.0 (pmem), v6.3 (ram), v6.12 (dc)
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Write a string in the form 'regionZ' to start the process
> -		of defining a new persistent, or volatile memory region
> -		(interleave-set) within the decode range bounded by root decoder
> -		'decoderX.Y'. The value written must match the current value
> -		returned from reading this attribute. An atomic compare exchange
> -		operation is done on write to assign the requested id to a
> -		region and allocate the region-id for the next creation attempt.
> -		EBUSY is returned if the region name written does not match the
> -		current cached value.
> +		of defining a new persistent, volatile, or Dynamic Capacity
> +		(DC) memory region (interleave-set) within the decode range
> +		bounded by root decoder 'decoderX.Y'. The value written must
> +		match the current value returned from reading this attribute.
> +		An atomic compare exchange operation is done on write to assign
> +		the requested id to a region and allocate the region-id for the
> +		next creation attempt.  EBUSY is returned if the region name
> +		written does not match the current cached value.
>  
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 72a506c9dbd0..15b6cf1c19ef 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -4,15 +4,27 @@
>  #ifndef __CXL_CORE_H__
>  #define __CXL_CORE_H__
>  
> +#include <cxlmem.h>
> +
>  extern const struct device_type cxl_nvdimm_bridge_type;
>  extern const struct device_type cxl_nvdimm_type;
>  extern const struct device_type cxl_pmu_type;
>  
>  extern struct attribute_group cxl_base_attribute_group;
>  
> +static inline struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> +	return container_of(cxlds, struct cxl_memdev_state, cxlds);
> +}
> +
>  #ifdef CONFIG_CXL_REGION
>  extern struct device_attribute dev_attr_create_pmem_region;
>  extern struct device_attribute dev_attr_create_ram_region;
> +extern struct device_attribute dev_attr_create_dc_region;
>  extern struct device_attribute dev_attr_delete_region;
>  extern struct device_attribute dev_attr_region;
>  extern const struct device_type cxl_pmem_region_type;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 222aa0aeeef7..44e1e203173d 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -320,6 +320,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>  	&dev_attr_qos_class.attr,
>  	SET_CXL_REGION_ATTR(create_pmem_region)
>  	SET_CXL_REGION_ATTR(create_ram_region)
> +	SET_CXL_REGION_ATTR(create_dc_region)
>  	SET_CXL_REGION_ATTR(delete_region)
>  	NULL,
>  };
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index f85b26b39b2f..35c4a1f4f9bd 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -496,6 +496,11 @@ static ssize_t interleave_ways_store(struct device *dev,
>  	if (rc)
>  		return rc;
>  
> +	if (cxlr->mode == CXL_REGION_DC && val != 1) {
> +		dev_err(dev, "Interleaving and DCD not supported\n");
> +		return -EINVAL;
> +	}
> +
>  	rc = ways_to_eiw(val, &iw);
>  	if (rc)
>  		return rc;
> @@ -2174,6 +2179,7 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
>  	if (sysfs_streq(buf, "\n"))
>  		rc = detach_target(cxlr, pos);
>  	else {
> +		struct cxl_endpoint_decoder *cxled;
>  		struct device *dev;
>  
>  		dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> @@ -2185,8 +2191,13 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
>  			goto out;
>  		}
>  
> -		rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
> -				   TASK_INTERRUPTIBLE);
> +		cxled = to_cxl_endpoint_decoder(dev);
> +		if (cxlr->mode == CXL_REGION_DC &&
> +		    !cxl_dcd_supported(cxled_to_mds(cxled))) {
> +			dev_dbg(dev, "DCD unsupported\n");
> +			return -EINVAL;
> +		}
> +		rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
>  out:
>  		put_device(dev);
>  	}
> @@ -2534,6 +2545,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
>  	switch (mode) {
>  	case CXL_REGION_RAM:
>  	case CXL_REGION_PMEM:
> +	case CXL_REGION_DC:
>  		break;
>  	default:
>  		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
> @@ -2587,6 +2599,20 @@ static ssize_t create_ram_region_store(struct device *dev,
>  }
>  DEVICE_ATTR_RW(create_ram_region);
>  
> +static ssize_t create_dc_region_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dc_region_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t len)
> +{
> +	return create_region_store(dev, buf, len, CXL_REGION_DC);
> +}
> +DEVICE_ATTR_RW(create_dc_region);
> +
>  static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -3168,6 +3194,11 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>  	struct device *dev;
>  	int rc;
>  
> +	if (cxlr->mode == CXL_REGION_DC && cxlr->params.interleave_ways != 1) {
> +		dev_err(&cxlr->dev, "Interleaving DC not supported\n");
> +		return -EINVAL;
> +	}
> +
>  	cxlr_dax = cxl_dax_region_alloc(cxlr);
>  	if (IS_ERR(cxlr_dax))
>  		return PTR_ERR(cxlr_dax);
> @@ -3260,6 +3291,16 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>  		return ERR_PTR(-EINVAL);
>  
>  	mode = cxl_decoder_to_region_mode(cxled->mode);
> +	if (mode == CXL_REGION_DC) {
> +		if (!cxl_dcd_supported(cxled_to_mds(cxled))) {
> +			dev_err(&cxled->cxld.dev, "DCD unsupported\n");
> +			return ERR_PTR(-EINVAL);
> +		}
> +		if (cxled->cxld.interleave_ways != 1) {
> +			dev_err(&cxled->cxld.dev, "Interleaving and DCD not supported\n");
> +			return ERR_PTR(-EINVAL);
> +		}
> +	}
>  	do {
>  		cxlr = __create_region(cxlrd, mode,
>  				       atomic_read(&cxlrd->region_id));
> @@ -3467,6 +3508,7 @@ static int cxl_region_probe(struct device *dev)
>  	case CXL_REGION_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
>  	case CXL_REGION_RAM:
> +	case CXL_REGION_DC:
>  		/*
>  		 * The region can not be manged by CXL if any portion of
>  		 * it is already online as 'System RAM'
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index fde29e0ad68b..d8cb5195a227 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -178,6 +178,11 @@ static bool is_static(struct dax_region *dax_region)
>  	return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
>  }
>  
> +static bool is_sparse(struct dax_region *dax_region)
> +{
> +	return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
> +}
> +
>  bool static_dev_dax(struct dev_dax *dev_dax)
>  {
>  	return is_static(dev_dax->region);
> @@ -301,6 +306,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
>  
>  	lockdep_assert_held(&dax_region_rwsem);
>  
> +	if (is_sparse(dax_region))
> +		return 0;
> +
>  	for_each_dax_region_resource(dax_region, res)
>  		size -= resource_size(res);
>  	return size;
> @@ -1373,6 +1381,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
>  		return 0;
>  	if (a == &dev_attr_mapping.attr && is_static(dax_region))
>  		return 0;
> +	if (a == &dev_attr_mapping.attr && is_sparse(dax_region))
> +		return 0;
>  	if ((a == &dev_attr_align.attr ||
>  	     a == &dev_attr_size.attr) && is_static(dax_region))
>  		return 0444;
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index cbbf64443098..783bfeef42cc 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -13,6 +13,7 @@ struct dax_region;
>  /* dax bus specific ioresource flags */
>  #define IORESOURCE_DAX_STATIC BIT(0)
>  #define IORESOURCE_DAX_KMEM BIT(1)
> +#define IORESOURCE_DAX_SPARSE_CAP BIT(2)
>  
>  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>  		struct range *range, int target_node, unsigned int align,
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 9b29e732b39a..367e86b1c22a 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -13,19 +13,31 @@ static int cxl_dax_region_probe(struct device *dev)
>  	struct cxl_region *cxlr = cxlr_dax->cxlr;
>  	struct dax_region *dax_region;
>  	struct dev_dax_data data;
> +	resource_size_t dev_size;
> +	unsigned long flags;
>  
>  	if (nid == NUMA_NO_NODE)
>  		nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
>  
> +	flags = IORESOURCE_DAX_KMEM;
> +	if (cxlr->mode == CXL_REGION_DC)
> +		flags |= IORESOURCE_DAX_SPARSE_CAP;
> +
>  	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> -				      PMD_SIZE, IORESOURCE_DAX_KMEM);
> +				      PMD_SIZE, flags);
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> +	if (cxlr->mode == CXL_REGION_DC)
> +		/* Add empty seed dax device */
> +		dev_size = 0;
> +	else
> +		dev_size = range_len(&cxlr_dax->hpa_range);
> +
>  	data = (struct dev_dax_data) {
>  		.dax_region = dax_region,
>  		.id = -1,
> -		.size = range_len(&cxlr_dax->hpa_range),
> +		.size = dev_size,
>  		.memmap_on_memory = true,
>  	};
>  
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 14/25] cxl/events: Split event msgnum configuration from irq setup
  2024-08-16 14:44 ` [PATCH v3 14/25] cxl/events: Split event msgnum configuration from irq setup Ira Weiny
@ 2024-08-16 23:57   ` Dave Jiang
  2024-08-22 21:39   ` Fan Ni
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 120+ messages in thread
From: Dave Jiang @ 2024-08-16 23:57 UTC (permalink / raw)
  To: Ira Weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, Ira Weiny wrote:
> Dynamic Capacity Devices (DCD) require event interrupts to process
> memory addition or removal.  BIOS may have control over non-DCD event
> processing.  DCD interrupt configuration needs to be separate from
> memory event interrupt configuration.
> 
> Split cxl_event_config_msgnums() from irq setup in preparation for
> separate DCD interrupts configuration.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/cxl/pci.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index f7f03599bc83..17bea49bbf4d 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -698,35 +698,31 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
>  	return cxl_event_get_int_policy(mds, policy);
>  }
>  
> -static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
> +static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
> +			      struct cxl_event_interrupt_policy *policy)
>  {
>  	struct cxl_dev_state *cxlds = &mds->cxlds;
> -	struct cxl_event_interrupt_policy policy;
>  	int rc;
>  
> -	rc = cxl_event_config_msgnums(mds, &policy);
> -	if (rc)
> -		return rc;
> -
> -	rc = cxl_event_req_irq(cxlds, policy.info_settings);
> +	rc = cxl_event_req_irq(cxlds, policy->info_settings);
>  	if (rc) {
>  		dev_err(cxlds->dev, "Failed to get interrupt for event Info log\n");
>  		return rc;
>  	}
>  
> -	rc = cxl_event_req_irq(cxlds, policy.warn_settings);
> +	rc = cxl_event_req_irq(cxlds, policy->warn_settings);
>  	if (rc) {
>  		dev_err(cxlds->dev, "Failed to get interrupt for event Warn log\n");
>  		return rc;
>  	}
>  
> -	rc = cxl_event_req_irq(cxlds, policy.failure_settings);
> +	rc = cxl_event_req_irq(cxlds, policy->failure_settings);
>  	if (rc) {
>  		dev_err(cxlds->dev, "Failed to get interrupt for event Failure log\n");
>  		return rc;
>  	}
>  
> -	rc = cxl_event_req_irq(cxlds, policy.fatal_settings);
> +	rc = cxl_event_req_irq(cxlds, policy->fatal_settings);
>  	if (rc) {
>  		dev_err(cxlds->dev, "Failed to get interrupt for event Fatal log\n");
>  		return rc;
> @@ -745,7 +741,7 @@ static bool cxl_event_int_is_fw(u8 setting)
>  static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  			    struct cxl_memdev_state *mds, bool irq_avail)
>  {
> -	struct cxl_event_interrupt_policy policy;
> +	struct cxl_event_interrupt_policy policy = { 0 };
>  	int rc;
>  
>  	/*
> @@ -773,11 +769,15 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  		return -EBUSY;
>  	}
>  
> +	rc = cxl_event_config_msgnums(mds, &policy);
> +	if (rc)
> +		return rc;
> +
>  	rc = cxl_mem_alloc_event_buf(mds);
>  	if (rc)
>  		return rc;
>  
> -	rc = cxl_event_irqsetup(mds);
> +	rc = cxl_event_irqsetup(mds, &policy);
>  	if (rc)
>  		return rc;
>  
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 16/25] cxl/mem: Configure dynamic capacity interrupts
  2024-08-16 14:44 ` [PATCH v3 16/25] cxl/mem: Configure dynamic capacity interrupts ira.weiny
@ 2024-08-17  0:02   ` Dave Jiang
  2024-08-23 17:08   ` Jonathan Cameron
  2024-09-03  7:09   ` Li, Ming4
  2 siblings, 0 replies; 120+ messages in thread
From: Dave Jiang @ 2024-08-17  0:02 UTC (permalink / raw)
  To: ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Dynamic Capacity Devices (DCD) support extent change notifications
> through the event log mechanism.  The interrupt mailbox commands were
> extended in CXL 3.1 to support these notifications.  Firmware can't
> configure DCD events to be FW controlled but can retain control of
> memory events.
> 
> Configure DCD event log interrupts on devices supporting dynamic
> capacity.  Disable DCD if interrupts are not supported.
> 
> Care is taken to preserve the interrupt policy set by the FW if FW first
> has been selected by the BIOS.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> 
> ---
> Changes:
> [iweiny: update commit message]
> [iweiny: rebase to upstream irq code]
> [iweiny: disable DCD if irqs not supported]
> [Jonathan: formatting fix]
> [Fan: add text to debug print]
> [djiang: make dcd helpers inline]
> ---
>  drivers/cxl/cxlmem.h |  2 ++
>  drivers/cxl/pci.c    | 72 +++++++++++++++++++++++++++++++++++++++++++---------
>  2 files changed, 62 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index b4eb8164d05d..d41bec5433db 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -225,7 +225,9 @@ struct cxl_event_interrupt_policy {
>  	u8 warn_settings;
>  	u8 failure_settings;
>  	u8 fatal_settings;
> +	u8 dcd_settings;
>  } __packed;
> +#define CXL_EVENT_INT_POLICY_BASE_SIZE 4 /* info, warn, failure, fatal */
>  
>  /**
>   * struct cxl_event_state - Event log driver state
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 370c74eae323..e5430c4e3a3b 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -669,22 +669,33 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
>  }
>  
>  static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
> -				    struct cxl_event_interrupt_policy *policy)
> +				    struct cxl_event_interrupt_policy *policy,
> +				    bool native_cxl)
>  {
> +	size_t size_in = CXL_EVENT_INT_POLICY_BASE_SIZE;
>  	struct cxl_mbox_cmd mbox_cmd;
>  	int rc;
>  
> -	*policy = (struct cxl_event_interrupt_policy) {
> -		.info_settings = CXL_INT_MSI_MSIX,
> -		.warn_settings = CXL_INT_MSI_MSIX,
> -		.failure_settings = CXL_INT_MSI_MSIX,
> -		.fatal_settings = CXL_INT_MSI_MSIX,
> -	};
> +	/* memory event policy is left if FW has control */
> +	if (native_cxl) {
> +		*policy = (struct cxl_event_interrupt_policy) {
> +			.info_settings = CXL_INT_MSI_MSIX,
> +			.warn_settings = CXL_INT_MSI_MSIX,
> +			.failure_settings = CXL_INT_MSI_MSIX,
> +			.fatal_settings = CXL_INT_MSI_MSIX,
> +			.dcd_settings = 0,
> +		};
> +	}
> +
> +	if (cxl_dcd_supported(mds)) {
> +		policy->dcd_settings = CXL_INT_MSI_MSIX;
> +		size_in += sizeof(policy->dcd_settings);
> +	}
>  
>  	mbox_cmd = (struct cxl_mbox_cmd) {
>  		.opcode = CXL_MBOX_OP_SET_EVT_INT_POLICY,
>  		.payload_in = policy,
> -		.size_in = sizeof(*policy),
> +		.size_in = size_in,
>  	};
>  
>  	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> @@ -731,6 +742,31 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
>  	return 0;
>  }
>  
> +static int cxl_irqsetup(struct cxl_memdev_state *mds,
> +			struct cxl_event_interrupt_policy *policy,
> +			bool native_cxl)
> +{
> +	struct cxl_dev_state *cxlds = &mds->cxlds;
> +	int rc;
> +
> +	if (native_cxl) {
> +		rc = cxl_event_irqsetup(mds, policy);
> +		if (rc)
> +			return rc;
> +	}
> +
> +	if (cxl_dcd_supported(mds)) {
> +		rc = cxl_event_req_irq(cxlds, policy->dcd_settings);
> +		if (rc) {
> +			dev_err(cxlds->dev, "Failed to get interrupt for DCD event log\n");
> +			cxl_disable_dcd(mds);
> +			return rc;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  static bool cxl_event_int_is_fw(u8 setting)
>  {
>  	u8 mode = FIELD_GET(CXLDEV_EVENT_INT_MODE_MASK, setting);
> @@ -757,17 +793,25 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  			    struct cxl_memdev_state *mds, bool irq_avail)
>  {
>  	struct cxl_event_interrupt_policy policy = { 0 };
> +	bool native_cxl = host_bridge->native_cxl_error;
>  	int rc;
>  
>  	/*
>  	 * When BIOS maintains CXL error reporting control, it will process
>  	 * event records.  Only one agent can do so.
> +	 *
> +	 * If BIOS has control of events and DCD is not supported skip event
> +	 * configuration.
>  	 */
> -	if (!host_bridge->native_cxl_error)
> +	if (!native_cxl && !cxl_dcd_supported(mds))
>  		return 0;
>  
>  	if (!irq_avail) {
>  		dev_info(mds->cxlds.dev, "No interrupt support, disable event processing.\n");
> +		if (cxl_dcd_supported(mds)) {
> +			dev_info(mds->cxlds.dev, "DCD requires interrupts, disable DCD\n");
> +			cxl_disable_dcd(mds);
> +		}
>  		return 0;
>  	}
>  
> @@ -775,10 +819,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  	if (rc)
>  		return rc;
>  
> -	if (!cxl_event_validate_mem_policy(mds, &policy))
> +	if (native_cxl && !cxl_event_validate_mem_policy(mds, &policy))
>  		return -EBUSY;
>  
> -	rc = cxl_event_config_msgnums(mds, &policy);
> +	rc = cxl_event_config_msgnums(mds, &policy, native_cxl);
>  	if (rc)
>  		return rc;
>  
> @@ -786,12 +830,16 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  	if (rc)
>  		return rc;
>  
> -	rc = cxl_event_irqsetup(mds, &policy);
> +	rc = cxl_irqsetup(mds, &policy, native_cxl);
>  	if (rc)
>  		return rc;
>  
>  	cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
>  
> +	dev_dbg(mds->cxlds.dev, "Event config : %s DCD %s\n",
> +		native_cxl ? "OS" : "BIOS",
> +		cxl_dcd_supported(mds) ? "supported" : "not supported");
> +
>  	return 0;
>  }
>  
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 21/25] dax/region: Create resources on sparse DAX regions
  2024-08-16 14:44 ` [PATCH v3 21/25] dax/region: Create resources on sparse DAX regions ira.weiny
@ 2024-08-18 11:38   ` Markus Elfring
  2024-08-19 23:30   ` Dave Jiang
  2024-08-27 14:12   ` Jonathan Cameron
  2 siblings, 0 replies; 120+ messages in thread
From: Markus Elfring @ 2024-08-18 11:38 UTC (permalink / raw)
  To: Ira Weiny, Navneet Singh, linux-cxl, linux-btrfs, nvdimm,
	Andrew Morton, Andy Shevchenko, Chris Mason, Dave Jiang,
	David Sterba, Fan Ni, Jonathan Cameron, Jonathan Corbet,
	Josef Bacik, Petr Mladek, Sergey Senozhatsky, Steven Rostedt,
	Rasmus Villemoes
  Cc: linux-doc, LKML, Alison Schofield, Dan Williams, Davidlohr Bueso,
	Vishal Verma

…
> +++ b/drivers/cxl/core/extent.c
> @@ -271,20 +271,67 @@ static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
>  	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
>  }
>
> +static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> +			      struct region_extent *region_extent)
> +{
> +	device_lock(dev);
> +	if (dev->driver) {
> +	}
> +	device_unlock(dev);
> +	return rc;
> +}
…

Under which circumstances would you become interested to apply a statement
like “guard(device)(dev);”?
https://elixir.bootlin.com/linux/v6.11-rc3/source/include/linux/device.h#L1027

Regards,
Markus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 17/25] cxl/core: Return endpoint decoder information from region search
  2024-08-16 14:44 ` [PATCH v3 17/25] cxl/core: Return endpoint decoder information from region search Ira Weiny
@ 2024-08-19 16:35   ` Dave Jiang
  2024-08-23 17:12   ` Jonathan Cameron
  2024-09-03  7:10   ` Li, Ming4
  2 siblings, 0 replies; 120+ messages in thread
From: Dave Jiang @ 2024-08-19 16:35 UTC (permalink / raw)
  To: Ira Weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, Ira Weiny wrote:
> cxl_dpa_to_region() finds the region from a <DPA, device> tuple.
> The search involves finding the device endpoint decoder as well.
> 
> Dynamic capacity extent processing uses the endpoint decoder HPA
> information to calculate the HPA offset.  In addition, well behaved
> extents should be contained within an endpoint decoder.
> 
> Return the endpoint decoder found to be used in subsequent DCD code.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/cxl/core/core.h   | 6 ++++--
>  drivers/cxl/core/mbox.c   | 2 +-
>  drivers/cxl/core/memdev.c | 4 ++--
>  drivers/cxl/core/region.c | 8 +++++++-
>  4 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 15b6cf1c19ef..76c4153a9b2c 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -39,7 +39,8 @@ void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
>  int cxl_region_init(void);
>  void cxl_region_exit(void);
>  int cxl_get_poison_by_endpoint(struct cxl_port *port);
> -struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa);
> +struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
> +				     struct cxl_endpoint_decoder **cxled);
>  u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
>  		   u64 dpa);
>  
> @@ -50,7 +51,8 @@ static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
>  	return ULLONG_MAX;
>  }
>  static inline
> -struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
> +struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
> +				     struct cxl_endpoint_decoder **cxled)
>  {
>  	return NULL;
>  }
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 68c26c4be91a..01a447aaa1b1 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -909,7 +909,7 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  		guard(rwsem_read)(&cxl_dpa_rwsem);
>  
>  		dpa = le64_to_cpu(evt->media_hdr.phys_addr) & CXL_DPA_MASK;
> -		cxlr = cxl_dpa_to_region(cxlmd, dpa);
> +		cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
>  		if (cxlr)
>  			hpa = cxl_dpa_to_hpa(cxlr, cxlmd, dpa);
>  
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 7da1f0f5711a..12fb07fb89a6 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -323,7 +323,7 @@ int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa)
>  	if (rc)
>  		goto out;
>  
> -	cxlr = cxl_dpa_to_region(cxlmd, dpa);
> +	cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
>  	if (cxlr)
>  		dev_warn_once(mds->cxlds.dev,
>  			      "poison inject dpa:%#llx region: %s\n", dpa,
> @@ -387,7 +387,7 @@ int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa)
>  	if (rc)
>  		goto out;
>  
> -	cxlr = cxl_dpa_to_region(cxlmd, dpa);
> +	cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
>  	if (cxlr)
>  		dev_warn_once(mds->cxlds.dev,
>  			      "poison clear dpa:%#llx region: %s\n", dpa,
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 35c4a1f4f9bd..8e0884b52f84 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2828,6 +2828,7 @@ int cxl_get_poison_by_endpoint(struct cxl_port *port)
>  struct cxl_dpa_to_region_context {
>  	struct cxl_region *cxlr;
>  	u64 dpa;
> +	struct cxl_endpoint_decoder *cxled;
>  };
>  
>  static int __cxl_dpa_to_region(struct device *dev, void *arg)
> @@ -2861,11 +2862,13 @@ static int __cxl_dpa_to_region(struct device *dev, void *arg)
>  			dev_name(dev));
>  
>  	ctx->cxlr = cxlr;
> +	ctx->cxled = cxled;
>  
>  	return 1;
>  }
>  
> -struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
> +struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
> +				     struct cxl_endpoint_decoder **cxled)
>  {
>  	struct cxl_dpa_to_region_context ctx;
>  	struct cxl_port *port;
> @@ -2877,6 +2880,9 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
>  	if (port && is_cxl_endpoint(port) && cxl_num_decoders_committed(port))
>  		device_for_each_child(&port->dev, &ctx, __cxl_dpa_to_region);
>  
> +	if (cxled)
> +		*cxled = ctx.cxled;
> +
>  	return ctx.cxlr;
>  }
>  
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents
  2024-08-16 14:44 ` [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents ira.weiny
@ 2024-08-19 18:51   ` Dave Jiang
  2024-08-23  2:53     ` Ira Weiny
  2024-08-23 21:32   ` Fan Ni
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 120+ messages in thread
From: Dave Jiang @ 2024-08-19 18:51 UTC (permalink / raw)
  To: ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> A dynamic capacity device (DCD) sends events to signal the host for
> changes in the availability of Dynamic Capacity (DC) memory.  These
> events contain extents describing a DPA range and meta data for memory
> to be added or removed.  Events may be sent from the device at any time.
> 
> Three types of events can be signaled, Add, Release, and Force Release.
> 
> On add, the host may accept or reject the memory being offered.  If no
> region exists, or the extent is invalid, the extent should be rejected.
> Add extent events may be grouped by a 'more' bit which indicates those
> extents should be processed as a group.
> 
> On remove, the host can delay the response until the host is safely not
> using the memory.  If no region exists the release can be sent
> immediately.  The host may also release extents (or partial extents) at
> any time.  Thus the 'more' bit grouping of release events is of less
> value and can be ignored in favor of sending multiple release capacity
> responses for groups of release events.
> 
> Force removal is intended as a mechanism between the FM and the device
> and intended only when the host is unresponsive, out of sync, or
> otherwise broken.  Purposely ignore force removal events.
> 
> Regions are made up of one or more devices which may be surfacing memory
> to the host.  Once all devices in a region have surfaced an extent the
> region can expose a corresponding extent for the user to consume.
> Without interleaving a device extent forms a 1:1 relationship with the
> region extent.  Immediately surface a region extent upon getting a
> device extent.
> 
> Per the specification the device is allowed to offer or remove extents
> at any time.  However, anticipated use cases can expect extents to be
> offered, accepted, and removed in well defined chunks.
> 
> Simplify extent tracking with the following restrictions.
> 
> 	1) Flag for removal any extent which overlaps a requested
> 	   release range.
> 	2) Refuse the offer of extents which overlap already accepted
> 	   memory ranges.
> 	3) Accept again a range which has already been accepted by the
> 	   host.  (It is likely the device has an error because it
> 	   should already know that this range was accepted.  But from
> 	   the host point of view it is safe to acknowledge that
> 	   acceptance again.)
> 
> Management of the region extent devices must be synchronized with
> potential uses of the memory within the DAX layer.  Create region extent
> devices as children of the cxl_dax_region device such that the DAX
> region driver can co-drive them and synchronize with the DAX layer.
> Synchronization and management is handled in a subsequent patch.
> 
> Process DCD events and create region devices.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

A few nits below, but in general
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> 
> ---
> Changes:
> [iweiny: combine this with the extent surface patches to better show the
>          lifetime extent objects in review]
> [iweiny: clean up commit message.]
> [iweiny: move extent verification of the 'read extents on region
>          creation' to this patch]
> [iweiny: Provide for a common path for extent realization between an add
> 	 event and adding existing extents.]
> [iweiny: Persist a check that an extent is within an endpoint decoder]
> [iweiny: reduce exported and non-static calls]
> [iweiny: use %par]
> 
> 	<Combined comments from the old patches which were addressed>
> 
> [Jonathan: implement the more bit with a simple algorithm which accepts
> 	   all extents it can.
> 	   Also include the response more bit to prevent payload
> 	   overflow]
> [Fan: Do not error if a contained extent is added.]
> [Jonathan: allocate ida after kzalloc]
> [iweiny: fix ida resource leak]
> [fan/djiang: remove unneeded memset]
> [djiang: fix indentation]
> [Jonathan: Fix indentation]
> [Jonathan/djbw: make tag a uuid]
> [djbw: create helper calc_hpa_range() straight away]
> [djbw: Allow for multiple cxled_extents per region_extent]
> [djbw: s/cxl_ed/cxled]
> [djbw: s/cxl_release_ed_extent/cxled_release_extent/]
> [djbw: s/reg_ext/region_extent/]
> [djbw: s/dc_extent/extent/]
> [Gregory/djbw: reject shared extents]
> [iweiny: predicate extent.c compile on CONFIG_CXL_REGION]
> ---
>  drivers/cxl/core/Makefile |   2 +-
>  drivers/cxl/core/core.h   |  13 ++
>  drivers/cxl/core/extent.c | 345 ++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/mbox.c   | 268 ++++++++++++++++++++++++++++++++++-
>  drivers/cxl/core/region.c |   6 +
>  drivers/cxl/cxl.h         |  52 ++++++-
>  drivers/cxl/cxlmem.h      |  26 ++++
>  include/linux/cxl-event.h |  32 +++++
>  tools/testing/cxl/Kbuild  |   3 +-
>  9 files changed, 743 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 9259bcc6773c..3b812515e725 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -15,4 +15,4 @@ cxl_core-y += hdm.o
>  cxl_core-y += pmu.o
>  cxl_core-y += cdat.o
>  cxl_core-$(CONFIG_TRACING) += trace.o
> -cxl_core-$(CONFIG_CXL_REGION) += region.o
> +cxl_core-$(CONFIG_CXL_REGION) += region.o extent.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 76c4153a9b2c..8dfc97b2e0a4 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -44,12 +44,24 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
>  u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
>  		   u64 dpa);
>  
> +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
> +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
>  #else
>  static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
>  				 const struct cxl_memdev *cxlmd, u64 dpa)
>  {
>  	return ULLONG_MAX;
>  }
> +static inline int cxl_add_extent(struct cxl_memdev_state *mds,
> +				   struct cxl_extent *extent)
> +{
> +	return 0;
> +}
> +static inline int cxl_rm_extent(struct cxl_memdev_state *mds,
> +				struct cxl_extent *extent)
> +{
> +	return 0;
> +}
>  static inline
>  struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
>  				     struct cxl_endpoint_decoder **cxled)
> @@ -121,5 +133,6 @@ long cxl_pci_get_latency(struct pci_dev *pdev);
>  int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
>  				       enum access_coordinate_class access);
>  bool cxl_need_node_perf_attrs_update(int nid);
> +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range);
>  
>  #endif /* __CXL_CORE_H__ */
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> new file mode 100644
> index 000000000000..34456594cdc3
> --- /dev/null
> +++ b/drivers/cxl/core/extent.c
> @@ -0,0 +1,345 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*  Copyright(c) 2024 Intel Corporation. All rights reserved. */
> +
> +#include <linux/device.h>
> +#include <cxl.h>
> +
> +#include "core.h"
> +
> +static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
> +				 struct cxled_extent *ed_extent)
> +{
> +	struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> +	struct device *dev = &cxled->cxld.dev;
> +
> +	dev_dbg(dev, "Remove extent %par (%*phC)\n", &ed_extent->dpa_range,
> +		CXL_EXTENT_TAG_LEN, ed_extent->tag);
> +	memdev_release_extent(mds, &ed_extent->dpa_range);
> +	kfree(ed_extent);
> +}
> +
> +static void free_region_extent(struct region_extent *region_extent)
> +{
> +	struct cxled_extent *ed_extent;
> +	unsigned long index;
> +
> +	/*
> +	 * Remove from each endpoint decoder the extent which backs this region
> +	 * extent
> +	 */
> +	xa_for_each(&region_extent->decoder_extents, index, ed_extent)
> +		cxled_release_extent(ed_extent->cxled, ed_extent);
> +	xa_destroy(&region_extent->decoder_extents);
> +	ida_free(&region_extent->cxlr_dax->extent_ida, region_extent->dev.id);
> +	kfree(region_extent);
> +}
> +
> +static void region_extent_release(struct device *dev)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +
> +	free_region_extent(region_extent);
> +}
> +
> +static const struct device_type region_extent_type = {
> +	.name = "extent",
> +	.release = region_extent_release,
> +};
> +
> +bool is_region_extent(struct device *dev)
> +{
> +	return dev->type == &region_extent_type;
> +}
> +EXPORT_SYMBOL_NS_GPL(is_region_extent, CXL);
> +
> +static void region_extent_unregister(void *ext)
> +{
> +	struct region_extent *region_extent = ext;
> +
> +	dev_dbg(&region_extent->dev, "DAX region rm extent HPA %par\n",
> +		&region_extent->hpa_range);
> +	device_unregister(&region_extent->dev);
> +}
> +
> +static void region_rm_extent(struct region_extent *region_extent)
> +{
> +	struct device *region_dev = region_extent->dev.parent;
> +
> +	devm_release_action(region_dev, region_extent_unregister, region_extent);
> +}
> +
> +static struct region_extent *
> +alloc_region_extent(struct cxl_dax_region *cxlr_dax, struct range *hpa_range, u8 *tag)
> +{
> +	int id;
> +
> +	struct region_extent *region_extent __free(kfree) =
> +				kzalloc(sizeof(*region_extent), GFP_KERNEL);
> +	if (!region_extent)
> +		return ERR_PTR(-ENOMEM);
> +
> +	id = ida_alloc(&cxlr_dax->extent_ida, GFP_KERNEL);
> +	if (id < 0)
> +		return ERR_PTR(-ENOMEM);
> +
> +	region_extent->hpa_range = *hpa_range;
> +	region_extent->cxlr_dax = cxlr_dax;
> +	import_uuid(&region_extent->tag, tag);
> +	region_extent->dev.id = id;
> +	xa_init(&region_extent->decoder_extents);
> +	return no_free_ptr(region_extent);
> +}
> +
> +static int online_region_extent(struct region_extent *region_extent)
> +{
> +	struct cxl_dax_region *cxlr_dax = region_extent->cxlr_dax;
> +	struct device *dev;
> +	int rc;
> +
> +	dev = &region_extent->dev;

Nit. You can move this up to when you declare 'dev'.

> +	device_initialize(dev);
> +	device_set_pm_not_required(dev);
> +	dev->parent = &cxlr_dax->dev;
> +	dev->type = &region_extent_type;
> +	rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id, dev->id);
> +	if (rc)
> +		goto err;
> +
> +	rc = device_add(dev);
> +	if (rc)
> +		goto err;
> +
> +	dev_dbg(dev, "region extent HPA %par\n", &region_extent->hpa_range);
> +	return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
> +					region_extent);
> +
> +err:
> +	dev_err(&cxlr_dax->dev, "Failed to initialize region extent HPA %par\n",
> +		&region_extent->hpa_range);
> +
> +	put_device(dev);
> +	return rc;
> +}
> +
> +struct match_data {
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range *new_range;
> +};
> +
> +static int match_contains(struct device *dev, void *data)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	struct match_data *md = data;
> +	struct cxled_extent *entry;
> +	unsigned long index;
> +
> +	if (!region_extent)
> +		return 0;
> +
> +	xa_for_each(&region_extent->decoder_extents, index, entry) {
> +		if (md->cxled == entry->cxled &&
> +		    range_contains(&entry->dpa_range, md->new_range))
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static bool extents_contain(struct cxl_dax_region *cxlr_dax,
> +			    struct cxl_endpoint_decoder *cxled,
> +			    struct range *new_range)
> +{
> +	struct device *extent_device;
> +	struct match_data md = {
> +		.cxled = cxled,
> +		.new_range = new_range,
> +	};
> +
> +	extent_device = device_find_child(&cxlr_dax->dev, &md, match_contains);
> +	if (!extent_device)
> +		return false;
> +
> +	put_device(extent_device);
> +	return true;
> +}
> +
> +static int match_overlaps(struct device *dev, void *data)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	struct match_data *md = data;
> +	struct cxled_extent *entry;
> +	unsigned long index;
> +
> +	if (!region_extent)
> +		return 0;
> +
> +	xa_for_each(&region_extent->decoder_extents, index, entry) {
> +		if (md->cxled == entry->cxled &&
> +		    range_overlaps(&entry->dpa_range, md->new_range))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static bool extents_overlap(struct cxl_dax_region *cxlr_dax,
> +			    struct cxl_endpoint_decoder *cxled,
> +			    struct range *new_range)
> +{
> +	struct device *extent_device;
> +	struct match_data md = {
> +		.cxled = cxled,
> +		.new_range = new_range,
> +	};
> +
> +	extent_device = device_find_child(&cxlr_dax->dev, &md, match_overlaps);
> +	if (!extent_device)
> +		return false;
> +
> +	put_device(extent_device);
> +	return true;
> +}
> +
> +static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
> +			   struct cxl_dax_region *cxlr_dax,
> +			   struct range *dpa_range,
> +			   struct range *hpa_range)
> +{
> +	resource_size_t dpa_offset, hpa;
> +
> +	dpa_offset = dpa_range->start - cxled->dpa_res->start;
> +	hpa = cxled->cxld.hpa_range.start + dpa_offset;
> +
> +	hpa_range->start = hpa - cxlr_dax->hpa_range.start;
> +	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
> +}
> +
> +static int cxlr_rm_extent(struct device *dev, void *data)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	struct range *region_hpa_range = data;
> +
> +	if (!region_extent)
> +		return 0;
> +
> +	/*
> +	 * Any extent which 'touches' the released range is removed.
> +	 */
> +	if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) {
> +		dev_dbg(dev, "Remove region extent HPA %par\n",
> +			&region_extent->hpa_range);
> +		region_rm_extent(region_extent);
> +	}
> +	return 0;
> +}
> +
> +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
> +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range hpa_range, dpa_range;
> +	struct cxl_region *cxlr;
> +
> +	dpa_range = (struct range) {
> +		.start = start_dpa,
> +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> +	};
> +
> +	guard(rwsem_read)(&cxl_region_rwsem);
> +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> +	if (!cxlr) {
> +		memdev_release_extent(mds, &dpa_range);
> +		return -ENXIO;
> +	}
> +
> +	calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
> +
> +	/* Remove region extents which overlap */
> +	return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
> +				     cxlr_rm_extent);
> +}
> +
> +static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
> +			   struct cxl_endpoint_decoder *cxled,
> +			   struct cxled_extent *ed_extent)
> +{
> +	struct region_extent *region_extent;
> +	struct range hpa_range;
> +	int rc;
> +
> +	calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range);
> +
> +	region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag);
> +	if (IS_ERR(region_extent))
> +		return PTR_ERR(region_extent);
> +
> +	rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent, ed_extent,
> +		       GFP_KERNEL);
> +	if (rc) {
> +		free_region_extent(region_extent);
> +		return rc;
> +	}
> +
> +	/* device model handles freeing region_extent */
> +	return online_region_extent(region_extent);
> +}
> +
> +/* Callers are expected to ensure cxled has been attached to a region */
> +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
> +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range ed_range, ext_range;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct cxled_extent *ed_extent;
> +	struct cxl_region *cxlr;
> +	struct device *dev;
> +
> +	ext_range = (struct range) {
> +		.start = start_dpa,
> +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> +	};
> +
> +	guard(rwsem_read)(&cxl_region_rwsem);
> +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> +	if (!cxlr)
> +		return -ENXIO;
> +
> +	cxlr_dax = cxled->cxld.region->cxlr_dax;
> +	dev = &cxled->cxld.dev;
> +	ed_range = (struct range) {
> +		.start = cxled->dpa_res->start,
> +		.end = cxled->dpa_res->end,
> +	};
> +
> +	dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n",
> +		cxled->dpa_res, &ext_range);
> +
> +	if (!range_contains(&ed_range, &ext_range)) {
> +		dev_err_ratelimited(dev,
> +				    "DC extent DPA %par (%*phC) is not fully in ED %par\n",
> +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> +				    extent->tag, &ed_range);
> +		return -ENXIO;
> +	}
> +
> +	if (extents_contain(cxlr_dax, cxled, &ext_range))
> +		return 0;
> +
> +	if (extents_overlap(cxlr_dax, cxled, &ext_range))
> +		return -ENXIO;
> +
> +	ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL);
> +	if (!ed_extent)
> +		return -ENOMEM;
> +
> +	ed_extent->cxled = cxled;
> +	ed_extent->dpa_range = ext_range;
> +	memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN);
> +
> +	dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range,
> +		CXL_EXTENT_TAG_LEN, ed_extent->tag);
> +
> +	return cxlr_add_extent(cxlr_dax, cxled, ed_extent);
> +}
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 01a447aaa1b1..f629ad7488ac 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>  
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> +			       struct cxl_extent *extent)
> +{
> +	u64 start = le64_to_cpu(extent->start_dpa);
> +	u64 length = le64_to_cpu(extent->length);
> +	struct device *dev = mds->cxlds.dev;
> +
> +	struct range ext_range = (struct range){
> +		.start = start,
> +		.end = start + length - 1,
> +	};
> +
> +	if (le16_to_cpu(extent->shared_extn_seq) != 0) {
> +		dev_err_ratelimited(dev,
> +				    "DC extent DPA %par (%*phC) can not be shared\n",
> +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> +				    extent->tag);
> +		return -ENXIO;
> +	}
> +
> +	/* Extents must not cross DC region boundary's */
> +	for (int i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +		struct range region_range = (struct range) {
> +			.start = dcr->base,
> +			.end = dcr->base + dcr->decode_len - 1,
> +		};
> +
> +		if (range_contains(&region_range, &ext_range)) {
> +			dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n",
> +				&ext_range, i, start - dcr->base,
> +				CXL_EXTENT_TAG_LEN, extent->tag);
> +			return 0;
> +		}
> +	}
> +
> +	dev_err_ratelimited(dev,
> +			    "DC extent DPA %par (%*phC) is not in any DC region\n",
> +			    &ext_range, CXL_EXTENT_TAG_LEN, extent->tag);
> +	return -ENXIO;
> +}
> +
>  void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  			    enum cxl_event_log_type type,
>  			    enum cxl_event_type event_type,
> @@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
>  	return rc;
>  }
>  
> +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> +				struct xarray *extent_array, int cnt)
> +{
> +	struct cxl_mbox_dc_response *p;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	struct cxl_extent *extent;
> +	unsigned long index;
> +	u32 pl_index;
> +	int rc = 0;
> +
> +	size_t pl_size = struct_size(p, extent_list, cnt);
> +	u32 max_extents = cnt;
> +
> +	/* May have to use more bit on response. */
> +	if (pl_size > mds->payload_size) {
> +		max_extents = (mds->payload_size - sizeof(*p)) /
> +			      sizeof(struct updated_extent_list);
> +		pl_size = struct_size(p, extent_list, max_extents);
> +	}
> +
> +	struct cxl_mbox_dc_response *response __free(kfree) =
> +						kzalloc(pl_size, GFP_KERNEL);
> +	if (!response)
> +		return -ENOMEM;
> +
> +	pl_index = 0;
> +	xa_for_each(extent_array, index, extent) {
> +
> +		response->extent_list[pl_index].dpa_start = extent->start_dpa;
> +		response->extent_list[pl_index].length = extent->length;
> +		pl_index++;
> +		response->extent_list_size = cpu_to_le32(pl_index);
> +
> +		if (pl_index == max_extents) {
> +			mbox_cmd = (struct cxl_mbox_cmd) {
> +				.opcode = opcode,
> +				.size_in = struct_size(response, extent_list,
> +						       pl_index),
> +				.payload_in = response,
> +			};
> +
> +			response->flags = 0;
> +			if (pl_index < cnt)
> +				response->flags &= CXL_DCD_EVENT_MORE;
> +
> +			rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +			if (rc)
> +				return rc;
> +			pl_index = 0;
> +		}
> +	}
> +
> +	if (pl_index) {
> +		mbox_cmd = (struct cxl_mbox_cmd) {
> +			.opcode = opcode,
> +			.size_in = struct_size(response, extent_list,
> +					       pl_index),
> +			.payload_in = response,
> +		};
> +
> +		response->flags = 0;
> +		rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	}
> +
> +	return rc;
> +}
> +
> +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct xarray extent_list;
> +
> +	struct cxl_extent extent = {
> +		.start_dpa = cpu_to_le64(range->start),
> +		.length = cpu_to_le64(range_len(range)),
> +	};
> +
> +	dev_dbg(dev, "Release response dpa %par\n", range);
> +
> +	xa_init(&extent_list);
> +	if (xa_insert(&extent_list, 0, &extent, GFP_KERNEL)) {
> +		dev_dbg(dev, "Failed to release %par\n", range);
> +		goto destroy;
> +	}
> +
> +	if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))
> +		dev_dbg(dev, "Failed to release %par\n", range);
> +
> +destroy:
> +	xa_destroy(&extent_list);
> +}
> +
> +static int validate_add_extent(struct cxl_memdev_state *mds,
> +			       struct cxl_extent *extent)
> +{
> +	int rc;
> +
> +	rc = cxl_validate_extent(mds, extent);
> +	if (rc)
> +		return rc;
> +
> +	return cxl_add_extent(mds, extent);
> +}
> +
> +static int cxl_add_pending(struct cxl_memdev_state *mds)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_extent *extent;
> +	unsigned long index;
> +	unsigned long cnt = 0;
reverse xmas tree

> +	int rc;
> +
> +	xa_for_each(&mds->pending_extents, index, extent) {
> +		if (validate_add_extent(mds, extent)) {
> +			dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n",
> +				le64_to_cpu(extent->start_dpa),
> +				le64_to_cpu(extent->length));
> +			xa_erase(&mds->pending_extents, index);
> +			kfree(extent);
> +			continue;
> +		}
> +		cnt++;
> +	}
> +	rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> +				  &mds->pending_extents, cnt);
> +	xa_for_each(&mds->pending_extents, index, extent) {
> +		xa_erase(&mds->pending_extents, index);
> +		kfree(extent);
> +	}
> +	return rc;
> +}
> +
> +static int handle_add_event(struct cxl_memdev_state *mds,
> +			    struct cxl_event_dcd *event)
> +{
> +	struct cxl_extent *tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
for readability I would use *extent instead of *tmp

> +	struct device *dev = mds->cxlds.dev;
> +
> +	if (!tmp)
> +		return -ENOMEM;
> +
> +	memcpy(tmp, &event->extent, sizeof(*tmp));
> +	if (xa_insert(&mds->pending_extents, (unsigned long)tmp, tmp,
> +		      GFP_KERNEL)) {
> +		kfree(tmp);
> +		return -ENOMEM;
> +	}
> +
> +	if (event->flags & CXL_DCD_EVENT_MORE) {
> +		dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
> +		return 0;
> +	}
> +
> +	/* extents are removed and free'ed in cxl_add_pending() */
> +	return cxl_add_pending(mds);
> +}
> +
> +static char *cxl_dcd_evt_type_str(u8 type)
> +{
> +	switch (type) {
> +	case DCD_ADD_CAPACITY:
> +		return "add";
> +	case DCD_RELEASE_CAPACITY:
> +		return "release";
> +	case DCD_FORCED_CAPACITY_RELEASE:
> +		return "force release";
> +	default:
> +		break;
> +	}
> +
> +	return "<unknown>";
> +}
> +
> +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> +					struct cxl_event_record_raw *raw_rec)
> +{
> +	struct cxl_event_dcd *event = &raw_rec->event.dcd;
> +	struct cxl_extent *extent = &event->extent;
> +	struct device *dev = mds->cxlds.dev;
> +	uuid_t *id = &raw_rec->id;
> +
> +	if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID))
> +		return -EINVAL;
> +
> +	dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n",
> +		cxl_dcd_evt_type_str(event->event_type),
> +		le64_to_cpu(extent->start_dpa), le64_to_cpu(extent->length));
> +
> +	switch (event->event_type) {
> +	case DCD_ADD_CAPACITY:
> +		return handle_add_event(mds, event);
> +	case DCD_RELEASE_CAPACITY:
> +		return cxl_rm_extent(mds, &event->extent);
> +	case DCD_FORCED_CAPACITY_RELEASE:
> +		dev_err_ratelimited(dev, "Forced release event ignored.\n");
> +		return 0;
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
>  static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>  				    enum cxl_event_log_type type)
>  {
> @@ -1044,9 +1287,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>  		if (!nr_rec)
>  			break;
>  
> -		for (i = 0; i < nr_rec; i++)
> +		for (i = 0; i < nr_rec; i++) {
>  			__cxl_event_trace_record(cxlmd, type,
>  						 &payload->records[i]);
> +			if (type == CXL_EVENT_TYPE_DCD) {
> +				rc = cxl_handle_dcd_event_records(mds,
> +								  &payload->records[i]);
> +				if (rc)
> +					dev_err_ratelimited(dev, "dcd event failed: %d\n",
> +							    rc);
> +			}
> +		}
>  
>  		if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
>  			trace_cxl_overflow(cxlmd, type, payload);
> @@ -1078,6 +1329,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
>  {
>  	dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status);
>  
> +	if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD))
> +		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
>  	if (status & CXLDEV_EVENT_STATUS_FATAL)
>  		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_FATAL);
>  	if (status & CXLDEV_EVENT_STATUS_FAIL)
> @@ -1610,6 +1863,17 @@ int cxl_poison_state_init(struct cxl_memdev_state *mds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_poison_state_init, CXL);
>  
> +static void clear_pending_extents(void *_mds)
> +{
> +	struct cxl_memdev_state *mds = _mds;
> +	struct cxl_extent *extent;
> +	unsigned long index;
> +
> +	xa_for_each(&mds->pending_extents, index, extent)
> +		kfree(extent);
> +	xa_destroy(&mds->pending_extents);
> +}
> +
>  struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>  {
>  	struct cxl_memdev_state *mds;
> @@ -1628,6 +1892,8 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>  	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
>  	mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID;
>  	mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID;
> +	xa_init(&mds->pending_extents);
> +	devm_add_action_or_reset(dev, clear_pending_extents, mds);
>  
>  	return mds;
>  }
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 8e0884b52f84..8c9171f914fb 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3037,6 +3037,7 @@ static void cxl_dax_region_release(struct device *dev)
>  {
>  	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
>  
> +	ida_destroy(&cxlr_dax->extent_ida);
>  	kfree(cxlr_dax);
>  }
>  
> @@ -3090,6 +3091,8 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
>  
>  	dev = &cxlr_dax->dev;
>  	cxlr_dax->cxlr = cxlr;
> +	cxlr->cxlr_dax = cxlr_dax;
> +	ida_init(&cxlr_dax->extent_ida);
>  	device_initialize(dev);
>  	lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
>  	device_set_pm_not_required(dev);
> @@ -3190,7 +3193,10 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
>  static void cxlr_dax_unregister(void *_cxlr_dax)
>  {
>  	struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> +	struct cxl_region *cxlr = cxlr_dax->cxlr;
>  
> +	cxlr->cxlr_dax = NULL;
> +	cxlr_dax->cxlr = NULL;
>  	device_unregister(&cxlr_dax->dev);
>  }
>  
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 16861c867537..c858e3957fd5 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -11,6 +11,7 @@
>  #include <linux/log2.h>
>  #include <linux/node.h>
>  #include <linux/io.h>
> +#include <linux/cxl-event.h>
>  
>  extern const struct nvdimm_security_ops *cxl_security_ops;
>  
> @@ -169,11 +170,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
>  #define CXLDEV_EVENT_STATUS_WARN		BIT(1)
>  #define CXLDEV_EVENT_STATUS_FAIL		BIT(2)
>  #define CXLDEV_EVENT_STATUS_FATAL		BIT(3)
> +#define CXLDEV_EVENT_STATUS_DCD			BIT(4)
>  
>  #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO |	\
>  				 CXLDEV_EVENT_STATUS_WARN |	\
>  				 CXLDEV_EVENT_STATUS_FAIL |	\
> -				 CXLDEV_EVENT_STATUS_FATAL)
> +				 CXLDEV_EVENT_STATUS_FATAL |	\
> +				 CXLDEV_EVENT_STATUS_DCD)
>  
>  /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
>  #define CXLDEV_EVENT_INT_MODE_MASK	GENMASK(1, 0)
> @@ -444,6 +447,18 @@ enum cxl_decoder_state {
>  	CXL_DECODER_STATE_AUTO,
>  };
>  
> +/**
> + * struct cxled_extent - Extent within an endpoint decoder
> + * @cxled: Reference to the endpoint decoder
> + * @dpa_range: DPA range this extent covers within the decoder
> + * @tag: Tag from device for this extent
> + */
> +struct cxled_extent {
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range dpa_range;
> +	u8 tag[CXL_EXTENT_TAG_LEN];
> +};
> +
>  /**
>   * struct cxl_endpoint_decoder - Endpoint  / SPA to DPA decoder
>   * @cxld: base cxl_decoder_object
> @@ -569,6 +584,7 @@ struct cxl_region_params {
>   * @type: Endpoint decoder target type
>   * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
>   * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> + * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
>   * @flags: Region state flags
>   * @params: active + config params for the region
>   * @coord: QoS access coordinates for the region
> @@ -582,6 +598,7 @@ struct cxl_region {
>  	enum cxl_decoder_type type;
>  	struct cxl_nvdimm_bridge *cxl_nvb;
>  	struct cxl_pmem_region *cxlr_pmem;
> +	struct cxl_dax_region *cxlr_dax;
>  	unsigned long flags;
>  	struct cxl_region_params params;
>  	struct access_coordinate coord[ACCESS_COORDINATE_MAX];
> @@ -622,12 +639,45 @@ struct cxl_pmem_region {
>  	struct cxl_pmem_region_mapping mapping[];
>  };
>  
> +/* See CXL 3.0 8.2.9.2.1.5 */
> +enum dc_event {
> +	DCD_ADD_CAPACITY,
> +	DCD_RELEASE_CAPACITY,
> +	DCD_FORCED_CAPACITY_RELEASE,
> +	DCD_REGION_CONFIGURATION_UPDATED,
> +};
> +
>  struct cxl_dax_region {
>  	struct device dev;
>  	struct cxl_region *cxlr;
>  	struct range hpa_range;
> +	struct ida extent_ida;
>  };
>  
> +/**
> + * struct region_extent - CXL DAX region extent
> + * @dev: device representing this extent
> + * @cxlr_dax: back reference to parent region device
> + * @hpa_range: HPA range of this extent
> + * @tag: tag of the extent
> + * @decoder_extents: Endpoint decoder extents which make up this region extent
> + */
> +struct region_extent {
> +	struct device dev;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct range hpa_range;
> +	uuid_t tag;
> +	struct xarray decoder_extents;
> +};
> +
> +bool is_region_extent(struct device *dev);
> +static inline struct region_extent *to_region_extent(struct device *dev)
> +{
> +	if (!is_region_extent(dev))
> +		return NULL;
> +	return container_of(dev, struct region_extent, dev);
> +}
> +
>  /**
>   * struct cxl_port - logical collection of upstream port devices and
>   *		     downstream port devices to construct a CXL memory
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index d41bec5433db..3a40fe1f0be7 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -497,6 +497,7 @@ struct cxl_dc_region_info {
>   * @pmem_perf: performance data entry matched to PMEM partition
>   * @nr_dc_region: number of DC regions implemented in the memory device
>   * @dc_region: array containing info about the DC regions
> + * @pending_extents: array of extents pending during more bit processing
>   * @event: event log driver state
>   * @poison: poison driver state info
>   * @security: security driver state info
> @@ -532,6 +533,7 @@ struct cxl_memdev_state {
>  
>  	u8 nr_dc_region;
>  	struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> +	struct xarray pending_extents;
>  
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
> @@ -607,6 +609,21 @@ enum cxl_opcode {
>  	UUID_INIT(0x5e1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19,     \
>  		  0x40, 0x3d, 0x86)
>  
> +/*
> + * Add Dynamic Capacity Response
> + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
> + */
> +struct cxl_mbox_dc_response {
> +	__le32 extent_list_size;
> +	u8 flags;
> +	u8 reserved[3];
> +	struct updated_extent_list {
> +		__le64 dpa_start;
> +		__le64 length;
> +		u8 reserved[8];
> +	} __packed extent_list[];
> +} __packed;
> +
>  struct cxl_mbox_get_supported_logs {
>  	__le16 entries;
>  	u8 rsvd[6];
> @@ -669,6 +686,14 @@ struct cxl_mbox_identify {
>  	UUID_INIT(0xfe927475, 0xdd59, 0x4339, 0xa5, 0x86, 0x79, 0xba, 0xb1, \
>  		  0x13, 0xb7, 0x74)
>  
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.1 section 8.2.9.2.1; Table 8-43
> + */
> +#define CXL_EVENT_DC_EVENT_UUID                                             \
> +	UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \
> +		  0x10, 0x1a, 0x2a)
> +
>  /*
>   * Get Event Records output payload
>   * CXL rev 3.0 section 8.2.9.2.2; Table 8-50
> @@ -694,6 +719,7 @@ enum cxl_event_log_type {
>  	CXL_EVENT_TYPE_WARN,
>  	CXL_EVENT_TYPE_FAIL,
>  	CXL_EVENT_TYPE_FATAL,
> +	CXL_EVENT_TYPE_DCD,
>  	CXL_EVENT_TYPE_MAX
>  };
>  
> diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h
> index 0bea1afbd747..eeda8059d81a 100644
> --- a/include/linux/cxl-event.h
> +++ b/include/linux/cxl-event.h
> @@ -96,11 +96,43 @@ struct cxl_event_mem_module {
>  	u8 reserved[0x3d];
Previous code, but 61 would be better than 0x3d to be consistent with rest of cxl code

>  } __packed;
>  
> +/*
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
> + */
> +#define CXL_EXTENT_TAG_LEN 0x10
> +struct cxl_extent {
> +	__le64 start_dpa;
> +	__le64 length;
> +	u8 tag[CXL_EXTENT_TAG_LEN];
> +	__le16 shared_extn_seq;
> +	u8 reserved[0x6];

Why not just 6? In general I find it odd that this header uses hex for array indexing when the rest of the cxl code uses decimal. 

> +} __packed;
> +
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
> + */
> +#define CXL_DCD_EVENT_MORE			BIT(0)
> +struct cxl_event_dcd {
> +	struct cxl_event_record_hdr hdr;
> +	u8 event_type;
> +	u8 validity_flags;
> +	__le16 host_id;
> +	u8 region_index;
> +	u8 flags;
> +	u8 reserved1[0x2];

also here, 2?

> +	struct cxl_extent extent;
> +	u8 reserved2[0x18];

24?

> +	__le32 num_avail_extents;
> +	__le32 num_avail_tags;
> +} __packed;
> +
>  union cxl_event {
>  	struct cxl_event_generic generic;
>  	struct cxl_event_gen_media gen_media;
>  	struct cxl_event_dram dram;
>  	struct cxl_event_mem_module mem_module;
> +	struct cxl_event_dcd dcd;
>  	/* dram & gen_media event header */
>  	struct cxl_event_media_hdr media_hdr;
>  } __packed;
> diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> index 030b388800f0..8238588fffdf 100644
> --- a/tools/testing/cxl/Kbuild
> +++ b/tools/testing/cxl/Kbuild
> @@ -61,7 +61,8 @@ cxl_core-y += $(CXL_CORE_SRC)/hdm.o
>  cxl_core-y += $(CXL_CORE_SRC)/pmu.o
>  cxl_core-y += $(CXL_CORE_SRC)/cdat.o
>  cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
> -cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o
> +cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o \
> +				 $(CXL_CORE_SRC)/extent.o
>  cxl_core-y += config_check.o
>  cxl_core-y += cxl_core_test.o
>  cxl_core-y += cxl_core_exports.o
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 19/25] cxl/region/extent: Expose region extent information in sysfs
  2024-08-16 14:44 ` [PATCH v3 19/25] cxl/region/extent: Expose region extent information in sysfs ira.weiny
@ 2024-08-19 19:05   ` Dave Jiang
  2024-08-23  2:58     ` Ira Weiny
  2024-08-23 17:19   ` Jonathan Cameron
  2024-08-28 17:44   ` Fan Ni
  2 siblings, 1 reply; 120+ messages in thread
From: Dave Jiang @ 2024-08-19 19:05 UTC (permalink / raw)
  To: ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Extent information can be helpful to the user to coordinate memory usage
> with the external orchestrator and FM.
> 
> Expose the details of region extents by creating the following
> sysfs entries.
> 
>         /sys/bus/cxl/devices/dax_regionX/extentX.Y
>         /sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
>         /sys/bus/cxl/devices/dax_regionX/extentX.Y/length
>         /sys/bus/cxl/devices/dax_regionX/extentX.Y/tag
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes:
> [iweiny: split this out]
> [Jonathan: add documentation for extent sysfs]
> [Jonathan/djbw: s/label/tag]
> [Jonathan/djbw: treat tag as uuid]
> [djbw: use __ATTRIBUTE_GROUPS]
> [djbw: make tag invisible if it is empty]
> [djbw/iweiny: use conventional id names for extents; extentX.Y]
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl | 13 ++++++++
>  drivers/cxl/core/extent.c               | 58 +++++++++++++++++++++++++++++++++
>  2 files changed, 71 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 3a5ee88e551b..e97e6a73c960 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -599,3 +599,16 @@ Description:
>  		See Documentation/ABI/stable/sysfs-devices-node. access0 provides
>  		the number to the closest initiator and access1 provides the
>  		number to the closest CPU.
> +
> +What:		/sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
> +		/sys/bus/cxl/devices/dax_regionX/extentX.Y/length
> +		/sys/bus/cxl/devices/dax_regionX/extentX.Y/tag

I wonder consider an entry for each with their own descriptions, which seems to be the standard practice.

DJ

> +Date:		October, 2024
> +KernelVersion:	v6.12
> +Contact:	linux-cxl@vger.kernel.org
> +Description:
> +		(RO) [For Dynamic Capacity regions only]  Extent offset and
> +		length within the region.  Users can use the extent information
> +		to create DAX devices on specific extents.  This is done by
> +		creating and destroying DAX devices in specific sequences and
> +		looking at the mappings created.
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index 34456594cdc3..d7d526a51e2b 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -6,6 +6,63 @@
>  
>  #include "core.h"
>  
> +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> +			   char *buf)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +
> +	return sysfs_emit(buf, "%#llx\n", region_extent->hpa_range.start);
> +}
> +static DEVICE_ATTR_RO(offset);
> +
> +static ssize_t length_show(struct device *dev, struct device_attribute *attr,
> +			   char *buf)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	u64 length = range_len(&region_extent->hpa_range);
> +
> +	return sysfs_emit(buf, "%#llx\n", length);
> +}
> +static DEVICE_ATTR_RO(length);
> +
> +static ssize_t tag_show(struct device *dev, struct device_attribute *attr,
> +			char *buf)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +
> +	return sysfs_emit(buf, "%pUb\n", &region_extent->tag);
> +}
> +static DEVICE_ATTR_RO(tag);
> +
> +static struct attribute *region_extent_attrs[] = {
> +	&dev_attr_offset.attr,
> +	&dev_attr_length.attr,
> +	&dev_attr_tag.attr,
> +	NULL,
> +};
> +
> +static uuid_t empty_tag = { 0 };
> +
> +static umode_t region_extent_visible(struct kobject *kobj,
> +				     struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct region_extent *region_extent = to_region_extent(dev);
> +
> +	if (a == &dev_attr_tag.attr &&
> +	    uuid_equal(&region_extent->tag, &empty_tag))
> +		return 0;
> +
> +	return a->mode;
> +}
> +
> +static const struct attribute_group region_extent_attribute_group = {
> +	.attrs = region_extent_attrs,
> +	.is_visible = region_extent_visible,
> +};
> +
> +__ATTRIBUTE_GROUPS(region_extent_attribute);
> +
>  static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
>  				 struct cxled_extent *ed_extent)
>  {
> @@ -44,6 +101,7 @@ static void region_extent_release(struct device *dev)
>  static const struct device_type region_extent_type = {
>  	.name = "extent",
>  	.release = region_extent_release,
> +	.groups = region_extent_attribute_groups,
>  };
>  
>  bool is_region_extent(struct device *dev)
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 20/25] dax/bus: Factor out dev dax resize logic
  2024-08-16 14:44 ` [PATCH v3 20/25] dax/bus: Factor out dev dax resize logic Ira Weiny
@ 2024-08-19 22:35   ` Dave Jiang
  2024-08-27 13:26   ` Jonathan Cameron
  1 sibling, 0 replies; 120+ messages in thread
From: Dave Jiang @ 2024-08-19 22:35 UTC (permalink / raw)
  To: Ira Weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, Ira Weiny wrote:
> Dynamic Capacity regions must limit dev dax resources to those areas
> which have extents backing real memory.  Such DAX regions are dubbed
> 'sparse' regions.  In order to manage where memory is available four
> alternatives were considered:
> 
> 1) Create a single region resource child on region creation which
>    reserves the entire region.  Then as extents are added punch holes in
>    this reservation.  This requires new resource manipulation to punch
>    the holes and still requires an additional iteration over the extent
>    areas which may already have existing dev dax resources used.
> 
> 2) Maintain an ordered xarray of extents which can be queried while
>    processing the resize logic.  The issue is that existing region->res
>    children may artificially limit the allocation size sent to
>    alloc_dev_dax_range().  IE the resource children can't be directly
>    used in the resize logic to find where space in the region is.  This
>    also poses a problem of managing the available size in 2 places.
> 
> 3) Maintain a separate resource tree with extents.  This option is the
>    same as 2) but with the different data structure.  Most ideally there
>    should be a unified representation of the resource tree not two places
>    to look for space.
> 
> 4) Create region resource children for each extent.  Manage the dax dev
>    resize logic in the same way as before but use a region child
>    (extent) resource as the parents to find space within each extent.
> 
> Option 4 can leverage the existing resize algorithm to find space within
> the extents.  It manages the available space in a singular resource tree
> which is less complicated for finding space.
> 
> In preparation for this change, factor out the dev_dax_resize logic.
> For static regions use dax_region->res as the parent to find space for
> the dax ranges.  Future patches will use the same algorithm with
> individual extent resources as the parent.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
> Changes:
> [iweiny: Rebase on new DAX region locking]
> [iweiny: Reword commit message]
> [iweiny: Drop reviews]
> ---
>  drivers/dax/bus.c | 129 +++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 79 insertions(+), 50 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index d8cb5195a227..975860371d9f 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -844,11 +844,9 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
>  	return 0;
>  }
>  
> -static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
> -		resource_size_t size)
> +static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
> +			       u64 start, resource_size_t size)
>  {
> -	struct dax_region *dax_region = dev_dax->region;
> -	struct resource *res = &dax_region->res;
>  	struct device *dev = &dev_dax->dev;
>  	struct dev_dax_range *ranges;
>  	unsigned long pgoff = 0;
> @@ -866,14 +864,14 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
>  		return 0;
>  	}
>  
> -	alloc = __request_region(res, start, size, dev_name(dev), 0);
> +	alloc = __request_region(parent, start, size, dev_name(dev), 0);
>  	if (!alloc)
>  		return -ENOMEM;
>  
>  	ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
>  			* (dev_dax->nr_range + 1), GFP_KERNEL);
>  	if (!ranges) {
> -		__release_region(res, alloc->start, resource_size(alloc));
> +		__release_region(parent, alloc->start, resource_size(alloc));
>  		return -ENOMEM;
>  	}
>  
> @@ -1026,50 +1024,45 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
>  	return true;
>  }
>  
> -static ssize_t dev_dax_resize(struct dax_region *dax_region,
> -		struct dev_dax *dev_dax, resource_size_t size)
> +/**
> + * dev_dax_resize_static - Expand the device into the unused portion of the
> + * region. This may involve adjusting the end of an existing resource, or
> + * allocating a new resource.
> + *
> + * @parent: parent resource to allocate this range in
> + * @dev_dax: DAX device to be expanded
> + * @to_alloc: amount of space to alloc; must be <= space available in @parent
> + *
> + * Return the amount of space allocated or -ERRNO on failure
> + */
> +static ssize_t dev_dax_resize_static(struct resource *parent,
> +				     struct dev_dax *dev_dax,
> +				     resource_size_t to_alloc)
>  {
> -	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
> -	resource_size_t dev_size = dev_dax_size(dev_dax);
> -	struct resource *region_res = &dax_region->res;
> -	struct device *dev = &dev_dax->dev;
>  	struct resource *res, *first;
> -	resource_size_t alloc = 0;
>  	int rc;
>  
> -	if (dev->driver)
> -		return -EBUSY;
> -	if (size == dev_size)
> -		return 0;
> -	if (size > dev_size && size - dev_size > avail)
> -		return -ENOSPC;
> -	if (size < dev_size)
> -		return dev_dax_shrink(dev_dax, size);
> -
> -	to_alloc = size - dev_size;
> -	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
> -			"resize of %pa misaligned\n", &to_alloc))
> -		return -ENXIO;
> -
> -	/*
> -	 * Expand the device into the unused portion of the region. This
> -	 * may involve adjusting the end of an existing resource, or
> -	 * allocating a new resource.
> -	 */
> -retry:
> -	first = region_res->child;
> -	if (!first)
> -		return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
> +	first = parent->child;
> +	if (!first) {
> +		rc = alloc_dev_dax_range(parent, dev_dax,
> +					   parent->start, to_alloc);
> +		if (rc)
> +			return rc;
> +		return to_alloc;
> +	}
>  
> -	rc = -ENOSPC;
>  	for (res = first; res; res = res->sibling) {
>  		struct resource *next = res->sibling;
> +		resource_size_t alloc;
>  
>  		/* space at the beginning of the region */
> -		if (res == first && res->start > dax_region->res.start) {
> -			alloc = min(res->start - dax_region->res.start, to_alloc);
> -			rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc);
> -			break;
> +		if (res == first && res->start > parent->start) {
> +			alloc = min(res->start - parent->start, to_alloc);
> +			rc = alloc_dev_dax_range(parent, dev_dax,
> +						 parent->start, alloc);
> +			if (rc)
> +				return rc;
> +			return alloc;
>  		}
>  
>  		alloc = 0;
> @@ -1078,21 +1071,55 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
>  			alloc = min(next->start - (res->end + 1), to_alloc);
>  
>  		/* space at the end of the region */
> -		if (!alloc && !next && res->end < region_res->end)
> -			alloc = min(region_res->end - res->end, to_alloc);
> +		if (!alloc && !next && res->end < parent->end)
> +			alloc = min(parent->end - res->end, to_alloc);
>  
>  		if (!alloc)
>  			continue;
>  
>  		if (adjust_ok(dev_dax, res)) {
>  			rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
> -			break;
> +			if (rc)
> +				return rc;
> +			return alloc;
>  		}
> -		rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc);
> -		break;
> +		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
> +		if (rc)
> +			return rc;
> +		return alloc;
>  	}
> -	if (rc)
> -		return rc;
> +
> +	/* available was already calculated and should never be an issue */
> +	dev_WARN_ONCE(&dev_dax->dev, 1, "space not found?");
> +	return 0;
> +}
> +
> +static ssize_t dev_dax_resize(struct dax_region *dax_region,
> +		struct dev_dax *dev_dax, resource_size_t size)
> +{
> +	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
> +	resource_size_t dev_size = dev_dax_size(dev_dax);
> +	struct device *dev = &dev_dax->dev;
> +	resource_size_t alloc = 0;
> +
> +	if (dev->driver)
> +		return -EBUSY;
> +	if (size == dev_size)
> +		return 0;
> +	if (size > dev_size && size - dev_size > avail)
> +		return -ENOSPC;
> +	if (size < dev_size)
> +		return dev_dax_shrink(dev_dax, size);
> +
> +	to_alloc = size - dev_size;
> +	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
> +			"resize of %pa misaligned\n", &to_alloc))
> +		return -ENXIO;
> +
> +retry:
> +	alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
> +	if (alloc <= 0)
> +		return alloc;
>  	to_alloc -= alloc;
>  	if (to_alloc)
>  		goto retry;
> @@ -1198,7 +1225,8 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
>  
>  	to_alloc = range_len(&r);
>  	if (alloc_is_aligned(dev_dax, to_alloc))
> -		rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
> +		rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
> +					 to_alloc);
>  	up_write(&dax_dev_rwsem);
>  	up_write(&dax_region_rwsem);
>  
> @@ -1466,7 +1494,8 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
>  	device_initialize(dev);
>  	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
>  
> -	rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size);
> +	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
> +				 data->size);
>  	if (rc)
>  		goto err_range;
>  
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 21/25] dax/region: Create resources on sparse DAX regions
  2024-08-16 14:44 ` [PATCH v3 21/25] dax/region: Create resources on sparse DAX regions ira.weiny
  2024-08-18 11:38   ` Markus Elfring
@ 2024-08-19 23:30   ` Dave Jiang
  2024-08-23 14:28     ` Ira Weiny
  2024-08-27 14:12   ` Jonathan Cameron
  2 siblings, 1 reply; 120+ messages in thread
From: Dave Jiang @ 2024-08-19 23:30 UTC (permalink / raw)
  To: ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> DAX regions which map dynamic capacity partitions require that memory be
> allowed to come and go.  Recall sparse regions were created for this
> purpose.  Now that extents can be realized within DAX regions the DAX
> region driver can start tracking sub-resource information.
> 
> The tight relationship between DAX region operations and extent
> operations require memory changes to be controlled synchronously with
> the user of the region.  Synchronize through the dax_region_rwsem and by
> having the region driver drive both the region device as well as the
> extent sub-devices.
> 
> Recall requests to remove extents can happen at any time and that a host
> is not obligated to release the memory until it is not being used.  If
> an extent is not used allow a release response.
> 
> The DAX layer has no need for the details of the CXL memory extent
> devices.  Expose extents to the DAX layer as device children of the DAX
> region device.  A single callback from the driver aids the DAX layer to
> determine if the child device is an extent.  The DAX layer also
> registers a devres function to automatically clean up when the device is
> removed from the region.
> 
> There is a race between extents being surfaced and the dax_cxl driver
> being loaded.  The driver must therefore scan for any existing extents
> while still under the device lock.
> 
> Respond to extent notifications.  Manage the DAX region resource tree
> based on the extents lifetime.  Return the status of remove
> notifications to lower layers such that it can manage the hardware
> appropriately.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes:
> [iweiny: patch reorder]
> [iweiny: move hunks from other patches to clarify code changes and
>          add/release flows WRT dax regions]
> [iweiny: use %par]
> [iweiny: clean up variable names]
> [iweiny: Simplify sparse_ops]
> [Fan: avoid open coding range_len()]
> [djbw: s/reg_ext/region_extent]
> ---
>  drivers/cxl/core/extent.c |  76 +++++++++++++--
>  drivers/cxl/cxl.h         |   6 ++
>  drivers/dax/bus.c         | 243 +++++++++++++++++++++++++++++++++++++++++-----
>  drivers/dax/bus.h         |   3 +-
>  drivers/dax/cxl.c         |  63 +++++++++++-
>  drivers/dax/dax-private.h |  34 +++++++
>  drivers/dax/hmem/hmem.c   |   2 +-
>  drivers/dax/pmem.c        |   2 +-
>  8 files changed, 391 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index d7d526a51e2b..103b0bec3a4a 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -271,20 +271,67 @@ static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
>  	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
>  }
>  
> +static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> +			      struct region_extent *region_extent)
> +{
> +	struct cxl_dax_region *cxlr_dax;
> +	struct device *dev;
> +	int rc = 0;
> +
> +	cxlr_dax = cxlr->cxlr_dax;
> +	dev = &cxlr_dax->dev;
> +	dev_dbg(dev, "Trying notify: type %d HPA %par\n",
> +		event, &region_extent->hpa_range);
> +
> +	/*
> +	 * NOTE the lack of a driver indicates a notification has failed.  No
> +	 * user space coordiantion was possible.
> +	 */
> +	device_lock(dev);
> +	if (dev->driver) {
> +		struct cxl_driver *driver = to_cxl_drv(dev->driver);
> +		struct cxl_notify_data notify_data = (struct cxl_notify_data) {
> +			.event = event,
> +			.region_extent = region_extent,
> +		};
> +
> +		if (driver->notify) {
> +			dev_dbg(dev, "Notify: type %d HPA %par\n",
> +				event, &region_extent->hpa_range);
> +			rc = driver->notify(dev, &notify_data);
> +		}
> +	}
> +	device_unlock(dev);

Maybe a cleaner version:
	guard(device)(dev);
	if (!dev->driver || !dev->driver->notify)
		return 0;

	dev_dbg(...);
	return driver->notify(dev, &notify_data);


> +	return rc;
> +}
> +
> +struct rm_data {
> +	struct cxl_region *cxlr;
> +	struct range *range;
> +};
> +
>  static int cxlr_rm_extent(struct device *dev, void *data)
>  {
>  	struct region_extent *region_extent = to_region_extent(dev);
> -	struct range *region_hpa_range = data;
> +	struct rm_data *rm_data = data;
> +	int rc;
>  
>  	if (!region_extent)
>  		return 0;
>  
>  	/*
> -	 * Any extent which 'touches' the released range is removed.
> +	 * Any extent which 'touches' the released range is attempted to be
> +	 * removed.
>  	 */
> -	if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) {
> +	if (range_overlaps(rm_data->range, &region_extent->hpa_range)) {
> +		struct cxl_region *cxlr = rm_data->cxlr;
> +
>  		dev_dbg(dev, "Remove region extent HPA %par\n",
>  			&region_extent->hpa_range);
> +		rc = cxlr_notify_extent(cxlr, DCD_RELEASE_CAPACITY, region_extent);
> +		if (rc == -EBUSY)
> +			return 0;
> +		/* Extent not in use or error, remove it */
>  		region_rm_extent(region_extent);
>  	}
>  	return 0;
> @@ -312,8 +359,13 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
>  
>  	calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
>  
> +	struct rm_data rm_data = {
> +		.cxlr = cxlr,
> +		.range = &hpa_range,
> +	};
> +
>  	/* Remove region extents which overlap */
> -	return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
> +	return device_for_each_child(&cxlr->cxlr_dax->dev, &rm_data,
>  				     cxlr_rm_extent);
>  }
>  
> @@ -338,8 +390,20 @@ static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
>  		return rc;
>  	}
>  
> -	/* device model handles freeing region_extent */
> -	return online_region_extent(region_extent);
> +	rc = online_region_extent(region_extent);
> +	/* device model handled freeing region_extent */
> +	if (rc)
> +		return rc;
> +
> +	rc = cxlr_notify_extent(cxlr_dax->cxlr, DCD_ADD_CAPACITY, region_extent);
> +	/*
> +	 * The region device was breifly live but DAX layer ensures it was not
> +	 * used
> +	 */
> +	if (rc)
> +		region_rm_extent(region_extent);
> +
> +	return rc;
>  }
>  
>  /* Callers are expected to ensure cxled has been attached to a region */
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index c858e3957fd5..9abbfc68c6ad 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -916,10 +916,16 @@ bool is_cxl_region(struct device *dev);
>  
>  extern struct bus_type cxl_bus_type;
>  
> +struct cxl_notify_data {
> +	enum dc_event event;
> +	struct region_extent *region_extent;
> +};
> +
>  struct cxl_driver {
>  	const char *name;
>  	int (*probe)(struct device *dev);
>  	void (*remove)(struct device *dev);
> +	int (*notify)(struct device *dev, struct cxl_notify_data *notify_data);
>  	struct device_driver drv;
>  	int id;
>  };
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 975860371d9f..f14b0cfa7edd 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -183,6 +183,83 @@ static bool is_sparse(struct dax_region *dax_region)
>  	return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
>  }
>  
> +static void __dax_release_resource(struct dax_resource *dax_resource)
> +{
> +	struct dax_region *dax_region = dax_resource->region;
> +
> +	lockdep_assert_held_write(&dax_region_rwsem);
> +	dev_dbg(dax_region->dev, "Extent release resource %pr\n",
> +		dax_resource->res);
> +	if (dax_resource->res)
> +		__release_region(&dax_region->res, dax_resource->res->start,
> +				 resource_size(dax_resource->res));
> +	dax_resource->res = NULL;
> +}
> +
> +static void dax_release_resource(void *res)
> +{
> +	struct dax_resource *dax_resource = res;
> +
> +	guard(rwsem_write)(&dax_region_rwsem);
> +	__dax_release_resource(dax_resource);
> +	kfree(dax_resource);
> +}
> +
> +int dax_region_add_resource(struct dax_region *dax_region,
> +			    struct device *device,
> +			    resource_size_t start, resource_size_t length)
>
kdoc header?

 +{
> +	struct resource *new_resource;
> +	int rc;
> +
> +	struct dax_resource *dax_resource __free(kfree) =
> +				kzalloc(sizeof(*dax_resource), GFP_KERNEL);
> +	if (!dax_resource)
> +		return -ENOMEM;
> +
> +	guard(rwsem_write)(&dax_region_rwsem);
> +
> +	dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
> +	new_resource = __request_region(&dax_region->res, start, length, "extent", 0);
> +	if (!new_resource) {
> +		dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
> +			&start, &length);
> +		return -ENOSPC;
> +	}
> +
> +	dev_dbg(dax_region->dev, "add resource %pr\n", new_resource);
> +	dax_resource->region = dax_region;
> +	dax_resource->res = new_resource;
> +	dev_set_drvdata(device, dax_resource);
> +	rc = devm_add_action_or_reset(device, dax_release_resource,
> +				      no_free_ptr(dax_resource));
> +	/*  On error; ensure driver data is cleared under semaphore */
> +	if (rc)
> +		dev_set_drvdata(device, NULL);
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(dax_region_add_resource);
> +
> +int dax_region_rm_resource(struct dax_region *dax_region,
> +			   struct device *dev)

kdoc header
> +{
> +	struct dax_resource *dax_resource;
> +
> +	guard(rwsem_write)(&dax_region_rwsem);
> +
> +	dax_resource = dev_get_drvdata(dev);
> +	if (!dax_resource)
> +		return 0;
> +
> +	if (dax_resource->use_cnt)
> +		return -EBUSY;
> +
> +	/* avoid races with users trying to use the extent */
> +	__dax_release_resource(dax_resource);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(dax_region_rm_resource);
> +
>  bool static_dev_dax(struct dev_dax *dev_dax)
>  {
>  	return is_static(dev_dax->region);
> @@ -296,19 +373,44 @@ static ssize_t region_align_show(struct device *dev,
>  static struct device_attribute dev_attr_region_align =
>  		__ATTR(align, 0400, region_align_show, NULL);
>  
> +#define for_each_child_resource(extent, res) \
> +	for (res = (extent)->child; res; res = res->sibling)
> +
> +resource_size_t
> +dax_avail_size(struct resource *dax_resource)
kdoc header

DJ

> +{
> +	resource_size_t rc;
> +	struct resource *used_res;
> +
> +	rc = resource_size(dax_resource);
> +	for_each_child_resource(dax_resource, used_res)
> +		rc -= resource_size(used_res);
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(dax_avail_size);
> +
>  #define for_each_dax_region_resource(dax_region, res) \
>  	for (res = (dax_region)->res.child; res; res = res->sibling)
>  
>  static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
>  {
> -	resource_size_t size = resource_size(&dax_region->res);
> +	resource_size_t size;
>  	struct resource *res;
>  
>  	lockdep_assert_held(&dax_region_rwsem);
>  
> -	if (is_sparse(dax_region))
> -		return 0;
> +	if (is_sparse(dax_region)) {
> +		/*
> +		 * Children of a sparse region represent available space not
> +		 * used space.
> +		 */
> +		size = 0;
> +		for_each_dax_region_resource(dax_region, res)
> +			size += dax_avail_size(res);
> +		return size;
> +	}
>  
> +	size = resource_size(&dax_region->res);
>  	for_each_dax_region_resource(dax_region, res)
>  		size -= resource_size(res);
>  	return size;
> @@ -449,15 +551,26 @@ EXPORT_SYMBOL_GPL(kill_dev_dax);
>  static void trim_dev_dax_range(struct dev_dax *dev_dax)
>  {
>  	int i = dev_dax->nr_range - 1;
> -	struct range *range = &dev_dax->ranges[i].range;
> +	struct dev_dax_range *dev_range = &dev_dax->ranges[i];
> +	struct range *range = &dev_range->range;
>  	struct dax_region *dax_region = dev_dax->region;
> +	struct resource *res = &dax_region->res;
>  
>  	lockdep_assert_held_write(&dax_region_rwsem);
>  	dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
>  		(unsigned long long)range->start,
>  		(unsigned long long)range->end);
>  
> -	__release_region(&dax_region->res, range->start, range_len(range));
> +	if (dev_range->dax_resource) {
> +		res = dev_range->dax_resource->res;
> +		dev_dbg(&dev_dax->dev, "Trim sparse extent %pr\n", res);
> +	}
> +
> +	__release_region(res, range->start, range_len(range));
> +
> +	if (dev_range->dax_resource)
> +		dev_range->dax_resource->use_cnt--;
> +
>  	if (--dev_dax->nr_range == 0) {
>  		kfree(dev_dax->ranges);
>  		dev_dax->ranges = NULL;
> @@ -640,7 +753,7 @@ static void dax_region_unregister(void *region)
>  
>  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>  		struct range *range, int target_node, unsigned int align,
> -		unsigned long flags)
> +		unsigned long flags, struct dax_sparse_ops *sparse_ops)
>  {
>  	struct dax_region *dax_region;
>  
> @@ -658,12 +771,16 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>  			|| !IS_ALIGNED(range_len(range), align))
>  		return NULL;
>  
> +	if (!sparse_ops && (flags & IORESOURCE_DAX_SPARSE_CAP))
> +		return NULL;
> +
>  	dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
>  	if (!dax_region)
>  		return NULL;
>  
>  	dev_set_drvdata(parent, dax_region);
>  	kref_init(&dax_region->kref);
> +	dax_region->sparse_ops = sparse_ops;
>  	dax_region->id = region_id;
>  	dax_region->align = align;
>  	dax_region->dev = parent;
> @@ -845,7 +962,8 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
>  }
>  
>  static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
> -			       u64 start, resource_size_t size)
> +			       u64 start, resource_size_t size,
> +			       struct dax_resource *dax_resource)
>  {
>  	struct device *dev = &dev_dax->dev;
>  	struct dev_dax_range *ranges;
> @@ -884,6 +1002,7 @@ static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
>  			.start = alloc->start,
>  			.end = alloc->end,
>  		},
> +		.dax_resource = dax_resource,
>  	};
>  
>  	dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
> @@ -966,7 +1085,8 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
>  	int i;
>  
>  	for (i = dev_dax->nr_range - 1; i >= 0; i--) {
> -		struct range *range = &dev_dax->ranges[i].range;
> +		struct dev_dax_range *dev_range = &dev_dax->ranges[i];
> +		struct range *range = &dev_range->range;
>  		struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
>  		struct resource *adjust = NULL, *res;
>  		resource_size_t shrink;
> @@ -982,12 +1102,21 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
>  			continue;
>  		}
>  
> -		for_each_dax_region_resource(dax_region, res)
> -			if (strcmp(res->name, dev_name(dev)) == 0
> -					&& res->start == range->start) {
> -				adjust = res;
> -				break;
> -			}
> +		if (dev_range->dax_resource) {
> +			for_each_child_resource(dev_range->dax_resource->res, res)
> +				if (strcmp(res->name, dev_name(dev)) == 0
> +						&& res->start == range->start) {
> +					adjust = res;
> +					break;
> +				}
> +		} else {
> +			for_each_dax_region_resource(dax_region, res)
> +				if (strcmp(res->name, dev_name(dev)) == 0
> +						&& res->start == range->start) {
> +					adjust = res;
> +					break;
> +				}
> +		}
>  
>  		if (dev_WARN_ONCE(dev, !adjust || i != dev_dax->nr_range - 1,
>  					"failed to find matching resource\n"))
> @@ -1025,19 +1154,21 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
>  }
>  
>  /**
> - * dev_dax_resize_static - Expand the device into the unused portion of the
> - * region. This may involve adjusting the end of an existing resource, or
> - * allocating a new resource.
> + * __dev_dax_resize - Expand the device into the unused portion of the region.
> + * This may involve adjusting the end of an existing resource, or allocating a
> + * new resource.
>   *
>   * @parent: parent resource to allocate this range in
>   * @dev_dax: DAX device to be expanded
>   * @to_alloc: amount of space to alloc; must be <= space available in @parent
> + * @dax_resource: if sparse; the parent resource
>   *
>   * Return the amount of space allocated or -ERRNO on failure
>   */
> -static ssize_t dev_dax_resize_static(struct resource *parent,
> -				     struct dev_dax *dev_dax,
> -				     resource_size_t to_alloc)
> +static ssize_t __dev_dax_resize(struct resource *parent,
> +				struct dev_dax *dev_dax,
> +				resource_size_t to_alloc,
> +				struct dax_resource *dax_resource)
>  {
>  	struct resource *res, *first;
>  	int rc;
> @@ -1045,7 +1176,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
>  	first = parent->child;
>  	if (!first) {
>  		rc = alloc_dev_dax_range(parent, dev_dax,
> -					   parent->start, to_alloc);
> +					   parent->start, to_alloc,
> +					   dax_resource);
>  		if (rc)
>  			return rc;
>  		return to_alloc;
> @@ -1059,7 +1191,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
>  		if (res == first && res->start > parent->start) {
>  			alloc = min(res->start - parent->start, to_alloc);
>  			rc = alloc_dev_dax_range(parent, dev_dax,
> -						 parent->start, alloc);
> +						 parent->start, alloc,
> +						 dax_resource);
>  			if (rc)
>  				return rc;
>  			return alloc;
> @@ -1083,7 +1216,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
>  				return rc;
>  			return alloc;
>  		}
> -		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
> +		rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc,
> +					 dax_resource);
>  		if (rc)
>  			return rc;
>  		return alloc;
> @@ -1094,6 +1228,54 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
>  	return 0;
>  }
>  
> +static ssize_t dev_dax_resize_static(struct dax_region *dax_region,
> +				     struct dev_dax *dev_dax,
> +				     resource_size_t to_alloc)
> +{
> +	return __dev_dax_resize(&dax_region->res, dev_dax, to_alloc, NULL);
> +}
> +
> +static int find_free_extent(struct device *dev, void *data)
> +{
> +	struct dax_region *dax_region = data;
> +	struct dax_resource *dax_resource;
> +
> +	if (!dax_region->sparse_ops->is_extent(dev))
> +		return 0;
> +
> +	dax_resource = dev_get_drvdata(dev);
> +	if (!dax_resource || !dax_avail_size(dax_resource->res))
> +		return 0;
> +	return 1;
> +}
> +
> +static ssize_t dev_dax_resize_sparse(struct dax_region *dax_region,
> +				     struct dev_dax *dev_dax,
> +				     resource_size_t to_alloc)
> +{
> +	struct dax_resource *dax_resource;
> +	resource_size_t available_size;
> +	struct device *extent_dev;
> +	ssize_t alloc;
> +
> +	extent_dev = device_find_child(dax_region->dev, dax_region,
> +				       find_free_extent);
> +	if (!extent_dev)
> +		return 0;
> +
> +	dax_resource = dev_get_drvdata(extent_dev);
> +	if (!dax_resource)
> +		return 0;
> +
> +	available_size = dax_avail_size(dax_resource->res);
> +	to_alloc = min(available_size, to_alloc);
> +	alloc = __dev_dax_resize(dax_resource->res, dev_dax, to_alloc, dax_resource);
> +	if (alloc > 0)
> +		dax_resource->use_cnt++;
> +	put_device(extent_dev);
> +	return alloc;
> +}
> +
>  static ssize_t dev_dax_resize(struct dax_region *dax_region,
>  		struct dev_dax *dev_dax, resource_size_t size)
>  {
> @@ -1117,7 +1299,10 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
>  		return -ENXIO;
>  
>  retry:
> -	alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
> +	if (is_sparse(dax_region))
> +		alloc = dev_dax_resize_sparse(dax_region, dev_dax, to_alloc);
> +	else
> +		alloc = dev_dax_resize_static(dax_region, dev_dax, to_alloc);
>  	if (alloc <= 0)
>  		return alloc;
>  	to_alloc -= alloc;
> @@ -1226,7 +1411,7 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
>  	to_alloc = range_len(&r);
>  	if (alloc_is_aligned(dev_dax, to_alloc))
>  		rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
> -					 to_alloc);
> +					 to_alloc, NULL);
>  	up_write(&dax_dev_rwsem);
>  	up_write(&dax_region_rwsem);
>  
> @@ -1494,8 +1679,14 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
>  	device_initialize(dev);
>  	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
>  
> +	if (is_sparse(dax_region) && data->size) {
> +		dev_err(parent, "Sparse DAX region devices are created initially with 0 size");
> +		rc = -EINVAL;
> +		goto err_id;
> +	}
> +
>  	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
> -				 data->size);
> +				 data->size, NULL);
>  	if (rc)
>  		goto err_range;
>  
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 783bfeef42cc..ae5029ea6047 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -9,6 +9,7 @@ struct dev_dax;
>  struct resource;
>  struct dax_device;
>  struct dax_region;
> +struct dax_sparse_ops;
>  
>  /* dax bus specific ioresource flags */
>  #define IORESOURCE_DAX_STATIC BIT(0)
> @@ -17,7 +18,7 @@ struct dax_region;
>  
>  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>  		struct range *range, int target_node, unsigned int align,
> -		unsigned long flags);
> +		unsigned long flags, struct dax_sparse_ops *sparse_ops);
>  
>  struct dev_dax_data {
>  	struct dax_region *dax_region;
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 367e86b1c22a..bf3b82b0120d 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -5,6 +5,60 @@
>  
>  #include "../cxl/cxl.h"
>  #include "bus.h"
> +#include "dax-private.h"
> +
> +static int __cxl_dax_add_resource(struct dax_region *dax_region,
> +				  struct region_extent *region_extent)
> +{
> +	resource_size_t start, length;
> +	struct device *dev;
> +
> +	dev = &region_extent->dev;
> +	start = dax_region->res.start + region_extent->hpa_range.start;
> +	length = range_len(&region_extent->hpa_range);
> +	return dax_region_add_resource(dax_region, dev, start, length);
> +}
> +
> +static int cxl_dax_add_resource(struct device *dev, void *data)
> +{
> +	struct dax_region *dax_region = data;
> +	struct region_extent *region_extent;
> +
> +	region_extent = to_region_extent(dev);
> +	if (!region_extent)
> +		return 0;
> +
> +	dev_dbg(dax_region->dev, "Adding resource HPA %par\n",
> +		&region_extent->hpa_range);
> +
> +	return __cxl_dax_add_resource(dax_region, region_extent);
> +}
> +
> +static int cxl_dax_region_notify(struct device *dev,
> +				 struct cxl_notify_data *notify_data)
> +{
> +	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
> +	struct dax_region *dax_region = dev_get_drvdata(dev);
> +	struct region_extent *region_extent = notify_data->region_extent;
> +
> +	switch (notify_data->event) {
> +	case DCD_ADD_CAPACITY:
> +		return __cxl_dax_add_resource(dax_region, region_extent);
> +	case DCD_RELEASE_CAPACITY:
> +		return dax_region_rm_resource(dax_region, &region_extent->dev);
> +	case DCD_FORCED_CAPACITY_RELEASE:
> +	default:
> +		dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
> +			notify_data->event);
> +		break;
> +	}
> +
> +	return -ENXIO;
> +}
> +
> +struct dax_sparse_ops sparse_ops = {
> +	.is_extent = is_region_extent,
> +};
>  
>  static int cxl_dax_region_probe(struct device *dev)
>  {
> @@ -24,14 +78,16 @@ static int cxl_dax_region_probe(struct device *dev)
>  		flags |= IORESOURCE_DAX_SPARSE_CAP;
>  
>  	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> -				      PMD_SIZE, flags);
> +				      PMD_SIZE, flags, &sparse_ops);
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> -	if (cxlr->mode == CXL_REGION_DC)
> +	if (cxlr->mode == CXL_REGION_DC) {
> +		device_for_each_child(&cxlr_dax->dev, dax_region,
> +				      cxl_dax_add_resource);
>  		/* Add empty seed dax device */
>  		dev_size = 0;
> -	else
> +	} else
>  		dev_size = range_len(&cxlr_dax->hpa_range);
>  
>  	data = (struct dev_dax_data) {
> @@ -47,6 +103,7 @@ static int cxl_dax_region_probe(struct device *dev)
>  static struct cxl_driver cxl_dax_region_driver = {
>  	.name = "cxl_dax_region",
>  	.probe = cxl_dax_region_probe,
> +	.notify = cxl_dax_region_notify,
>  	.id = CXL_DEVICE_DAX_REGION,
>  	.drv = {
>  		.suppress_bind_attrs = true,
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index ccde98c3d4e2..9e9f98c85620 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -16,6 +16,36 @@ struct inode *dax_inode(struct dax_device *dax_dev);
>  int dax_bus_init(void);
>  void dax_bus_exit(void);
>  
> +/**
> + * struct dax_resource - For sparse regions; an active resource
> + * @region: dax_region this resources is in
> + * @res: resource
> + * @use_cnt: count the number of uses of this resource
> + *
> + * Changes to the dax_reigon and the dax_resources within it are protected by
> + * dax_region_rwsem
> + */
> +struct dax_resource {
> +	struct dax_region *region;
> +	struct resource *res;
> +	unsigned int use_cnt;
> +};
> +int dax_region_add_resource(struct dax_region *dax_region, struct device *dev,
> +			    resource_size_t start, resource_size_t length);
> +int dax_region_rm_resource(struct dax_region *dax_region,
> +			   struct device *dev);
> +resource_size_t dax_avail_size(struct resource *dax_resource);
> +
> +typedef int (*match_cb)(struct device *dev, resource_size_t *size_avail);
> +
> +/**
> + * struct dax_sparse_ops - Operations for sparse regions
> + * @is_extent: return if the device is an extent
> + */
> +struct dax_sparse_ops {
> +	bool (*is_extent)(struct device *dev);
> +};
> +
>  /**
>   * struct dax_region - mapping infrastructure for dax devices
>   * @id: kernel-wide unique region for a memory range
> @@ -27,6 +57,7 @@ void dax_bus_exit(void);
>   * @res: resource tree to track instance allocations
>   * @seed: allow userspace to find the first unbound seed device
>   * @youngest: allow userspace to find the most recently created device
> + * @sparse_ops: operations required for sparse regions
>   */
>  struct dax_region {
>  	int id;
> @@ -38,6 +69,7 @@ struct dax_region {
>  	struct resource res;
>  	struct device *seed;
>  	struct device *youngest;
> +	struct dax_sparse_ops *sparse_ops;
>  };
>  
>  struct dax_mapping {
> @@ -62,6 +94,7 @@ struct dax_mapping {
>   * @pgoff: page offset
>   * @range: resource-span
>   * @mapping: device to assist in interrogating the range layout
> + * @dax_resource: if not NULL; dax sparse resource containing this range
>   */
>  struct dev_dax {
>  	struct dax_region *region;
> @@ -79,6 +112,7 @@ struct dev_dax {
>  		unsigned long pgoff;
>  		struct range range;
>  		struct dax_mapping *mapping;
> +		struct dax_resource *dax_resource;
>  	} *ranges;
>  };
>  
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 5e7c53f18491..0eea65052874 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -28,7 +28,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
>  
>  	mri = dev->platform_data;
>  	dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
> -				      mri->target_node, PMD_SIZE, flags);
> +				      mri->target_node, PMD_SIZE, flags, NULL);
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
> index c8ebf4e281f2..f927e855f240 100644
> --- a/drivers/dax/pmem.c
> +++ b/drivers/dax/pmem.c
> @@ -54,7 +54,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
>  	range.start += offset;
>  	dax_region = alloc_dax_region(dev, region_id, &range,
>  			nd_region->target_node, le32_to_cpu(pfn_sb->align),
> -			IORESOURCE_DAX_STATIC);
> +			IORESOURCE_DAX_STATIC, NULL);
>  	if (!dax_region)
>  		return ERR_PTR(-ENOMEM);
>  
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 22/25] cxl/region: Read existing extents on region creation
  2024-08-16 14:44 ` [PATCH v3 22/25] cxl/region: Read existing extents on region creation ira.weiny
@ 2024-08-20  0:06   ` Dave Jiang
  2024-08-23 21:31     ` Ira Weiny
  2024-08-27 14:19   ` Jonathan Cameron
  2024-09-05 19:35   ` Fan Ni
  2 siblings, 1 reply; 120+ messages in thread
From: Dave Jiang @ 2024-08-20  0:06 UTC (permalink / raw)
  To: ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Dynamic capacity device extents may be left in an accepted state on a
> device due to an unexpected host crash.  In this case it is expected
> that the creation of a new region on top of a DC partition can read
> those extents and surface them for continued use.
> 
> Once all endpoint decoders are part of a region and the region is being
> realized a read of the devices extent list can reveal these previously
> accepted extents.

Once all endpoint decoders are part of a region and the region is being
realized, a read of the 'devices extend list' can reveal these previously
accepted extents.

> 
> CXL r3.1 specifies the mailbox call Get Dynamic Capacity Extent List for
> this purpose.  The call returns all the extents for all dynamic capacity
> partitions.  If the fabric manager is adding extents to any DCD
> partition, the extent list for the recovered region may change.  In this
> case the query must retry.  Upon retry the query could encounter extents
> which were accepted on a previous list query.  Adding such extents is
> ignored without error because they are entirely within a previous
> accepted extent.
> 
> The scan for existing extents races with the dax_cxl driver.  This is
> synchronized through the region device lock.  Extents which are found
> after the driver has loaded will surface through the normal notification
> path while extents seen prior to the driver are read during driver load.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes:
> [iweiny: Leverage the new add path from the event processing code such
> 	 that the adding and surfacing of extents flows through the same
> 	 code path for both event processing and existing extents.
> 	 While this does validate existing extents again on start up
> 	 this is an error recovery case / new boot scenario and should
> 	 not cause any major issues while making the code more
> 	 straight forward and maintainable.]
> 
> [iweiny: use %par]
> [iweiny: rebase]
> [iweiny: Move this patch later in the series such that the realization
>          of extents can go through the same path as an add event]
> [Fan: Issue a retry if the gen number changes]
> [djiang: s/uint64_t/u64/]
> [djiang: update function names]
> [Jørgen/djbw: read the generation and total count on first iteration of
>               the Get Extent List call]
> [djbw: s/cxl_mbox_get_dc_extent_in/cxl_mbox_get_extent_in/]
> [djbw: s/cxl_mbox_get_dc_extent_out/cxl_mbox_get_extent_out/]
> [djbw/iweiny: s/cxl_read_dc_extents/cxl_read_extent_list]
> ---
>  drivers/cxl/core/core.h   |   2 +
>  drivers/cxl/core/mbox.c   | 100 ++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/region.c |  12 ++++++
>  drivers/cxl/cxlmem.h      |  21 ++++++++++
>  4 files changed, 135 insertions(+)
> 
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 8dfc97b2e0a4..9e54064a6f48 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -21,6 +21,8 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
>  	return container_of(cxlds, struct cxl_memdev_state, cxlds);
>  }
>  
> +void cxl_read_extent_list(struct cxl_endpoint_decoder *cxled);
> +
>  #ifdef CONFIG_CXL_REGION
>  extern struct device_attribute dev_attr_create_pmem_region;
>  extern struct device_attribute dev_attr_create_ram_region;
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index f629ad7488ac..d43ac8eabf56 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1670,6 +1670,106 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>  
> +/* Return -EAGAIN if the extent list changes while reading */
> +static int __cxl_read_extent_list(struct cxl_endpoint_decoder *cxled)
> +{
> +	u32 current_index, total_read, total_expected, initial_gen_num;
> +	struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	u32 max_extent_count;
> +	bool first = true;
> +
> +	struct cxl_mbox_get_extent_out *extents __free(kfree) =
> +				kvmalloc(mds->payload_size, GFP_KERNEL);
> +	if (!extents)
> +		return -ENOMEM;
> +
> +	total_read = 0;
> +	current_index = 0;
> +	total_expected = 0;
> +	max_extent_count = (mds->payload_size - sizeof(*extents)) /
> +				sizeof(struct cxl_extent);
> +	do {
> +		struct cxl_mbox_get_extent_in get_extent;
> +		u32 nr_returned, current_total, current_gen_num;
> +		int rc;
> +
> +		get_extent = (struct cxl_mbox_get_extent_in) {
> +			.extent_cnt = max(max_extent_count,
> +					  total_expected - current_index),
> +			.start_extent_index = cpu_to_le32(current_index),
> +		};
> +
> +		mbox_cmd = (struct cxl_mbox_cmd) {
> +			.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +			.payload_in = &get_extent,
> +			.size_in = sizeof(get_extent),
> +			.size_out = mds->payload_size,
> +			.payload_out = extents,
> +			.min_out = 1,
> +		};
> +
> +		rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +		if (rc < 0)
> +			return rc;
> +
> +		/* Save initial data */
> +		if (first) {
> +			total_expected = le32_to_cpu(extents->total_extent_count);
> +			initial_gen_num = le32_to_cpu(extents->generation_num);
> +			first = false;
> +		}
> +
> +		nr_returned = le32_to_cpu(extents->returned_extent_count);
> +		total_read += nr_returned;
> +		current_total = le32_to_cpu(extents->total_extent_count);
> +		current_gen_num = le32_to_cpu(extents->generation_num);
> +
> +		dev_dbg(dev, "Got extent list %d-%d of %d generation Num:%d\n",
> +			current_index, total_read - 1, current_total, current_gen_num);
> +
> +		if (current_gen_num != initial_gen_num || total_expected != current_total) {
> +			dev_dbg(dev, "Extent list change detected; gen %u != %u : cnt %u != %u\n",
> +				current_gen_num, initial_gen_num,
> +				total_expected, current_total);
> +			return -EAGAIN;
> +		}
> +
> +		for (int i = 0; i < nr_returned ; i++) {
> +			struct cxl_extent *extent = &extents->extent[i];
> +
> +			dev_dbg(dev, "Processing extent %d/%d\n",
> +				current_index + i, total_expected);
> +
> +			rc = validate_add_extent(mds, extent);
> +			if (rc)
> +				continue;
> +		}
> +
> +		current_index += nr_returned;
> +	} while (total_expected > total_read);
> +
> +	return 0;
> +}
> +
> +/**
> + * cxl_read_extent_list() - Read existing extents
> + * @cxled: Endpoint decoder which is part of a region
> + *
> + * Issue the Get Dynamic Capacity Extent List command to the device
> + * and add existing extents if found.
> + */
> +void cxl_read_extent_list(struct cxl_endpoint_decoder *cxled)

cxl_process_extend_list()? It seems to do read+validate+add. 

> +{
> +	int retry = 10;

arbitrary retry number? maybe define it?

DJ

> +	int rc;
> +
> +	do {
> +		rc = __cxl_read_extent_list(cxled);
> +	} while (rc == -EAGAIN && retry--);
> +}
> +
>  static int add_dpa_res(struct device *dev, struct resource *parent,
>  		       struct resource *res, resource_size_t start,
>  		       resource_size_t size, const char *type)
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 8c9171f914fb..885fb3004784 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3190,6 +3190,15 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
>  	return rc;
>  }
>  
> +static void cxlr_add_existing_extents(struct cxl_region *cxlr)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +	int i;
> +
> +	for (i = 0; i < p->nr_targets; i++)
> +		cxl_read_extent_list(p->targets[i]);
> +}
> +
>  static void cxlr_dax_unregister(void *_cxlr_dax)
>  {
>  	struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> @@ -3227,6 +3236,9 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>  	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
>  		dev_name(dev));
>  
> +	if (cxlr->mode == CXL_REGION_DC)
> +		cxlr_add_existing_extents(cxlr);
> +
>  	return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
>  					cxlr_dax);
>  err:
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 3a40fe1f0be7..11c03637488d 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -624,6 +624,27 @@ struct cxl_mbox_dc_response {
>  	} __packed extent_list[];
>  } __packed;
>  
> +/*
> + * Get Dynamic Capacity Extent List; Input Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
> + */
> +struct cxl_mbox_get_extent_in {
> +	__le32 extent_cnt;
> +	__le32 start_extent_index;
> +} __packed;
> +
> +/*
> + * Get Dynamic Capacity Extent List; Output Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
> + */
> +struct cxl_mbox_get_extent_out {
> +	__le32 returned_extent_count;
> +	__le32 total_extent_count;
> +	__le32 generation_num;
> +	u8 rsvd[4];
> +	struct cxl_extent extent[];
> +} __packed;
> +
>  struct cxl_mbox_get_supported_logs {
>  	__le16 entries;
>  	u8 rsvd[6];
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-16 14:44 ` [PATCH v3 02/25] printk: Add print format (%par) for struct range Ira Weiny
@ 2024-08-20 14:08   ` Petr Mladek
  2024-08-22 17:53     ` Ira Weiny
  0 siblings, 1 reply; 120+ messages in thread
From: Petr Mladek @ 2024-08-20 14:08 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri 2024-08-16 09:44:10, Ira Weiny wrote:
> The use of struct range in the CXL subsystem is growing.  In particular,
> the addition of Dynamic Capacity devices uses struct range in a number
> of places which are reported in debug and error messages.
> 
> To wit requiring the printing of the start/end fields in each print
> became cumbersome.  Dan Williams mentions in [1] that it might be time
> to have a print specifier for struct range similar to struct resource
> 
> A few alternatives were considered including '%pn' for 'print raNge' but
> %par follows that struct range is most often used to store a range of
> physical addresses.  So use '%par' for 'print address range'.
> 
> --- a/Documentation/core-api/printk-formats.rst
> +++ b/Documentation/core-api/printk-formats.rst
> @@ -231,6 +231,20 @@ width of the CPU data path.
>  
>  Passed by reference.
>  
> +Struct Range
> +------------
> +
> +::
> +
> +	%par	[range 0x60000000-0x6fffffff] or

It seems that it is always 64-bit. It prints:

struct range {
	u64   start;
	u64   end;
};

> +		[range 0x0000000060000000-0x000000006fffffff]
> +
> +For printing struct range.  A variation of printing a physical address is to
> +print the value of struct range which are often used to hold a physical address
> +range.
> +
> +Passed by reference.
> +
>  DMA address types dma_addr_t
>  ----------------------------
>  
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index 2d71b1115916..c132178fac07 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -1140,6 +1140,39 @@ char *resource_string(char *buf, char *end, struct resource *res,
>  	return string_nocheck(buf, end, sym, spec);
>  }
>  
> +static noinline_for_stack
> +char *range_string(char *buf, char *end, const struct range *range,
> +		      struct printf_spec spec, const char *fmt)
> +{
> +#define RANGE_PRINTK_SIZE		16
> +#define RANGE_DECODED_BUF_SIZE		((2 * sizeof(struct range)) + 4)
> +#define RANGE_PRINT_BUF_SIZE		sizeof("[range - ]")

I think that it should be "[range -]"

> +	char sym[RANGE_DECODED_BUF_SIZE + RANGE_PRINT_BUF_SIZE];
> +	char *p = sym, *pend = sym + sizeof(sym);
> +
> +	static const struct printf_spec str_spec = {
> +		.field_width = -1,
> +		.precision = 10,
> +		.flags = LEFT,
> +	};

Is this really needed? What about using "default_str_spec" instead?

> +	static const struct printf_spec range_spec = {
> +		.base = 16,
> +		.field_width = RANGE_PRINTK_SIZE,
> +		.precision = -1,
> +		.flags = SPECIAL | SMALL | ZEROPAD,
> +	};
> +
> +	*p++ = '[';
> +	p = string_nocheck(p, pend, "range ", str_spec);
> +	p = number(p, pend, range->start, range_spec);
> +	*p++ = '-';
> +	p = number(p, pend, range->end, range_spec);
> +	*p++ = ']';
> +	*p = '\0';
> +
> +	return string_nocheck(buf, end, sym, spec);
> +}
> +
>  static noinline_for_stack
>  char *hex_string(char *buf, char *end, u8 *addr, struct printf_spec spec,
>  		 const char *fmt)

Also add a selftest into lib/test_printf.c, please.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 06/25] cxl/mem: Read dynamic capacity configuration from the device
  2024-08-16 21:45   ` Dave Jiang
@ 2024-08-20 17:01     ` Fan Ni
  2024-08-23  2:01       ` Ira Weiny
  2024-08-23  2:02       ` Ira Weiny
  0 siblings, 2 replies; 120+ messages in thread
From: Fan Ni @ 2024-08-20 17:01 UTC (permalink / raw)
  To: Dave Jiang
  Cc: ira.weiny, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm, Li, Ming

On Fri, Aug 16, 2024 at 02:45:47PM -0700, Dave Jiang wrote:
> 
> 
> On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > Devices which optionally support Dynamic Capacity (DC) are configured
> > via mailbox commands.  CXL 3.1 requires the host to issue the Get DC
> > Configuration command in order to properly configure DCDs.  Without the
> > Get DC Configuration command DCD can't be supported.
> > 
> > Implement the DC mailbox commands as specified in CXL 3.1 section
> > 8.2.9.9.9 (opcodes 48XXh) to read and store the DCD configuration
> > information.  Disable DCD if DCD is not supported.  Leverage the Get DC
> > Configuration command supported bit to indicate if DCD support.
> > 
> > Linux has no use for the trailing fields of the Get Dynamic Capacity
> > Configuration Output Payload (Total number of supported extents, number
> > of available extents, total number of supported tags, and number of
> > available tags).  Avoid defining those fields to use the more useful
> > dynamic C array.
> > 
> > Cc: "Li, Ming" <ming4.li@intel.com>
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes:
> > [Li, Ming: Fix bug in total_bytes calculation]
> > [iweiny: update commit message]
> > [Jonathan: fix formatting]
> > [Jonathan: Define block line size]
> > [Jonathan/Fan: use regions returned field instead of macro in get config]
> > [Jørgen: Rename memdev state range variables]
> > [Jonathan: adjust use of rc in cxl_dev_dynamic_capacity_identify()]
> > [Jonathan: white space cleanup]
> > [fan: make a comment about the trailing configuration output fields]
> > ---
> >  drivers/cxl/core/mbox.c | 171 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  drivers/cxl/cxlmem.h    |  64 +++++++++++++++++-
> >  drivers/cxl/pci.c       |   4 ++
> >  3 files changed, 237 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 8eb196858abe..68c26c4be91a 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -1157,7 +1157,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
> >  	if (rc < 0)
> >  		return rc;
> >  
> > -	mds->total_bytes =
> > +	mds->static_bytes =
> >  		le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
> >  	mds->volatile_only_bytes =
> >  		le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> > @@ -1264,6 +1264,159 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
> >  	return rc;
> >  }
> >  
> > +static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, u8 index,
> > +				   struct cxl_dc_region_config *region_config)
> > +{
> > +	struct cxl_dc_region_info *dcr = &mds->dc_region[index];
> > +	struct device *dev = mds->cxlds.dev;
> > +
> > +	dcr->base = le64_to_cpu(region_config->region_base);
> > +	dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
> > +	dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> > +	dcr->len = le64_to_cpu(region_config->region_length);
> > +	dcr->blk_size = le64_to_cpu(region_config->region_block_size);
> > +	dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
> > +	dcr->flags = region_config->flags;
> > +	snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
> > +
> > +	/* Check regions are in increasing DPA order */
> > +	if (index > 0) {
> > +		struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
> > +
> > +		if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
> > +			dev_err(dev,
> > +				"DPA ordering violation for DC region %d and %d\n",
> > +				index - 1, index);
> > +			return -EINVAL;
> > +		}
> > +	}
> > +
> > +	if (!IS_ALIGNED(dcr->base, SZ_256M) ||
> > +	    !IS_ALIGNED(dcr->base, dcr->blk_size)) {
> > +		dev_err(dev, "DC region %d invalid base %#llx blk size %#llx\n",
> > +			index, dcr->base, dcr->blk_size);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (dcr->decode_len == 0 || dcr->len == 0 || dcr->decode_len < dcr->len ||
> > +	    !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> > +		dev_err(dev, "DC region %d invalid length; decode %#llx len %#llx blk size %#llx\n",
> > +			index, dcr->decode_len, dcr->len, dcr->blk_size);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (dcr->blk_size == 0 || dcr->blk_size % CXL_DCD_BLOCK_LINE_SIZE ||
> > +	    !is_power_of_2(dcr->blk_size)) {
> > +		dev_err(dev, "DC region %d invalid block size; %#llx\n",
> > +			index, dcr->blk_size);
> > +		return -EINVAL;
> > +	}
> > +
> > +	dev_dbg(dev,
> > +		"DC region %s base %#llx length %#llx block size %#llx\n",
> > +		dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> > +
> > +	return 0;
> > +}
> > +
> > +/* Returns the number of regions in dc_resp or -ERRNO */
> > +static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
> > +			     struct cxl_mbox_get_dc_config_out *dc_resp,
> > +			     size_t dc_resp_size)
> > +{
> > +	struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
> > +		.region_count = CXL_MAX_DC_REGION,
> > +		.start_region_index = start_region,
> > +	};
> > +	struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> > +		.opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> > +		.payload_in = &get_dc,
> > +		.size_in = sizeof(get_dc),
> > +		.size_out = dc_resp_size,
> > +		.payload_out = dc_resp,
> > +		.min_out = 1,
> > +	};
> > +	struct device *dev = mds->cxlds.dev;
> > +	int rc;
> > +
> > +	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > +	if (rc < 0)
> > +		return rc;
> > +
> > +	dev_dbg(dev, "Read %d/%d DC regions\n",
> > +		dc_resp->regions_returned, dc_resp->avail_region_count);
> > +	return dc_resp->regions_returned;
> > +}
> > +
> > +/**
> > + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> > + *					 information from the device.
> > + * @mds: The memory device state
> > + *
> > + * Read Dynamic Capacity information from the device and populate the state
> > + * structures for later use.
> > + *
> > + * Return: 0 if identify was executed successfully, -ERRNO on error.
> > + */
> > +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> > +{
> > +	size_t dc_resp_size = mds->payload_size;
> > +	struct device *dev = mds->cxlds.dev;
> > +	u8 start_region, i;
> > +
> > +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> > +		snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> > +
> > +	if (!cxl_dcd_supported(mds)) {
> > +		dev_dbg(dev, "DCD not supported\n");
> > +		return 0;
> > +	}
> 
> This should happen before you pre-format the name string? I would assume that if DCD is not supported then the dcd name sysfs attribs would be not be visible?
> 
> > +
> > +	struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
> > +					kvmalloc(dc_resp_size, GFP_KERNEL);
> > +	if (!dc_resp)
> > +		return -ENOMEM;
> > +
> > +	start_region = 0;
> > +	do {
> > +		int rc, j;
> > +
> > +		rc = cxl_get_dc_config(mds, start_region, dc_resp, dc_resp_size);
> > +		if (rc < 0) {
> > +			dev_dbg(dev, "Failed to get DC config: %d\n", rc);
> > +			return rc;
> > +		}
> > +
> > +		mds->nr_dc_region += rc;
> > +
> > +		if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> > +			dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> > +				mds->nr_dc_region);
> > +			return -EINVAL;
> > +		}
> > +
> > +		for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
> 
> This should be 'j < mds->nr_dc_region'? Otherwise if your start region say is '3' and you have '2' DC regions, you never enter the loop. Or does that not happen? I also wonder if you need to check if 'start_region + mds->nr_dc_region > CXL_MAX_DC_REGION'.
> 
That can not happen, start_region was updated to the number of regions
has returned till now (not counting the current call), while
nr_dc_region is the total number of regions returned till now (including
the current call) as we update it above, so start_region should never be larger
than nr_dc_region.

> > +			rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> > +			if (rc) {
> > +				dev_dbg(dev, "Failed to save region info: %d\n", rc);

I am not sure why we sometimes use dev_err and sometimes we use dev_dbg
here, if dcd is supported, error from getting dc configuration is an
error to me.

Fan

> > +				return rc;
> > +			}
> > +		}
> > +
> > +		start_region = mds->nr_dc_region;
> > +
> > +	} while (mds->nr_dc_region < dc_resp->avail_region_count);
> > +
> > +	mds->dynamic_bytes =
> > +		mds->dc_region[mds->nr_dc_region - 1].base +
> > +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> > +		mds->dc_region[0].base;
> > +	dev_dbg(dev, "Total dynamic range: %#llx\n", mds->dynamic_bytes);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> > +
> >  static int add_dpa_res(struct device *dev, struct resource *parent,
> >  		       struct resource *res, resource_size_t start,
> >  		       resource_size_t size, const char *type)
> > @@ -1294,8 +1447,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> >  {
> >  	struct cxl_dev_state *cxlds = &mds->cxlds;
> >  	struct device *dev = cxlds->dev;
> > +	size_t untenanted_mem;
> >  	int rc;
> >  
> > +	mds->total_bytes = mds->static_bytes;
> > +	if (mds->nr_dc_region) {
> > +		untenanted_mem = mds->dc_region[0].base - mds->static_bytes;
> > +		mds->total_bytes += untenanted_mem + mds->dynamic_bytes;
> > +	}
> > +
> >  	if (!cxlds->media_ready) {
> >  		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> >  		cxlds->ram_res = DEFINE_RES_MEM(0, 0);
> > @@ -1305,6 +1465,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> >  
> >  	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
> >  
> > +	for (int i = 0; i < mds->nr_dc_region; i++) {
> > +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> > +
> > +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> > +				 dcr->base, dcr->decode_len, dcr->name);
> > +		if (rc)
> > +			return rc;
> > +	}
> > +
> >  	if (mds->partition_align_bytes == 0) {
> >  		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> >  				 mds->volatile_only_bytes, "ram");
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index f2f8b567e0e7..b4eb8164d05d 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -402,6 +402,7 @@ enum cxl_devtype {
> >  	CXL_DEVTYPE_CLASSMEM,
> >  };
> >  
> > +#define CXL_MAX_DC_REGION 8
> >  /**
> >   * struct cxl_dpa_perf - DPA performance property entry
> >   * @dpa_range: range for DPA address
> > @@ -431,6 +432,8 @@ struct cxl_dpa_perf {
> >   * @dpa_res: Overall DPA resource tree for the device
> >   * @pmem_res: Active Persistent memory capacity configuration
> >   * @ram_res: Active Volatile memory capacity configuration
> > + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> > + *          region
> >   * @serial: PCIe Device Serial Number
> >   * @type: Generic Memory Class device or Vendor Specific Memory device
> >   */
> > @@ -445,10 +448,22 @@ struct cxl_dev_state {
> >  	struct resource dpa_res;
> >  	struct resource pmem_res;
> >  	struct resource ram_res;
> > +	struct resource dc_res[CXL_MAX_DC_REGION];
> >  	u64 serial;
> >  	enum cxl_devtype type;
> >  };
> >  
> > +#define CXL_DC_REGION_STRLEN > +struct cxl_dc_region_info {
> > +	u64 base;
> > +	u64 decode_len;
> > +	u64 len;
> > +	u64 blk_size;
> > +	u32 dsmad_handle;
> > +	u8 flags;
> > +	u8 name[CXL_DC_REGION_STRLEN];
> > +};
> 
> Does this need kdoc comments?
> 
> 
> > +
> >  /**
> >   * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
> >   *
> > @@ -466,7 +481,9 @@ struct cxl_dev_state {
> >   * @dcd_cmds: List of DCD commands implemented by memory device
> >   * @enabled_cmds: Hardware commands found enabled in CEL.
> >   * @exclusive_cmds: Commands that are kernel-internal only
> > - * @total_bytes: sum of all possible capacities
> > + * @total_bytes: length of all possible capacities
> > + * @static_bytes: length of possible static RAM and PMEM partitions
> > + * @dynamic_bytes: length of possible DC partitions (DC Regions)
> 
> Did this get added to the wrong struct comment header? 'cxl_dev_state' instead of 'cxl_memdev_state'?
> >   * @volatile_only_bytes: hard volatile capacity
> >   * @persistent_only_bytes: hard persistent capacity
> >   * @partition_align_bytes: alignment size for partition-able capacity
> > @@ -476,6 +493,8 @@ struct cxl_dev_state {
> >   * @next_persistent_bytes: persistent capacity change pending device reset
> >   * @ram_perf: performance data entry matched to RAM partition
> >   * @pmem_perf: performance data entry matched to PMEM partition
> > + * @nr_dc_region: number of DC regions implemented in the memory device
> > + * @dc_region: array containing info about the DC regions
> Did this get added to the wrong struct comment header? 'cxl_dev_state' instead of 'cxl_memdev_state'?
> 
> DJ
> 
> >   * @event: event log driver state
> >   * @poison: poison driver state info
> >   * @security: security driver state info
> > @@ -496,6 +515,8 @@ struct cxl_memdev_state {
> >  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> >  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> >  	u64 total_bytes;
> > +	u64 static_bytes;
> > +	u64 dynamic_bytes;
> >  	u64 volatile_only_bytes;
> >  	u64 persistent_only_bytes;
> >  	u64 partition_align_bytes;
> > @@ -507,6 +528,9 @@ struct cxl_memdev_state {
> >  	struct cxl_dpa_perf ram_perf;
> >  	struct cxl_dpa_perf pmem_perf;
> >  
> > +	u8 nr_dc_region;
> > +	struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> > +
> >  	struct cxl_event_state event;
> >  	struct cxl_poison_state poison;
> >  	struct cxl_security_state security;
> > @@ -709,6 +733,32 @@ struct cxl_mbox_set_partition_info {
> >  
> >  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
> >  
> > +/* See CXL 3.1 Table 8-163 get dynamic capacity config Input Payload */
> > +struct cxl_mbox_get_dc_config_in {
> > +	u8 region_count;
> > +	u8 start_region_index;
> > +} __packed;
> > +
> > +/* See CXL 3.1 Table 8-164 get dynamic capacity config Output Payload */
> > +struct cxl_mbox_get_dc_config_out {
> > +	u8 avail_region_count;
> > +	u8 regions_returned;
> > +	u8 rsvd[6];
> > +	/* See CXL 3.1 Table 8-165 */
> > +	struct cxl_dc_region_config {
> > +		__le64 region_base;
> > +		__le64 region_decode_length;
> > +		__le64 region_length;
> > +		__le64 region_block_size;
> > +		__le32 region_dsmad_handle;
> > +		u8 flags;
> > +		u8 rsvd[3];
> > +	} __packed region[];
> > +	/* Trailing fields unused */
> > +} __packed;
> > +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> > +#define CXL_DCD_BLOCK_LINE_SIZE 0x40
> > +
> >  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> >  struct cxl_mbox_set_timestamp_in {
> >  	__le64 timestamp;
> > @@ -832,6 +882,7 @@ enum {
> >  int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
> >  			  struct cxl_mbox_cmd *cmd);
> >  int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> > +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
> >  int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> >  int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> >  int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> > @@ -845,6 +896,17 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> >  			    enum cxl_event_log_type type,
> >  			    enum cxl_event_type event_type,
> >  			    const uuid_t *uuid, union cxl_event *evt);
> > +
> > +static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> > +{
> > +	return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> > +}
> > +
> > +static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
> > +{
> > +	clear_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> > +}
> > +
> >  int cxl_set_timestamp(struct cxl_memdev_state *mds);
> >  int cxl_poison_state_init(struct cxl_memdev_state *mds);
> >  int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 3a60cd66263e..f7f03599bc83 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -874,6 +874,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  	if (rc)
> >  		return rc;
> >  
> > +	rc = cxl_dev_dynamic_capacity_identify(mds);
> > +	if (rc)
> > +		cxl_disable_dcd(mds);
> > +
> >  	rc = cxl_mem_create_range_info(mds);
> >  	if (rc)
> >  		return rc;
> > 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 23/25] cxl/mem: Trace Dynamic capacity Event Record
  2024-08-16 14:44 ` [PATCH v3 23/25] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
@ 2024-08-20 22:54   ` Dave Jiang
  2024-08-26 18:02     ` Ira Weiny
  2024-08-27 14:20   ` Jonathan Cameron
  2024-09-05 19:38   ` Fan Ni
  2 siblings, 1 reply; 120+ messages in thread
From: Dave Jiang @ 2024-08-20 22:54 UTC (permalink / raw)
  To: ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Records.
> User space can use trace events for debugging of DC capacity changes.
> 
> Add DC trace points to the trace log.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

small nit below

> 
> ---
> Changes:
> [Alison: Update commit message]
> ---
>  drivers/cxl/core/mbox.c  |  4 +++
>  drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 69 insertions(+)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index d43ac8eabf56..8202fc6c111d 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -977,6 +977,10 @@ static void __cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  		ev_type = CXL_CPER_EVENT_DRAM;
>  	else if (uuid_equal(uuid, &CXL_EVENT_MEM_MODULE_UUID))
>  		ev_type = CXL_CPER_EVENT_MEM_MODULE;
> +	else if (uuid_equal(uuid, &CXL_EVENT_DC_EVENT_UUID)) {
> +		trace_cxl_dynamic_capacity(cxlmd, type, &record->event.dcd);
> +		return;
> +	}
>  
>  	cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
>  }
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index 9167cfba7f59..a3a5269311ee 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -731,6 +731,71 @@ TRACE_EVENT(cxl_poison,
>  	)
>  );
>  
> +/*
> + * DYNAMIC CAPACITY Event Record - DER
> + *
> + * CXL rev 3.0 section 8.2.9.2.1.5 Table 8-47

Should we just use 3.1 since it's the latest?

> + */
> +
> +#define CXL_DC_ADD_CAPACITY			0x00
> +#define CXL_DC_REL_CAPACITY			0x01
> +#define CXL_DC_FORCED_REL_CAPACITY		0x02
> +#define CXL_DC_REG_CONF_UPDATED			0x03
> +#define show_dc_evt_type(type)	__print_symbolic(type,		\
> +	{ CXL_DC_ADD_CAPACITY,	"Add capacity"},		\
> +	{ CXL_DC_REL_CAPACITY,	"Release capacity"},		\
> +	{ CXL_DC_FORCED_REL_CAPACITY,	"Forced capacity release"},	\
> +	{ CXL_DC_REG_CONF_UPDATED,	"Region Configuration Updated"	} \
> +)
> +
> +TRACE_EVENT(cxl_dynamic_capacity,
> +
> +	TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
> +		 struct cxl_event_dcd *rec),
> +
> +	TP_ARGS(cxlmd, log, rec),
> +
> +	TP_STRUCT__entry(
> +		CXL_EVT_TP_entry
> +
> +		/* Dynamic capacity Event */
> +		__field(u8, event_type)
> +		__field(u16, hostid)
> +		__field(u8, region_id)
> +		__field(u64, dpa_start)
> +		__field(u64, length)
> +		__array(u8, tag, CXL_EXTENT_TAG_LEN)
> +		__field(u16, sh_extent_seq)
> +	),
> +
> +	TP_fast_assign(
> +		CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
> +
> +		/* Dynamic_capacity Event */
> +		__entry->event_type = rec->event_type;
> +
> +		/* DCD event record data */
> +		__entry->hostid = le16_to_cpu(rec->host_id);
> +		__entry->region_id = rec->region_index;
> +		__entry->dpa_start = le64_to_cpu(rec->extent.start_dpa);
> +		__entry->length = le64_to_cpu(rec->extent.length);
> +		memcpy(__entry->tag, &rec->extent.tag, CXL_EXTENT_TAG_LEN);
> +		__entry->sh_extent_seq = le16_to_cpu(rec->extent.shared_extn_seq);
> +	),
> +
> +	CXL_EVT_TP_printk("event_type='%s' host_id='%d' region_id='%d' " \
> +		"starting_dpa=%llx length=%llx tag=%s " \
> +		"shared_extent_sequence=%d",
> +		show_dc_evt_type(__entry->event_type),
> +		__entry->hostid,
> +		__entry->region_id,
> +		__entry->dpa_start,
> +		__entry->length,
> +		__print_hex(__entry->tag, CXL_EXTENT_TAG_LEN),
> +		__entry->sh_extent_seq
> +	)
> +);
> +
>  #endif /* _CXL_EVENTS_H */
>  
>  #define TRACE_INCLUDE_FILE trace
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 24/25] tools/testing/cxl: Make event logs dynamic
  2024-08-16 14:44 ` [PATCH v3 24/25] tools/testing/cxl: Make event logs dynamic Ira Weiny
@ 2024-08-20 23:30   ` Dave Jiang
  2024-08-27 14:32   ` Jonathan Cameron
  1 sibling, 0 replies; 120+ messages in thread
From: Dave Jiang @ 2024-08-20 23:30 UTC (permalink / raw)
  To: Ira Weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/16/24 7:44 AM, Ira Weiny wrote:
> The test event logs were created as static arrays as an easy way to mock
> events.  Dynamic Capacity Device (DCD) test support requires events be
> generated dynamically when extents are created or destroyed.
> 
> Modify the event log storage to be dynamically allocated.  Reuse the
> static event data to create the dynamic events in the new logs without
> inventing complex event injection for the previous tests.  Simplify the
> processing of the logs by using the event log array index as the handle.
> Add a lock to manage concurrency required when user space is allowed to
> control DCD extents
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> ---
> Changes:
> [iweiny: rebase]
> ---
>  tools/testing/cxl/test/mem.c | 278 ++++++++++++++++++++++++++-----------------
>  1 file changed, 171 insertions(+), 107 deletions(-)
> 
> diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> index 129f179b0ac5..674fc7f086cd 100644
> --- a/tools/testing/cxl/test/mem.c
> +++ b/tools/testing/cxl/test/mem.c
> @@ -125,18 +125,27 @@ static struct {
>  
>  #define PASS_TRY_LIMIT 3
>  
> -#define CXL_TEST_EVENT_CNT_MAX 15
> +#define CXL_TEST_EVENT_CNT_MAX 17
>  
>  /* Set a number of events to return at a time for simulation.  */
>  #define CXL_TEST_EVENT_RET_MAX 4
>  
> +/*
> + * @next_handle: next handle (index) to be stored to
> + * @cur_handle: current handle (index) to be returned to the user on get_event
> + * @nr_events: total events in this log
> + * @nr_overflow: number of events added past the log size
> + * @lock: protect these state variables
> + * @events: array of pending events to be returned.
> + */
>  struct mock_event_log {
> -	u16 clear_idx;
> -	u16 cur_idx;
> +	u16 next_handle;
> +	u16 cur_handle;
>  	u16 nr_events;
>  	u16 nr_overflow;
> -	u16 overflow_reset;
> -	struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX];
> +	rwlock_t lock;
> +	/* 1 extra slot to accommodate that handles can't be 0 */
> +	struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX + 1];
>  };
>  
>  struct mock_event_store {
> @@ -171,56 +180,68 @@ static struct mock_event_log *event_find_log(struct device *dev, int log_type)
>  	return &mdata->mes.mock_logs[log_type];
>  }
>  
> -static struct cxl_event_record_raw *event_get_current(struct mock_event_log *log)
> -{
> -	return log->events[log->cur_idx];
> -}
> -
> -static void event_reset_log(struct mock_event_log *log)
> -{
> -	log->cur_idx = 0;
> -	log->clear_idx = 0;
> -	log->nr_overflow = log->overflow_reset;
> -}
> -
>  /* Handle can never be 0 use 1 based indexing for handle */
> -static u16 event_get_clear_handle(struct mock_event_log *log)
> +static void event_inc_handle(u16 *handle)
>  {
> -	return log->clear_idx + 1;
> +	*handle = (*handle + 1) % CXL_TEST_EVENT_CNT_MAX;
> +	if (!*handle)
> +		*handle = *handle + 1;
>  }
>  
> -/* Handle can never be 0 use 1 based indexing for handle */
> -static __le16 event_get_cur_event_handle(struct mock_event_log *log)
> -{
> -	u16 cur_handle = log->cur_idx + 1;
> -
> -	return cpu_to_le16(cur_handle);
> -}
> -
> -static bool event_log_empty(struct mock_event_log *log)
> -{
> -	return log->cur_idx == log->nr_events;
> -}
> -
> -static void mes_add_event(struct mock_event_store *mes,
> +/* Add the event or free it on 'overflow' */
> +static void mes_add_event(struct cxl_mockmem_data *mdata,
>  			  enum cxl_event_log_type log_type,
>  			  struct cxl_event_record_raw *event)
>  {
> +	struct device *dev = mdata->mds->cxlds.dev;
>  	struct mock_event_log *log;
> +	u16 handle;
>  
>  	if (WARN_ON(log_type >= CXL_EVENT_TYPE_MAX))
>  		return;
>  
> -	log = &mes->mock_logs[log_type];
> +	log = &mdata->mes.mock_logs[log_type];
>  
> -	if ((log->nr_events + 1) > CXL_TEST_EVENT_CNT_MAX) {
> +	write_lock(&log->lock);
> +
> +	handle = log->next_handle;
> +	if ((handle + 1) == log->cur_handle) {
>  		log->nr_overflow++;
> -		log->overflow_reset = log->nr_overflow;
> -		return;
> +		dev_dbg(dev, "Overflowing %d\n", log_type);
> +		devm_kfree(dev, event);
> +		goto unlock;
>  	}
>  
> -	log->events[log->nr_events] = event;
> +	dev_dbg(dev, "Log %d; handle %u\n", log_type, handle);
> +	event->event.generic.hdr.handle = cpu_to_le16(handle);
> +	log->events[handle] = event;
> +	event_inc_handle(&log->next_handle);
>  	log->nr_events++;
> +
> +unlock:
> +	write_unlock(&log->lock);
> +}
> +
> +static void mes_del_event(struct device *dev,
> +			  struct mock_event_log *log,
> +			  u16 handle)
> +{
> +	struct cxl_event_record_raw *cur;
> +
> +	lockdep_assert(lockdep_is_held(&log->lock));
> +
> +	dev_dbg(dev, "Clearing event %u; cur %u\n", handle, log->cur_handle);
> +	cur = log->events[handle];
> +	if (!cur) {
> +		dev_err(dev, "Mock event index %u empty? nr_events %u",
> +			handle, log->nr_events);
> +		return;
> +	}
> +	log->events[handle] = NULL;
> +
> +	event_inc_handle(&log->cur_handle);
> +	log->nr_events--;
> +	devm_kfree(dev, cur);
>  }
>  
>  /*
> @@ -233,8 +254,8 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  {
>  	struct cxl_get_event_payload *pl;
>  	struct mock_event_log *log;
> -	u16 nr_overflow;
>  	u8 log_type;
> +	u16 handle;
>  	int i;
>  
>  	if (cmd->size_in != sizeof(log_type))
> @@ -254,29 +275,39 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  	memset(cmd->payload_out, 0, struct_size(pl, records, 0));
>  
>  	log = event_find_log(dev, log_type);
> -	if (!log || event_log_empty(log))
> +	if (!log)
>  		return 0;
>  
>  	pl = cmd->payload_out;
>  
> -	for (i = 0; i < ret_limit && !event_log_empty(log); i++) {
> -		memcpy(&pl->records[i], event_get_current(log),
> -		       sizeof(pl->records[i]));
> -		pl->records[i].event.generic.hdr.handle =
> -				event_get_cur_event_handle(log);
> -		log->cur_idx++;
> +	read_lock(&log->lock);
> +
> +	handle = log->cur_handle;
> +	dev_dbg(dev, "Get log %d handle %u next %u\n",
> +		log_type, handle, log->next_handle);
> +	for (i = 0;
> +	     i < ret_limit && handle != log->next_handle;
> +	     i++, event_inc_handle(&handle)) {
> +		struct cxl_event_record_raw *cur;
> +
> +		cur = log->events[handle];
> +		dev_dbg(dev, "Sending event log %d handle %d idx %u\n",
> +			log_type, le16_to_cpu(cur->event.generic.hdr.handle),
> +			handle);
> +		memcpy(&pl->records[i], cur, sizeof(pl->records[i]));
> +		pl->records[i].event.generic.hdr.handle = cpu_to_le16(handle);
>  	}
>  
>  	cmd->size_out = struct_size(pl, records, i);
>  	pl->record_count = cpu_to_le16(i);
> -	if (!event_log_empty(log))
> +	if (log->nr_events > i)
>  		pl->flags |= CXL_GET_EVENT_FLAG_MORE_RECORDS;
>  
>  	if (log->nr_overflow) {
>  		u64 ns;
>  
>  		pl->flags |= CXL_GET_EVENT_FLAG_OVERFLOW;
> -		pl->overflow_err_count = cpu_to_le16(nr_overflow);
> +		pl->overflow_err_count = cpu_to_le16(log->nr_overflow);
>  		ns = ktime_get_real_ns();
>  		ns -= 5000000000; /* 5s ago */
>  		pl->first_overflow_timestamp = cpu_to_le64(ns);
> @@ -285,16 +316,17 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  		pl->last_overflow_timestamp = cpu_to_le64(ns);
>  	}
>  
> +	read_unlock(&log->lock);
>  	return 0;
>  }
>  
>  static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  {
>  	struct cxl_mbox_clear_event_payload *pl = cmd->payload_in;
> -	struct mock_event_log *log;
>  	u8 log_type = pl->event_log;
> +	struct mock_event_log *log;
> +	int nr, rc = 0;
>  	u16 handle;
> -	int nr;
>  
>  	if (log_type >= CXL_EVENT_TYPE_MAX)
>  		return -EINVAL;
> @@ -303,24 +335,23 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  	if (!log)
>  		return 0; /* No mock data in this log */
>  
> -	/*
> -	 * This check is technically not invalid per the specification AFAICS.
> -	 * (The host could 'guess' handles and clear them in order).
> -	 * However, this is not good behavior for the host so test it.
> -	 */
> -	if (log->clear_idx + pl->nr_recs > log->cur_idx) {
> -		dev_err(dev,
> -			"Attempting to clear more events than returned!\n");
> -		return -EINVAL;
> -	}
> +	write_lock(&log->lock);
>  
>  	/* Check handle order prior to clearing events */
> -	for (nr = 0, handle = event_get_clear_handle(log);
> -	     nr < pl->nr_recs;
> -	     nr++, handle++) {
> +	handle = log->cur_handle;
> +	for (nr = 0;
> +	     nr < pl->nr_recs && handle != log->next_handle;
> +	     nr++, event_inc_handle(&handle)) {
> +
> +		dev_dbg(dev, "Checking clear of %d handle %u plhandle %u\n",
> +			log_type, handle,
> +			le16_to_cpu(pl->handles[nr]));
> +
>  		if (handle != le16_to_cpu(pl->handles[nr])) {
> -			dev_err(dev, "Clearing events out of order\n");
> -			return -EINVAL;
> +			dev_err(dev, "Clearing events out of order %u %u\n",
> +				handle, le16_to_cpu(pl->handles[nr]));
> +			rc = -EINVAL;
> +			goto unlock;
>  		}
>  	}
>  
> @@ -328,25 +359,12 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  		log->nr_overflow = 0;
>  
>  	/* Clear events */
> -	log->clear_idx += pl->nr_recs;
> -	return 0;
> -}
> -
> -static void cxl_mock_event_trigger(struct device *dev)
> -{
> -	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> -	struct mock_event_store *mes = &mdata->mes;
> -	int i;
> -
> -	for (i = CXL_EVENT_TYPE_INFO; i < CXL_EVENT_TYPE_MAX; i++) {
> -		struct mock_event_log *log;
> -
> -		log = event_find_log(dev, i);
> -		if (log)
> -			event_reset_log(log);
> -	}
> +	for (nr = 0; nr < pl->nr_recs; nr++)
> +		mes_del_event(dev, log, le16_to_cpu(pl->handles[nr]));
>  
> -	cxl_mem_get_event_records(mdata->mds, mes->ev_status);
> +unlock:
> +	write_unlock(&log->lock);
> +	return rc;
>  }
>  
>  struct cxl_event_record_raw maint_needed = {
> @@ -475,8 +493,27 @@ static int mock_set_timestamp(struct cxl_dev_state *cxlds,
>  	return 0;
>  }
>  
> -static void cxl_mock_add_event_logs(struct mock_event_store *mes)
> +/* Create a dynamically allocated event out of a statically defined event. */
> +static void add_event_from_static(struct cxl_mockmem_data *mdata,
> +				  enum cxl_event_log_type log_type,
> +				  struct cxl_event_record_raw *raw)
> +{
> +	struct device *dev = mdata->mds->cxlds.dev;
> +	struct cxl_event_record_raw *rec;
> +
> +	rec = devm_kmemdup(dev, raw, sizeof(*rec), GFP_KERNEL);
> +	if (!rec) {
> +		dev_err(dev, "Failed to alloc event for log\n");
> +		return;
> +	}
> +	mes_add_event(mdata, log_type, rec);
> +}
> +
> +static void cxl_mock_add_event_logs(struct cxl_mockmem_data *mdata)
>  {
> +	struct mock_event_store *mes = &mdata->mes;
> +	struct device *dev = mdata->mds->cxlds.dev;
> +
>  	put_unaligned_le16(CXL_GMER_VALID_CHANNEL | CXL_GMER_VALID_RANK,
>  			   &gen_media.rec.media_hdr.validity_flags);
>  
> @@ -484,43 +521,60 @@ static void cxl_mock_add_event_logs(struct mock_event_store *mes)
>  			   CXL_DER_VALID_BANK | CXL_DER_VALID_COLUMN,
>  			   &dram.rec.media_hdr.validity_flags);
>  
> -	mes_add_event(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
> -	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
> +	dev_dbg(dev, "Generating fake event logs %d\n",
> +		CXL_EVENT_TYPE_INFO);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_INFO, &maint_needed);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
>  		      (struct cxl_event_record_raw *)&gen_media);
> -	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
>  		      (struct cxl_event_record_raw *)&mem_module);
>  	mes->ev_status |= CXLDEV_EVENT_STATUS_INFO;
>  
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> +	dev_dbg(dev, "Generating fake event logs %d\n",
> +		CXL_EVENT_TYPE_FAIL);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &maint_needed);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
> +		      (struct cxl_event_record_raw *)&mem_module);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
>  		      (struct cxl_event_record_raw *)&dram);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
>  		      (struct cxl_event_record_raw *)&gen_media);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
>  		      (struct cxl_event_record_raw *)&mem_module);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
>  		      (struct cxl_event_record_raw *)&dram);
>  	/* Overflow this log */
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
>  	mes->ev_status |= CXLDEV_EVENT_STATUS_FAIL;
>  
> -	mes_add_event(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FATAL,
> +	dev_dbg(dev, "Generating fake event logs %d\n",
> +		CXL_EVENT_TYPE_FATAL);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL,
>  		      (struct cxl_event_record_raw *)&dram);
>  	mes->ev_status |= CXLDEV_EVENT_STATUS_FATAL;
>  }
>  
> +static void cxl_mock_event_trigger(struct device *dev)
> +{
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	struct mock_event_store *mes = &mdata->mes;
> +
> +	cxl_mock_add_event_logs(mdata);
> +	cxl_mem_get_event_records(mdata->mds, mes->ev_status);
> +}
> +
>  static int mock_gsl(struct cxl_mbox_cmd *cmd)
>  {
>  	if (cmd->size_out < sizeof(mock_gsl_payload))
> @@ -1453,6 +1507,14 @@ static ssize_t event_trigger_store(struct device *dev,
>  }
>  static DEVICE_ATTR_WO(event_trigger);
>  
> +static void init_event_log(struct mock_event_log *log)
> +{
> +	rwlock_init(&log->lock);
> +	/* Handle can never be 0 use 1 based indexing for handle */
> +	log->cur_handle = 1;
> +	log->next_handle = 1;
> +}
> +
>  static int cxl_mock_mem_probe(struct platform_device *pdev)
>  {
>  	struct device *dev = &pdev->dev;
> @@ -1519,7 +1581,9 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
>  	if (rc)
>  		return rc;
>  
> -	cxl_mock_add_event_logs(&mdata->mes);
> +	for (int i = 0; i < CXL_EVENT_TYPE_MAX; i++)
> +		init_event_log(&mdata->mes.mock_logs[i]);
> +	cxl_mock_add_event_logs(mdata);
>  
>  	cxlmd = devm_cxl_add_memdev(&pdev->dev, cxlds);
>  	if (IS_ERR(cxlmd))
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-20 14:08   ` Petr Mladek
@ 2024-08-22 17:53     ` Ira Weiny
  2024-08-22 18:10       ` Andy Shevchenko
  2024-08-26 13:17       ` Petr Mladek
  0 siblings, 2 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-22 17:53 UTC (permalink / raw)
  To: Petr Mladek, Ira Weiny
  Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

Petr Mladek wrote:
> On Fri 2024-08-16 09:44:10, Ira Weiny wrote:
> > The use of struct range in the CXL subsystem is growing.  In particular,
> > the addition of Dynamic Capacity devices uses struct range in a number
> > of places which are reported in debug and error messages.
> > 
> > To wit requiring the printing of the start/end fields in each print
> > became cumbersome.  Dan Williams mentions in [1] that it might be time
> > to have a print specifier for struct range similar to struct resource
> > 
> > A few alternatives were considered including '%pn' for 'print raNge' but
> > %par follows that struct range is most often used to store a range of
> > physical addresses.  So use '%par' for 'print address range'.
> > 
> > --- a/Documentation/core-api/printk-formats.rst
> > +++ b/Documentation/core-api/printk-formats.rst
> > @@ -231,6 +231,20 @@ width of the CPU data path.
> >  
> >  Passed by reference.
> >  
> > +Struct Range
> > +------------
> > +
> > +::
> > +
> > +	%par	[range 0x60000000-0x6fffffff] or
> 
> It seems that it is always 64-bit. It prints:
> 
> struct range {
> 	u64   start;
> 	u64   end;
> };

Indeed.  Thanks I should not have just copied/pasted.

> 
> > +		[range 0x0000000060000000-0x000000006fffffff]
> > +
> > +For printing struct range.  A variation of printing a physical address is to
> > +print the value of struct range which are often used to hold a physical address
> > +range.
> > +
> > +Passed by reference.
> > +
> >  DMA address types dma_addr_t
> >  ----------------------------
> >  
> > diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> > index 2d71b1115916..c132178fac07 100644
> > --- a/lib/vsprintf.c
> > +++ b/lib/vsprintf.c
> > @@ -1140,6 +1140,39 @@ char *resource_string(char *buf, char *end, struct resource *res,
> >  	return string_nocheck(buf, end, sym, spec);
> >  }
> >  
> > +static noinline_for_stack
> > +char *range_string(char *buf, char *end, const struct range *range,
> > +		      struct printf_spec spec, const char *fmt)
> > +{
> > +#define RANGE_PRINTK_SIZE		16
> > +#define RANGE_DECODED_BUF_SIZE		((2 * sizeof(struct range)) + 4)
> > +#define RANGE_PRINT_BUF_SIZE		sizeof("[range - ]")
> 
> I think that it should be "[range -]"

Sounds good.

> 
> > +	char sym[RANGE_DECODED_BUF_SIZE + RANGE_PRINT_BUF_SIZE];
> > +	char *p = sym, *pend = sym + sizeof(sym);
> > +
> > +	static const struct printf_spec str_spec = {
> > +		.field_width = -1,
> > +		.precision = 10,
> > +		.flags = LEFT,
> > +	};
> 
> Is this really needed? What about using "default_str_spec" instead?

Because I got confused and was coping from resource_string().

Deleted now...

> 
> > +	static const struct printf_spec range_spec = {
> > +		.base = 16,
> > +		.field_width = RANGE_PRINTK_SIZE,

However, my testing indicates this needs to be.

                .field_width = 18, /* 2 (0x) + 2 * 8 (bytes) */

... to properly zero pad the value.  Does that make sense?

> > +		.precision = -1,
> > +		.flags = SPECIAL | SMALL | ZEROPAD,
> > +	};
> > +
> > +	*p++ = '[';
> > +	p = string_nocheck(p, pend, "range ", str_spec);
> > +	p = number(p, pend, range->start, range_spec);
> > +	*p++ = '-';
> > +	p = number(p, pend, range->end, range_spec);
> > +	*p++ = ']';
> > +	*p = '\0';
> > +
> > +	return string_nocheck(buf, end, sym, spec);
> > +}
> > +
> >  static noinline_for_stack
> >  char *hex_string(char *buf, char *end, u8 *addr, struct printf_spec spec,
> >  		 const char *fmt)
> 
> Also add a selftest into lib/test_printf.c, please.

Yes of course...  Makes testing easier too.

Thanks,
Ira

> 
> Best Regards,
> Petr

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-22 17:53     ` Ira Weiny
@ 2024-08-22 18:10       ` Andy Shevchenko
  2024-08-26 13:23         ` Petr Mladek
  2024-08-26 13:17       ` Petr Mladek
  1 sibling, 1 reply; 120+ messages in thread
From: Andy Shevchenko @ 2024-08-22 18:10 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Petr Mladek, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Steven Rostedt,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Thu, Aug 22, 2024 at 12:53:32PM -0500, Ira Weiny wrote:
> Petr Mladek wrote:
> > On Fri 2024-08-16 09:44:10, Ira Weiny wrote:

...

> > > +	%par	[range 0x60000000-0x6fffffff] or
> > 
> > It seems that it is always 64-bit. It prints:
> > 
> > struct range {
> > 	u64   start;
> > 	u64   end;
> > };
> 
> Indeed.  Thanks I should not have just copied/pasted.

With that said, I'm not sure the %pa is a good placeholder for this ('a' stands
to "address" AFAIU). Perhaps this should go somewhere under %pr/%pR?

> > > +		[range 0x0000000060000000-0x000000006fffffff]
> > > +
> > > +For printing struct range.  A variation of printing a physical address is to
> > > +print the value of struct range which are often used to hold a physical address
> > > +range.
> > > +
> > > +Passed by reference.

...

> > Is this really needed? What about using "default_str_spec" instead?
> 
> Because I got confused and was coping from resource_string().
> 
> Deleted now...
> 
> > > +		.field_width = RANGE_PRINTK_SIZE,
> 
> However, my testing indicates this needs to be.
> 
>                 .field_width = 18, /* 2 (0x) + 2 * 8 (bytes) */
> 
> ... to properly zero pad the value.  Does that make sense?

Looking at this, moving under %pr/R should deduplicate the code, no?
I.o.w. better to use existing code for them to print struct range, no?

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 13/25] cxl/region: Add sparse DAX region support
  2024-08-16 14:44 ` [PATCH v3 13/25] cxl/region: Add sparse DAX region support ira.weiny
  2024-08-16 23:51   ` Dave Jiang
@ 2024-08-22 18:50   ` Fan Ni
  2024-08-23 16:59   ` Jonathan Cameron
  2024-09-03  2:15   ` Li, Ming4
  3 siblings, 0 replies; 120+ messages in thread
From: Fan Ni @ 2024-08-22 18:50 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm

On Fri, Aug 16, 2024 at 09:44:21AM -0500, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Dynamic Capacity CXL regions must allow memory to be added or removed
> dynamically.  In addition to the quantity of memory available the
> location of the memory within a DC partition is dynamic based on the
> extents offered by a device.  CXL DAX regions must accommodate the
> sparseness of this memory in the management of DAX regions and devices.
> 
> Introduce the concept of a sparse DAX region.  Add a create_dc_region()
> sysfs entry to create such regions.  Special case DC capable regions to
> create a 0 sized seed DAX device to maintain compatibility which
> requires a default DAX device to hold a region reference.
> 
> Indicate 0 byte available capacity until such time that capacity is
> added.
> 
> Sparse regions complicate the range mapping of dax devices.  There is no
> known use case for range mapping on sparse regions.  Avoid the
> complication by preventing range mapping of dax devices on sparse
> regions.
> 
> Interleaving is deferred for now.  Add checks.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes:
> [Fan: use single function for dc region store]
> [djiang: avoid setting dev_size twice]
> [djbw: Check DCD support and interleave restriction on region creation]
> [iweiny: squash patch : dax/region: Prevent range mapping allocation on sparse regions]
> [iwieny: remove reviews]
> [iweiny: rebase to master]
> [iweiny: push sysfs version to 6.12]
> [iweiny: make cxled_to_mds inline]
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl | 22 ++++++++--------
>  drivers/cxl/core/core.h                 | 12 +++++++++
>  drivers/cxl/core/port.c                 |  1 +
>  drivers/cxl/core/region.c               | 46 +++++++++++++++++++++++++++++++--
>  drivers/dax/bus.c                       | 10 +++++++
>  drivers/dax/bus.h                       |  1 +
>  drivers/dax/cxl.c                       | 16 ++++++++++--
>  7 files changed, 93 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 6227ae0ab3fc..3a5ee88e551b 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -406,20 +406,20 @@ Description:
>  		interleave_granularity).
>  
>  
> -What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> -Date:		May, 2022, January, 2023
> -KernelVersion:	v6.0 (pmem), v6.3 (ram)
> +What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dc}_region
> +Date:		May, 2022, January, 2023, August 2024
> +KernelVersion:	v6.0 (pmem), v6.3 (ram), v6.12 (dc)
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Write a string in the form 'regionZ' to start the process
> -		of defining a new persistent, or volatile memory region
> -		(interleave-set) within the decode range bounded by root decoder
> -		'decoderX.Y'. The value written must match the current value
> -		returned from reading this attribute. An atomic compare exchange
> -		operation is done on write to assign the requested id to a
> -		region and allocate the region-id for the next creation attempt.
> -		EBUSY is returned if the region name written does not match the
> -		current cached value.
> +		of defining a new persistent, volatile, or Dynamic Capacity
> +		(DC) memory region (interleave-set) within the decode range
> +		bounded by root decoder 'decoderX.Y'. The value written must
> +		match the current value returned from reading this attribute.
> +		An atomic compare exchange operation is done on write to assign
> +		the requested id to a region and allocate the region-id for the
> +		next creation attempt.  EBUSY is returned if the region name
> +		written does not match the current cached value.
>  
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 72a506c9dbd0..15b6cf1c19ef 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -4,15 +4,27 @@
>  #ifndef __CXL_CORE_H__
>  #define __CXL_CORE_H__
>  
> +#include <cxlmem.h>
> +
>  extern const struct device_type cxl_nvdimm_bridge_type;
>  extern const struct device_type cxl_nvdimm_type;
>  extern const struct device_type cxl_pmu_type;
>  
>  extern struct attribute_group cxl_base_attribute_group;
>  
> +static inline struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> +	return container_of(cxlds, struct cxl_memdev_state, cxlds);
> +}
> +
>  #ifdef CONFIG_CXL_REGION
>  extern struct device_attribute dev_attr_create_pmem_region;
>  extern struct device_attribute dev_attr_create_ram_region;
> +extern struct device_attribute dev_attr_create_dc_region;
>  extern struct device_attribute dev_attr_delete_region;
>  extern struct device_attribute dev_attr_region;
>  extern const struct device_type cxl_pmem_region_type;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 222aa0aeeef7..44e1e203173d 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -320,6 +320,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>  	&dev_attr_qos_class.attr,
>  	SET_CXL_REGION_ATTR(create_pmem_region)
>  	SET_CXL_REGION_ATTR(create_ram_region)
> +	SET_CXL_REGION_ATTR(create_dc_region)
>  	SET_CXL_REGION_ATTR(delete_region)
>  	NULL,
>  };
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index f85b26b39b2f..35c4a1f4f9bd 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -496,6 +496,11 @@ static ssize_t interleave_ways_store(struct device *dev,
>  	if (rc)
>  		return rc;
>  
> +	if (cxlr->mode == CXL_REGION_DC && val != 1) {
> +		dev_err(dev, "Interleaving and DCD not supported\n");

Is there a typo here?
Maybe "Interleaving a DCD not supported"?

> +		return -EINVAL;
> +	}
> +
>  	rc = ways_to_eiw(val, &iw);
>  	if (rc)
>  		return rc;
> @@ -2174,6 +2179,7 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
>  	if (sysfs_streq(buf, "\n"))
>  		rc = detach_target(cxlr, pos);
>  	else {
> +		struct cxl_endpoint_decoder *cxled;
>  		struct device *dev;
>  
>  		dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
> @@ -2185,8 +2191,13 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
>  			goto out;
>  		}
>  
> -		rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
> -				   TASK_INTERRUPTIBLE);
> +		cxled = to_cxl_endpoint_decoder(dev);
> +		if (cxlr->mode == CXL_REGION_DC &&
> +		    !cxl_dcd_supported(cxled_to_mds(cxled))) {
> +			dev_dbg(dev, "DCD unsupported\n");
> +			return -EINVAL;
> +		}
> +		rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
>  out:
>  		put_device(dev);
>  	}
> @@ -2534,6 +2545,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
>  	switch (mode) {
>  	case CXL_REGION_RAM:
>  	case CXL_REGION_PMEM:
> +	case CXL_REGION_DC:
>  		break;
>  	default:
>  		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
> @@ -2587,6 +2599,20 @@ static ssize_t create_ram_region_store(struct device *dev,
>  }
>  DEVICE_ATTR_RW(create_ram_region);
>  
> +static ssize_t create_dc_region_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dc_region_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t len)
> +{
> +	return create_region_store(dev, buf, len, CXL_REGION_DC);
> +}
> +DEVICE_ATTR_RW(create_dc_region);
> +
>  static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -3168,6 +3194,11 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>  	struct device *dev;
>  	int rc;
>  
> +	if (cxlr->mode == CXL_REGION_DC && cxlr->params.interleave_ways != 1) {
> +		dev_err(&cxlr->dev, "Interleaving DC not supported\n");
> +		return -EINVAL;
> +	}
> +
>  	cxlr_dax = cxl_dax_region_alloc(cxlr);
>  	if (IS_ERR(cxlr_dax))
>  		return PTR_ERR(cxlr_dax);
> @@ -3260,6 +3291,16 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>  		return ERR_PTR(-EINVAL);
>  
>  	mode = cxl_decoder_to_region_mode(cxled->mode);
> +	if (mode == CXL_REGION_DC) {
> +		if (!cxl_dcd_supported(cxled_to_mds(cxled))) {
> +			dev_err(&cxled->cxld.dev, "DCD unsupported\n");
> +			return ERR_PTR(-EINVAL);
> +		}
> +		if (cxled->cxld.interleave_ways != 1) {
> +			dev_err(&cxled->cxld.dev, "Interleaving and DCD not supported\n");
If it goes here, it means DCD is upported, but interleaving is not, so
the message here may also need change.

Fan
> +			return ERR_PTR(-EINVAL);
> +		}
> +	}
>  	do {
>  		cxlr = __create_region(cxlrd, mode,
>  				       atomic_read(&cxlrd->region_id));
> @@ -3467,6 +3508,7 @@ static int cxl_region_probe(struct device *dev)
>  	case CXL_REGION_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
>  	case CXL_REGION_RAM:
> +	case CXL_REGION_DC:
>  		/*
>  		 * The region can not be manged by CXL if any portion of
>  		 * it is already online as 'System RAM'
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index fde29e0ad68b..d8cb5195a227 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -178,6 +178,11 @@ static bool is_static(struct dax_region *dax_region)
>  	return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
>  }
>  
> +static bool is_sparse(struct dax_region *dax_region)
> +{
> +	return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
> +}
> +
>  bool static_dev_dax(struct dev_dax *dev_dax)
>  {
>  	return is_static(dev_dax->region);
> @@ -301,6 +306,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
>  
>  	lockdep_assert_held(&dax_region_rwsem);
>  
> +	if (is_sparse(dax_region))
> +		return 0;
> +
>  	for_each_dax_region_resource(dax_region, res)
>  		size -= resource_size(res);
>  	return size;
> @@ -1373,6 +1381,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
>  		return 0;
>  	if (a == &dev_attr_mapping.attr && is_static(dax_region))
>  		return 0;
> +	if (a == &dev_attr_mapping.attr && is_sparse(dax_region))
> +		return 0;
>  	if ((a == &dev_attr_align.attr ||
>  	     a == &dev_attr_size.attr) && is_static(dax_region))
>  		return 0444;
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index cbbf64443098..783bfeef42cc 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -13,6 +13,7 @@ struct dax_region;
>  /* dax bus specific ioresource flags */
>  #define IORESOURCE_DAX_STATIC BIT(0)
>  #define IORESOURCE_DAX_KMEM BIT(1)
> +#define IORESOURCE_DAX_SPARSE_CAP BIT(2)
>  
>  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>  		struct range *range, int target_node, unsigned int align,
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 9b29e732b39a..367e86b1c22a 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -13,19 +13,31 @@ static int cxl_dax_region_probe(struct device *dev)
>  	struct cxl_region *cxlr = cxlr_dax->cxlr;
>  	struct dax_region *dax_region;
>  	struct dev_dax_data data;
> +	resource_size_t dev_size;
> +	unsigned long flags;
>  
>  	if (nid == NUMA_NO_NODE)
>  		nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
>  
> +	flags = IORESOURCE_DAX_KMEM;
> +	if (cxlr->mode == CXL_REGION_DC)
> +		flags |= IORESOURCE_DAX_SPARSE_CAP;
> +
>  	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> -				      PMD_SIZE, IORESOURCE_DAX_KMEM);
> +				      PMD_SIZE, flags);
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> +	if (cxlr->mode == CXL_REGION_DC)
> +		/* Add empty seed dax device */
> +		dev_size = 0;
> +	else
> +		dev_size = range_len(&cxlr_dax->hpa_range);
> +
>  	data = (struct dev_dax_data) {
>  		.dax_region = dax_region,
>  		.id = -1,
> -		.size = range_len(&cxlr_dax->hpa_range),
> +		.size = dev_size,
>  		.memmap_on_memory = true,
>  	};
>  
> 
> -- 
> 2.45.2
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 12/25] cxl/region: Refactor common create region code
  2024-08-16 14:44 ` [PATCH v3 12/25] cxl/region: Refactor common create region code Ira Weiny
  2024-08-16 23:43   ` Dave Jiang
@ 2024-08-22 18:51   ` Fan Ni
  2024-08-23 16:17   ` Jonathan Cameron
  2024-09-03  7:04   ` Li, Ming4
  3 siblings, 0 replies; 120+ messages in thread
From: Fan Ni @ 2024-08-22 18:51 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm

On Fri, Aug 16, 2024 at 09:44:20AM -0500, Ira Weiny wrote:
> create_pmem_region_store() and create_ram_region_store() are identical
> with the exception of the region mode.  With the addition of DC region
> mode this would end up being 3 copies of the same code.
> 
> Refactor create_pmem_region_store() and create_ram_region_store() to use
> a single common function to be used in subsequent DC code.
> 
> Suggested-by: Fan Ni <fan.ni@samsung.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---

Reviewed-by: Fan Ni <fan.ni@samsung.com>

>  drivers/cxl/core/region.c | 28 +++++++++++-----------------
>  1 file changed, 11 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 650fe33f2ed4..f85b26b39b2f 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2553,9 +2553,8 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
>  	return devm_cxl_add_region(cxlrd, id, mode, CXL_DECODER_HOSTONLYMEM);
>  }
>  
> -static ssize_t create_pmem_region_store(struct device *dev,
> -					struct device_attribute *attr,
> -					const char *buf, size_t len)
> +static ssize_t create_region_store(struct device *dev, const char *buf,
> +				   size_t len, enum cxl_region_mode mode)
>  {
>  	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
>  	struct cxl_region *cxlr;
> @@ -2565,31 +2564,26 @@ static ssize_t create_pmem_region_store(struct device *dev,
>  	if (rc != 1)
>  		return -EINVAL;
>  
> -	cxlr = __create_region(cxlrd, CXL_REGION_PMEM, id);
> +	cxlr = __create_region(cxlrd, mode, id);
>  	if (IS_ERR(cxlr))
>  		return PTR_ERR(cxlr);
>  
>  	return len;
>  }
> +
> +static ssize_t create_pmem_region_store(struct device *dev,
> +					struct device_attribute *attr,
> +					const char *buf, size_t len)
> +{
> +	return create_region_store(dev, buf, len, CXL_REGION_PMEM);
> +}
>  DEVICE_ATTR_RW(create_pmem_region);
>  
>  static ssize_t create_ram_region_store(struct device *dev,
>  				       struct device_attribute *attr,
>  				       const char *buf, size_t len)
>  {
> -	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
> -	struct cxl_region *cxlr;
> -	int rc, id;
> -
> -	rc = sscanf(buf, "region%d\n", &id);
> -	if (rc != 1)
> -		return -EINVAL;
> -
> -	cxlr = __create_region(cxlrd, CXL_REGION_RAM, id);
> -	if (IS_ERR(cxlr))
> -		return PTR_ERR(cxlr);
> -
> -	return len;
> +	return create_region_store(dev, buf, len, CXL_REGION_RAM);
>  }
>  DEVICE_ATTR_RW(create_ram_region);
>  
> 
> -- 
> 2.45.2
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 14/25] cxl/events: Split event msgnum configuration from irq setup
  2024-08-16 14:44 ` [PATCH v3 14/25] cxl/events: Split event msgnum configuration from irq setup Ira Weiny
  2024-08-16 23:57   ` Dave Jiang
@ 2024-08-22 21:39   ` Fan Ni
  2024-08-23 17:01   ` Jonathan Cameron
  2024-09-03  7:06   ` Li, Ming4
  3 siblings, 0 replies; 120+ messages in thread
From: Fan Ni @ 2024-08-22 21:39 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm

On Fri, Aug 16, 2024 at 09:44:22AM -0500, Ira Weiny wrote:
> Dynamic Capacity Devices (DCD) require event interrupts to process
> memory addition or removal.  BIOS may have control over non-DCD event
> processing.  DCD interrupt configuration needs to be separate from
> memory event interrupt configuration.
> 
> Split cxl_event_config_msgnums() from irq setup in preparation for
> separate DCD interrupts configuration.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---

Reviewed-by: Fan Ni <fan.ni@samsung.com>

>  drivers/cxl/pci.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index f7f03599bc83..17bea49bbf4d 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -698,35 +698,31 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
>  	return cxl_event_get_int_policy(mds, policy);
>  }
>  
> -static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
> +static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
> +			      struct cxl_event_interrupt_policy *policy)
>  {
>  	struct cxl_dev_state *cxlds = &mds->cxlds;
> -	struct cxl_event_interrupt_policy policy;
>  	int rc;
>  
> -	rc = cxl_event_config_msgnums(mds, &policy);
> -	if (rc)
> -		return rc;
> -
> -	rc = cxl_event_req_irq(cxlds, policy.info_settings);
> +	rc = cxl_event_req_irq(cxlds, policy->info_settings);
>  	if (rc) {
>  		dev_err(cxlds->dev, "Failed to get interrupt for event Info log\n");
>  		return rc;
>  	}
>  
> -	rc = cxl_event_req_irq(cxlds, policy.warn_settings);
> +	rc = cxl_event_req_irq(cxlds, policy->warn_settings);
>  	if (rc) {
>  		dev_err(cxlds->dev, "Failed to get interrupt for event Warn log\n");
>  		return rc;
>  	}
>  
> -	rc = cxl_event_req_irq(cxlds, policy.failure_settings);
> +	rc = cxl_event_req_irq(cxlds, policy->failure_settings);
>  	if (rc) {
>  		dev_err(cxlds->dev, "Failed to get interrupt for event Failure log\n");
>  		return rc;
>  	}
>  
> -	rc = cxl_event_req_irq(cxlds, policy.fatal_settings);
> +	rc = cxl_event_req_irq(cxlds, policy->fatal_settings);
>  	if (rc) {
>  		dev_err(cxlds->dev, "Failed to get interrupt for event Fatal log\n");
>  		return rc;
> @@ -745,7 +741,7 @@ static bool cxl_event_int_is_fw(u8 setting)
>  static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  			    struct cxl_memdev_state *mds, bool irq_avail)
>  {
> -	struct cxl_event_interrupt_policy policy;
> +	struct cxl_event_interrupt_policy policy = { 0 };
>  	int rc;
>  
>  	/*
> @@ -773,11 +769,15 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  		return -EBUSY;
>  	}
>  
> +	rc = cxl_event_config_msgnums(mds, &policy);
> +	if (rc)
> +		return rc;
> +
>  	rc = cxl_mem_alloc_event_buf(mds);
>  	if (rc)
>  		return rc;
>  
> -	rc = cxl_event_irqsetup(mds);
> +	rc = cxl_event_irqsetup(mds, &policy);
>  	if (rc)
>  		return rc;
>  
> 
> -- 
> 2.45.2
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 15/25] cxl/pci: Factor out interrupt policy check
  2024-08-16 14:44 ` [PATCH v3 15/25] cxl/pci: Factor out interrupt policy check Ira Weiny
@ 2024-08-22 21:41   ` Fan Ni
  2024-09-03  7:07   ` Li, Ming4
  1 sibling, 0 replies; 120+ messages in thread
From: Fan Ni @ 2024-08-22 21:41 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm

On Fri, Aug 16, 2024 at 09:44:23AM -0500, Ira Weiny wrote:
> Dynamic Capacity Devices (DCD) require event interrupts to process
> memory addition or removal.  BIOS may have control over non-DCD event
> processing.  DCD interrupt configuration needs to be separate from
> memory event interrupt configuration.
> 
> Factor out event interrupt setting validation.
> 
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 

Reviewed-by: Fan Ni <fan.ni@samsung.com>

> ---
> Changes:
> [iweiny: reword commit message]
> [iweiny: keep review tags on simple patch]
> ---
>  drivers/cxl/pci.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 17bea49bbf4d..370c74eae323 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -738,6 +738,21 @@ static bool cxl_event_int_is_fw(u8 setting)
>  	return mode == CXL_INT_FW;
>  }
>  
> +static bool cxl_event_validate_mem_policy(struct cxl_memdev_state *mds,
> +					  struct cxl_event_interrupt_policy *policy)
> +{
> +	if (cxl_event_int_is_fw(policy->info_settings) ||
> +	    cxl_event_int_is_fw(policy->warn_settings) ||
> +	    cxl_event_int_is_fw(policy->failure_settings) ||
> +	    cxl_event_int_is_fw(policy->fatal_settings)) {
> +		dev_err(mds->cxlds.dev,
> +			"FW still in control of Event Logs despite _OSC settings\n");
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
>  static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  			    struct cxl_memdev_state *mds, bool irq_avail)
>  {
> @@ -760,14 +775,8 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  	if (rc)
>  		return rc;
>  
> -	if (cxl_event_int_is_fw(policy.info_settings) ||
> -	    cxl_event_int_is_fw(policy.warn_settings) ||
> -	    cxl_event_int_is_fw(policy.failure_settings) ||
> -	    cxl_event_int_is_fw(policy.fatal_settings)) {
> -		dev_err(mds->cxlds.dev,
> -			"FW still in control of Event Logs despite _OSC settings\n");
> +	if (!cxl_event_validate_mem_policy(mds, &policy))
>  		return -EBUSY;
> -	}
>  
>  	rc = cxl_event_config_msgnums(mds, &policy);
>  	if (rc)
> 
> -- 
> 2.45.2
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 06/25] cxl/mem: Read dynamic capacity configuration from the device
  2024-08-20 17:01     ` Fan Ni
@ 2024-08-23  2:01       ` Ira Weiny
  2024-08-23  2:02       ` Ira Weiny
  1 sibling, 0 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-23  2:01 UTC (permalink / raw)
  To: Fan Ni, Dave Jiang
  Cc: ira.weiny, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm, Li, Ming

Fan Ni wrote:
> On Fri, Aug 16, 2024 at 02:45:47PM -0700, Dave Jiang wrote:
> > 
> > > +
> > > +/**
> > > + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> > > + *					 information from the device.
> > > + * @mds: The memory device state
> > > + *
> > > + * Read Dynamic Capacity information from the device and populate the state
> > > + * structures for later use.
> > > + *
> > > + * Return: 0 if identify was executed successfully, -ERRNO on error.
> > > + */
> > > +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> > > +{
> > > +	size_t dc_resp_size = mds->payload_size;
> > > +	struct device *dev = mds->cxlds.dev;
> > > +	u8 start_region, i;
> > > +
> > > +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> > > +		snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> > > +
> > > +	if (!cxl_dcd_supported(mds)) {
> > > +		dev_dbg(dev, "DCD not supported\n");
> > > +		return 0;
> > > +	}
> > 
> > This should happen before you pre-format the name string? I would assume that if DCD is not supported then the dcd name sysfs attribs would be not be visible?
> > 

No this string is not used for sysfs.  It is used to label the dpa
resources...  That said in review I don't recall why it was necessary to
add the '<nil>' to them by default.  I'm actually going to remove that and
continue testing and if I recall where this was showing up I might add it
back in.

> > > +
> > > +	struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
> > > +					kvmalloc(dc_resp_size, GFP_KERNEL);
> > > +	if (!dc_resp)
> > > +		return -ENOMEM;
> > > +
> > > +	start_region = 0;
> > > +	do {
> > > +		int rc, j;
> > > +
> > > +		rc = cxl_get_dc_config(mds, start_region, dc_resp, dc_resp_size);
> > > +		if (rc < 0) {
> > > +			dev_dbg(dev, "Failed to get DC config: %d\n", rc);
> > > +			return rc;
> > > +		}
> > > +
> > > +		mds->nr_dc_region += rc;
> > > +
> > > +		if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> > > +			dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> > > +				mds->nr_dc_region);
> > > +			return -EINVAL;
> > > +		}
> > > +
> > > +		for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
> > 
> > This should be 'j < mds->nr_dc_region'? Otherwise if your start region say is '3' and you have '2' DC regions, you never enter the loop. Or does that not happen? I also wonder if you need to check if 'start_region + mds->nr_dc_region > CXL_MAX_DC_REGION'.
> > 
> That can not happen, start_region was updated to the number of regions
> has returned till now (not counting the current call), while
> nr_dc_region is the total number of regions returned till now (including
> the current call) as we update it above, so start_region should never be larger
> than nr_dc_region.

Yep.

> 
> > > +			rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> > > +			if (rc) {
> > > +				dev_dbg(dev, "Failed to save region info: %d\n", rc);
> 
> I am not sure why we sometimes use dev_err and sometimes we use dev_dbg
> here, if dcd is supported, error from getting dc configuration is an
> error to me.

We are trying to reduce the dev_err() use.  cxl_dc_save_region_info() has
dev_err() which is much more specific as to the error.  At worse this is
just redundant as a debug.

I'll remove it because the debug output is pretty verbose too.

Ira

> 
> Fan
> 
> > > +				return rc;
> > > +			}
> > > +		}
> > > +
> > > +		start_region = mds->nr_dc_region;
> > > +
> > > +	} while (mds->nr_dc_region < dc_resp->avail_region_count);
> > > +
> > > +	mds->dynamic_bytes =
> > > +		mds->dc_region[mds->nr_dc_region - 1].base +
> > > +		mds->dc_region[mds->nr_dc_region - 1].decode_len -
> > > +		mds->dc_region[0].base;
> > > +	dev_dbg(dev, "Total dynamic range: %#llx\n", mds->dynamic_bytes);
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> > > +
> > >  static int add_dpa_res(struct device *dev, struct resource *parent,
> > >  		       struct resource *res, resource_size_t start,
> > >  		       resource_size_t size, const char *type)
> > > @@ -1294,8 +1447,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> > >  {
> > >  	struct cxl_dev_state *cxlds = &mds->cxlds;
> > >  	struct device *dev = cxlds->dev;
> > > +	size_t untenanted_mem;
> > >  	int rc;
> > >  
> > > +	mds->total_bytes = mds->static_bytes;
> > > +	if (mds->nr_dc_region) {
> > > +		untenanted_mem = mds->dc_region[0].base - mds->static_bytes;
> > > +		mds->total_bytes += untenanted_mem + mds->dynamic_bytes;
> > > +	}
> > > +
> > >  	if (!cxlds->media_ready) {
> > >  		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> > >  		cxlds->ram_res = DEFINE_RES_MEM(0, 0);
> > > @@ -1305,6 +1465,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> > >  
> > >  	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
> > >  
> > > +	for (int i = 0; i < mds->nr_dc_region; i++) {
> > > +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> > > +
> > > +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> > > +				 dcr->base, dcr->decode_len, dcr->name);
> > > +		if (rc)
> > > +			return rc;
> > > +	}
> > > +
> > >  	if (mds->partition_align_bytes == 0) {
> > >  		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> > >  				 mds->volatile_only_bytes, "ram");
> > > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > > index f2f8b567e0e7..b4eb8164d05d 100644
> > > --- a/drivers/cxl/cxlmem.h
> > > +++ b/drivers/cxl/cxlmem.h
> > > @@ -402,6 +402,7 @@ enum cxl_devtype {
> > >  	CXL_DEVTYPE_CLASSMEM,
> > >  };
> > >  
> > > +#define CXL_MAX_DC_REGION 8
> > >  /**
> > >   * struct cxl_dpa_perf - DPA performance property entry
> > >   * @dpa_range: range for DPA address
> > > @@ -431,6 +432,8 @@ struct cxl_dpa_perf {
> > >   * @dpa_res: Overall DPA resource tree for the device
> > >   * @pmem_res: Active Persistent memory capacity configuration
> > >   * @ram_res: Active Volatile memory capacity configuration
> > > + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> > > + *          region
> > >   * @serial: PCIe Device Serial Number
> > >   * @type: Generic Memory Class device or Vendor Specific Memory device
> > >   */
> > > @@ -445,10 +448,22 @@ struct cxl_dev_state {
> > >  	struct resource dpa_res;
> > >  	struct resource pmem_res;
> > >  	struct resource ram_res;
> > > +	struct resource dc_res[CXL_MAX_DC_REGION];
> > >  	u64 serial;
> > >  	enum cxl_devtype type;
> > >  };
> > >  
> > > +#define CXL_DC_REGION_STRLEN > +struct cxl_dc_region_info {
> > > +	u64 base;
> > > +	u64 decode_len;
> > > +	u64 len;
> > > +	u64 blk_size;
> > > +	u32 dsmad_handle;
> > > +	u8 flags;
> > > +	u8 name[CXL_DC_REGION_STRLEN];
> > > +};
> > 
> > Does this need kdoc comments?
> > 
> > 
> > > +
> > >  /**
> > >   * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
> > >   *
> > > @@ -466,7 +481,9 @@ struct cxl_dev_state {
> > >   * @dcd_cmds: List of DCD commands implemented by memory device
> > >   * @enabled_cmds: Hardware commands found enabled in CEL.
> > >   * @exclusive_cmds: Commands that are kernel-internal only
> > > - * @total_bytes: sum of all possible capacities
> > > + * @total_bytes: length of all possible capacities
> > > + * @static_bytes: length of possible static RAM and PMEM partitions
> > > + * @dynamic_bytes: length of possible DC partitions (DC Regions)
> > 
> > Did this get added to the wrong struct comment header? 'cxl_dev_state' instead of 'cxl_memdev_state'?
> > >   * @volatile_only_bytes: hard volatile capacity
> > >   * @persistent_only_bytes: hard persistent capacity
> > >   * @partition_align_bytes: alignment size for partition-able capacity
> > > @@ -476,6 +493,8 @@ struct cxl_dev_state {
> > >   * @next_persistent_bytes: persistent capacity change pending device reset
> > >   * @ram_perf: performance data entry matched to RAM partition
> > >   * @pmem_perf: performance data entry matched to PMEM partition
> > > + * @nr_dc_region: number of DC regions implemented in the memory device
> > > + * @dc_region: array containing info about the DC regions
> > Did this get added to the wrong struct comment header? 'cxl_dev_state' instead of 'cxl_memdev_state'?
> > 
> > DJ
> > 
> > >   * @event: event log driver state
> > >   * @poison: poison driver state info
> > >   * @security: security driver state info
> > > @@ -496,6 +515,8 @@ struct cxl_memdev_state {
> > >  	DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> > >  	DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> > >  	u64 total_bytes;
> > > +	u64 static_bytes;
> > > +	u64 dynamic_bytes;
> > >  	u64 volatile_only_bytes;
> > >  	u64 persistent_only_bytes;
> > >  	u64 partition_align_bytes;
> > > @@ -507,6 +528,9 @@ struct cxl_memdev_state {
> > >  	struct cxl_dpa_perf ram_perf;
> > >  	struct cxl_dpa_perf pmem_perf;
> > >  
> > > +	u8 nr_dc_region;
> > > +	struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> > > +
> > >  	struct cxl_event_state event;
> > >  	struct cxl_poison_state poison;
> > >  	struct cxl_security_state security;
> > > @@ -709,6 +733,32 @@ struct cxl_mbox_set_partition_info {
> > >  
> > >  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
> > >  
> > > +/* See CXL 3.1 Table 8-163 get dynamic capacity config Input Payload */
> > > +struct cxl_mbox_get_dc_config_in {
> > > +	u8 region_count;
> > > +	u8 start_region_index;
> > > +} __packed;
> > > +
> > > +/* See CXL 3.1 Table 8-164 get dynamic capacity config Output Payload */
> > > +struct cxl_mbox_get_dc_config_out {
> > > +	u8 avail_region_count;
> > > +	u8 regions_returned;
> > > +	u8 rsvd[6];
> > > +	/* See CXL 3.1 Table 8-165 */
> > > +	struct cxl_dc_region_config {
> > > +		__le64 region_base;
> > > +		__le64 region_decode_length;
> > > +		__le64 region_length;
> > > +		__le64 region_block_size;
> > > +		__le32 region_dsmad_handle;
> > > +		u8 flags;
> > > +		u8 rsvd[3];
> > > +	} __packed region[];
> > > +	/* Trailing fields unused */
> > > +} __packed;
> > > +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> > > +#define CXL_DCD_BLOCK_LINE_SIZE 0x40
> > > +
> > >  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> > >  struct cxl_mbox_set_timestamp_in {
> > >  	__le64 timestamp;
> > > @@ -832,6 +882,7 @@ enum {
> > >  int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
> > >  			  struct cxl_mbox_cmd *cmd);
> > >  int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> > > +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
> > >  int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> > >  int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> > >  int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> > > @@ -845,6 +896,17 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> > >  			    enum cxl_event_log_type type,
> > >  			    enum cxl_event_type event_type,
> > >  			    const uuid_t *uuid, union cxl_event *evt);
> > > +
> > > +static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> > > +{
> > > +	return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> > > +}
> > > +
> > > +static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
> > > +{
> > > +	clear_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> > > +}
> > > +
> > >  int cxl_set_timestamp(struct cxl_memdev_state *mds);
> > >  int cxl_poison_state_init(struct cxl_memdev_state *mds);
> > >  int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> > > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > > index 3a60cd66263e..f7f03599bc83 100644
> > > --- a/drivers/cxl/pci.c
> > > +++ b/drivers/cxl/pci.c
> > > @@ -874,6 +874,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > >  	if (rc)
> > >  		return rc;
> > >  
> > > +	rc = cxl_dev_dynamic_capacity_identify(mds);
> > > +	if (rc)
> > > +		cxl_disable_dcd(mds);
> > > +
> > >  	rc = cxl_mem_create_range_info(mds);
> > >  	if (rc)
> > >  		return rc;
> > > 
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 06/25] cxl/mem: Read dynamic capacity configuration from the device
  2024-08-20 17:01     ` Fan Ni
  2024-08-23  2:01       ` Ira Weiny
@ 2024-08-23  2:02       ` Ira Weiny
  1 sibling, 0 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-23  2:02 UTC (permalink / raw)
  To: Fan Ni, Dave Jiang
  Cc: ira.weiny, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm, Li, Ming

Fan Ni wrote:
> On Fri, Aug 16, 2024 at 02:45:47PM -0700, Dave Jiang wrote:

[snip]

> > > +
> > > +/**
> > > + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> > > + *					 information from the device.
> > > + * @mds: The memory device state
> > > + *
> > > + * Read Dynamic Capacity information from the device and populate the state
> > > + * structures for later use.
> > > + *
> > > + * Return: 0 if identify was executed successfully, -ERRNO on error.
> > > + */
> > > +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> > > +{
> > > +	size_t dc_resp_size = mds->payload_size;
> > > +	struct device *dev = mds->cxlds.dev;
> > > +	u8 start_region, i;
> > > +
> > > +	for (i = 0; i < CXL_MAX_DC_REGION; i++)
> > > +		snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> > > +
> > > +	if (!cxl_dcd_supported(mds)) {
> > > +		dev_dbg(dev, "DCD not supported\n");
> > > +		return 0;
> > > +	}
> > 
> > This should happen before you pre-format the name string? I would assume that if DCD is not supported then the dcd name sysfs attribs would be not be visible?
> > 

No this string is not used for sysfs.  It is used to label the dpa
resources...  That said in review I don't recall why it was necessary to
add the '<nil>' to them by default.  I'm actually going to remove that and
continue testing and if I recall where this was showing up I might add it
back in.

> > > +
> > > +	struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
> > > +					kvmalloc(dc_resp_size, GFP_KERNEL);
> > > +	if (!dc_resp)
> > > +		return -ENOMEM;
> > > +
> > > +	start_region = 0;
> > > +	do {
> > > +		int rc, j;
> > > +
> > > +		rc = cxl_get_dc_config(mds, start_region, dc_resp, dc_resp_size);
> > > +		if (rc < 0) {
> > > +			dev_dbg(dev, "Failed to get DC config: %d\n", rc);
> > > +			return rc;
> > > +		}
> > > +
> > > +		mds->nr_dc_region += rc;
> > > +
> > > +		if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> > > +			dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> > > +				mds->nr_dc_region);
> > > +			return -EINVAL;
> > > +		}
> > > +
> > > +		for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
> > 
> > This should be 'j < mds->nr_dc_region'? Otherwise if your start region say is '3' and you have '2' DC regions, you never enter the loop. Or does that not happen? I also wonder if you need to check if 'start_region + mds->nr_dc_region > CXL_MAX_DC_REGION'.
> > 
> That can not happen, start_region was updated to the number of regions
> has returned till now (not counting the current call), while
> nr_dc_region is the total number of regions returned till now (including
> the current call) as we update it above, so start_region should never be larger
> than nr_dc_region.

Yep.

> 
> > > +			rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> > > +			if (rc) {
> > > +				dev_dbg(dev, "Failed to save region info: %d\n", rc);
> 
> I am not sure why we sometimes use dev_err and sometimes we use dev_dbg
> here, if dcd is supported, error from getting dc configuration is an
> error to me.

We are trying to reduce the dev_err() use.  cxl_dc_save_region_info() has
dev_err() which is much more specific as to the error.  At worse this is
just redundant as a debug.

I'll remove it because the debug output is pretty verbose too.

Ira

> 
> Fan

[snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 09/25] cxl/hdm: Add dynamic capacity size support to endpoint decoders
  2024-08-16 23:08   ` Dave Jiang
@ 2024-08-23  2:26     ` Ira Weiny
  0 siblings, 0 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-23  2:26 UTC (permalink / raw)
  To: Dave Jiang, ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

Dave Jiang wrote:
> 
> 
> On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 

[snip]

> > +static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> > +{
> > +	return mode - CXL_DECODER_DC0;
> > +}
> > +
> > +static int cxl_request_skip(struct cxl_endpoint_decoder *cxled,
> > +			    resource_size_t skip_base, resource_size_t skip_len)
> > +{
> > +	struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
> > +	const char *name = dev_name(&cxled->cxld.dev);
> > +	struct cxl_port *port = cxled_to_port(cxled);
> > +	struct resource *dpa_res = &cxlds->dpa_res;
> > +	struct device *dev = &port->dev;
> > +	struct resource *res;
> > +	int rc;
> > +
> > +	res = __request_region(dpa_res, skip_base, skip_len, name, 0);
> > +	if (!res)
> > +		return -EBUSY;
> > +
> > +	rc = xa_insert(&cxled->skip_res, skip_base, res, GFP_KERNEL);
> 
> Maybe rename skip_res to skip_xa, given most of the vars in CXL with
> _res are 'struct resource' to avoid confusion. See 'dpa_res' above.
> 

Good idea.
[done]

> > +	if (rc) {
> > +		__release_region(dpa_res, skip_base, skip_len);
> > +		return rc;
> > +	}
> > +
> > +	dev_dbg(dev, "decoder%d.%d: skipped space; %pr\n",
> > +		port->id, cxled->cxld.id, res);
> > +	return 0;
> > +}
> > +
> > +static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
> > +				resource_size_t base, resource_size_t skipped)
> > +{
> > +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > +	struct cxl_port *port = cxled_to_port(cxled);
> > +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > +	resource_size_t skip_base = base - skipped;
> > +	struct device *dev = &port->dev;
> > +	resource_size_t skip_len = 0;
> > +	int rc, index;
> > +
> > +	if (resource_size(&cxlds->ram_res) && skip_base <= cxlds->ram_res.end) {
> > +		skip_len = cxlds->ram_res.end - skip_base + 1;
> > +		rc = cxl_request_skip(cxled, skip_base, skip_len);
> > +		if (rc)
> > +			return rc;
> > +		skip_base += skip_len;
> > +	}
> > +
> > +	if (skip_base == base) {
> > +		dev_dbg(dev, "skip done ram!\n");
> > +		return 0;
> > +	}
> > +
> > +	if (resource_size(&cxlds->pmem_res) &&
> > +	    skip_base <= cxlds->pmem_res.end) {
> > +		skip_len = cxlds->pmem_res.end - skip_base + 1;
> > +		rc = cxl_request_skip(cxled, skip_base, skip_len);
> > +		if (rc)
> > +			return rc;
> > +		skip_base += skip_len;
> > +	}
> 
> Does 'skip_base == base' need to be checked here again before going to DCD?

No it is checked below...

> 
> DJ
> 
> > +
> > +	index = dc_mode_to_region_index(cxled->mode);
> > +	for (int i = 0; i <= index; i++) {
> > +		struct resource *dcr = &cxlds->dc_res[i];
> > +
> > +		if (skip_base < dcr->start) {
> > +			skip_len = dcr->start - skip_base;
> > +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> > +			if (rc)
> > +				return rc;
> > +			skip_base += skip_len;
> > +		}
> > +
> > +		if (skip_base == base) {
> > +			dev_dbg(dev, "skip done DC region %d!\n", i);
> > +			break;
> > +		}

... here.

After any skips between pmem and the first DC partition.
Ira

> > +
> > +		if (resource_size(dcr) && skip_base <= dcr->end) {
> > +			if (skip_base > base) {
> > +				dev_err(dev, "Skip error DC region %d; skip_base %pa; base %pa\n",
> > +					i, &skip_base, &base);
> > +				return -ENXIO;
> > +			}
> > +
> > +			skip_len = dcr->end - skip_base + 1;
> > +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> > +			if (rc)
> > +				return rc;
> > +			skip_base += skip_len;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +

[snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 11/25] cxl/mem: Expose DCD partition capabilities in sysfs
  2024-08-16 23:42   ` Dave Jiang
@ 2024-08-23  2:28     ` Ira Weiny
  2024-08-23 14:58       ` Dave Jiang
  2024-08-23 16:14       ` Jonathan Cameron
  0 siblings, 2 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-23  2:28 UTC (permalink / raw)
  To: Dave Jiang, ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

Dave Jiang wrote:
> 
> 
> On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > To properly configure CXL regions on Dynamic Capacity Devices (DCD),
> > user space will need to know the details of the DC partitions available.
> > 
> > Expose dynamic capacity capabilities through sysfs.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes:
> > [iweiny: remove review tags]
> > [Davidlohr/Fan/Jonathan: omit 'dc' attribute directory if device is not DC]
> > [Jonathan: update documentation for dc visibility]
> > [Jonathan: Add a comment to DC region X attributes to ensure visibility checks work]
> > [iweiny: push sysfs version to 6.12]
> > ---
> >  Documentation/ABI/testing/sysfs-bus-cxl | 12 ++++
> >  drivers/cxl/core/memdev.c               | 97 +++++++++++++++++++++++++++++++++
> >  2 files changed, 109 insertions(+)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index 957717264709..6227ae0ab3fc 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -54,6 +54,18 @@ Description:
> >  		identically named field in the Identify Memory Device Output
> >  		Payload in the CXL-2.0 specification.
> >  
> > +What:		/sys/bus/cxl/devices/memX/dc/region_count
> > +		/sys/bus/cxl/devices/memX/dc/regionY_size
> 
> Just make it into 2 separate entries?

Do you mean in the docs?

Ira

> 
> DJ
> > +Date:		August, 2024
> > +KernelVersion:	v6.12
> > +Contact:	linux-cxl@vger.kernel.org
> > +Description:
> > +		(RO) Dynamic Capacity (DC) region information.  The dc
> > +		directory is only visible on devices which support Dynamic
> > +		Capacity.
> > +		The region_count is the number of Dynamic Capacity (DC)
> > +		partitions (regions) supported on the device.
> > +		regionY_size is the size of each of those partitions.
> >  
> >  What:		/sys/bus/cxl/devices/memX/pmem/qos_class
> >  Date:		May, 2023

[snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents
  2024-08-19 18:51   ` Dave Jiang
@ 2024-08-23  2:53     ` Ira Weiny
  0 siblings, 0 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-23  2:53 UTC (permalink / raw)
  To: Dave Jiang, ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

Dave Jiang wrote:
> 
> 
> On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 

[snip]

> > 
> > Process DCD events and create region devices.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> A few nits below, but in general
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Thanks.

> > +
> > +static int online_region_extent(struct region_extent *region_extent)
> > +{
> > +	struct cxl_dax_region *cxlr_dax = region_extent->cxlr_dax;
> > +	struct device *dev;
> > +	int rc;
> > +
> > +	dev = &region_extent->dev;
> 
> Nit. You can move this up to when you declare 'dev'.

[done.]

[snip]

> > +
> > +static int cxl_add_pending(struct cxl_memdev_state *mds)
> > +{
> > +	struct device *dev = mds->cxlds.dev;
> > +	struct cxl_extent *extent;
> > +	unsigned long index;
> > +	unsigned long cnt = 0;
> reverse xmas tree

yep.
[done.]


[snip]

> > +
> > +static int handle_add_event(struct cxl_memdev_state *mds,
> > +			    struct cxl_event_dcd *event)
> > +{
> > +	struct cxl_extent *tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
> for readability I would use *extent instead of *tmp

sure.
[done.]


[snip]

> >  
> > diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h
> > index 0bea1afbd747..eeda8059d81a 100644
> > --- a/include/linux/cxl-event.h
> > +++ b/include/linux/cxl-event.h
> > @@ -96,11 +96,43 @@ struct cxl_event_mem_module {
> >  	u8 reserved[0x3d];
> Previous code, but 61 would be better than 0x3d to be consistent with rest of cxl code

:-(

I get the rest of the code argument.  However, the specification uses hex
for the number of bytes in the definitions.  For this reason I prefer the
use of hex here so that one can better match the code to the spec.

> 
> >  } __packed;
> >  
> > +/*
> > + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
> > + */
> > +#define CXL_EXTENT_TAG_LEN 0x10
> > +struct cxl_extent {
> > +	__le64 start_dpa;
> > +	__le64 length;
> > +	u8 tag[CXL_EXTENT_TAG_LEN];
> > +	__le16 shared_extn_seq;
> > +	u8 reserved[0x6];
> 
> Why not just 6? In general I find it odd that this header uses hex for
> array indexing when the rest of the cxl code uses decimal. 

I was just directly matching the spec.

> 
> > +} __packed;
> > +
> > +/*
> > + * Dynamic Capacity Event Record
> > + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
> > + */
> > +#define CXL_DCD_EVENT_MORE			BIT(0)
> > +struct cxl_event_dcd {
> > +	struct cxl_event_record_hdr hdr;
> > +	u8 event_type;
> > +	u8 validity_flags;
> > +	__le16 host_id;
> > +	u8 region_index;
> > +	u8 flags;
> > +	u8 reserved1[0x2];
> 
> also here, 2?

Same...  I know it is odd when the hex string == the decimal string.

> 
> > +	struct cxl_extent extent;
> > +	u8 reserved2[0x18];
> 
> 24?

same.

Ira

[snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 19/25] cxl/region/extent: Expose region extent information in sysfs
  2024-08-19 19:05   ` Dave Jiang
@ 2024-08-23  2:58     ` Ira Weiny
  2024-08-23 17:17       ` Jonathan Cameron
  0 siblings, 1 reply; 120+ messages in thread
From: Ira Weiny @ 2024-08-23  2:58 UTC (permalink / raw)
  To: Dave Jiang, ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

Dave Jiang wrote:
> 
> 
> On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > Extent information can be helpful to the user to coordinate memory usage
> > with the external orchestrator and FM.
> > 
> > Expose the details of region extents by creating the following
> > sysfs entries.
> > 
> >         /sys/bus/cxl/devices/dax_regionX/extentX.Y
> >         /sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
> >         /sys/bus/cxl/devices/dax_regionX/extentX.Y/length
> >         /sys/bus/cxl/devices/dax_regionX/extentX.Y/tag
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes:
> > [iweiny: split this out]
> > [Jonathan: add documentation for extent sysfs]
> > [Jonathan/djbw: s/label/tag]
> > [Jonathan/djbw: treat tag as uuid]
> > [djbw: use __ATTRIBUTE_GROUPS]
> > [djbw: make tag invisible if it is empty]
> > [djbw/iweiny: use conventional id names for extents; extentX.Y]
> > ---
> >  Documentation/ABI/testing/sysfs-bus-cxl | 13 ++++++++
> >  drivers/cxl/core/extent.c               | 58 +++++++++++++++++++++++++++++++++
> >  2 files changed, 71 insertions(+)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index 3a5ee88e551b..e97e6a73c960 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -599,3 +599,16 @@ Description:
> >  		See Documentation/ABI/stable/sysfs-devices-node. access0 provides
> >  		the number to the closest initiator and access1 provides the
> >  		number to the closest CPU.
> > +
> > +What:		/sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
> > +		/sys/bus/cxl/devices/dax_regionX/extentX.Y/length
> > +		/sys/bus/cxl/devices/dax_regionX/extentX.Y/tag
> 
> I wonder consider an entry for each with their own descriptions, which seems to be the standard practice.

:-/  Except kind of for the access'.

What:           /sys/bus/cxl/devices/regionZ/accessY/read_bandwidth
                /sys/bus/cxl/devices/regionZ/accessY/write_banwidth

What:           /sys/bus/cxl/devices/regionZ/accessY/read_latency
                /sys/bus/cxl/devices/regionZ/accessY/write_latency

But I think you have a point.

Ira

> 
> DJ
> 

[snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 21/25] dax/region: Create resources on sparse DAX regions
  2024-08-19 23:30   ` Dave Jiang
@ 2024-08-23 14:28     ` Ira Weiny
  0 siblings, 0 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-23 14:28 UTC (permalink / raw)
  To: Dave Jiang, ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

Dave Jiang wrote:
> 
> 
> On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 

[snip]

> > diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> > index d7d526a51e2b..103b0bec3a4a 100644
> > --- a/drivers/cxl/core/extent.c
> > +++ b/drivers/cxl/core/extent.c
> > @@ -271,20 +271,67 @@ static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
> >  	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
> >  }
> >  
> > +static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> > +			      struct region_extent *region_extent)
> > +{
> > +	struct cxl_dax_region *cxlr_dax;
> > +	struct device *dev;
> > +	int rc = 0;
> > +
> > +	cxlr_dax = cxlr->cxlr_dax;
> > +	dev = &cxlr_dax->dev;
> > +	dev_dbg(dev, "Trying notify: type %d HPA %par\n",
> > +		event, &region_extent->hpa_range);
> > +
> > +	/*
> > +	 * NOTE the lack of a driver indicates a notification has failed.  No
> > +	 * user space coordiantion was possible.
> > +	 */
> > +	device_lock(dev);
> > +	if (dev->driver) {
> > +		struct cxl_driver *driver = to_cxl_drv(dev->driver);
> > +		struct cxl_notify_data notify_data = (struct cxl_notify_data) {
> > +			.event = event,
> > +			.region_extent = region_extent,
> > +		};
> > +
> > +		if (driver->notify) {
> > +			dev_dbg(dev, "Notify: type %d HPA %par\n",
> > +				event, &region_extent->hpa_range);
> > +			rc = driver->notify(dev, &notify_data);
> > +		}
> > +	}
> > +	device_unlock(dev);
> 
> Maybe a cleaner version:
> 	guard(device)(dev);
> 	if (!dev->driver || !dev->driver->notify)
> 		return 0;

There is no dev->driver->notify.  But this works.

        if (!dev->driver)
                return 0;
        driver = to_cxl_drv(dev->driver);
        if (!driver->notify)
                return 0;

Not quite as clean but I did miss the use of guard.

> 
> 	dev_dbg(...);
> 	return driver->notify(dev, &notify_data);
>

I've cleaned it up.

[snip]

> > +
> > +int dax_region_add_resource(struct dax_region *dax_region,
> > +			    struct device *device,
> > +			    resource_size_t start, resource_size_t length)
> >
> kdoc header?

Because dax_region_add_resource() is part of the DAX private interfaces
and not intended to be used outside the DAX subsystem I skipped the kdoc
here even though the function must be exported.  Same for
dax_region_rm_resource() and dax_avail_size().

This is similar to run_dax() in that it is designed as a 'generic
operation' within the dax subsystem but not generally useful to the
kernel.

For now I'll move their declarations in dax-private.h and make a similar
comment.

Ira

> 
>  +{
> > +	struct resource *new_resource;
> > +	int rc;
> > +
> > +	struct dax_resource *dax_resource __free(kfree) =
> > +				kzalloc(sizeof(*dax_resource), GFP_KERNEL);
> > +	if (!dax_resource)
> > +		return -ENOMEM;
> > +
> > +	guard(rwsem_write)(&dax_region_rwsem);
> > +
> > +	dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
> > +	new_resource = __request_region(&dax_region->res, start, length, "extent", 0);
> > +	if (!new_resource) {
> > +		dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
> > +			&start, &length);
> > +		return -ENOSPC;
> > +	}
> > +
> > +	dev_dbg(dax_region->dev, "add resource %pr\n", new_resource);
> > +	dax_resource->region = dax_region;
> > +	dax_resource->res = new_resource;
> > +	dev_set_drvdata(device, dax_resource);
> > +	rc = devm_add_action_or_reset(device, dax_release_resource,
> > +				      no_free_ptr(dax_resource));
> > +	/*  On error; ensure driver data is cleared under semaphore */
> > +	if (rc)
> > +		dev_set_drvdata(device, NULL);
> > +	return rc;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_region_add_resource);
> > +
> > +int dax_region_rm_resource(struct dax_region *dax_region,
> > +			   struct device *dev)
> 
> kdoc header
> > +{
> > +	struct dax_resource *dax_resource;
> > +
> > +	guard(rwsem_write)(&dax_region_rwsem);
> > +
> > +	dax_resource = dev_get_drvdata(dev);
> > +	if (!dax_resource)
> > +		return 0;
> > +
> > +	if (dax_resource->use_cnt)
> > +		return -EBUSY;
> > +
> > +	/* avoid races with users trying to use the extent */
> > +	__dax_release_resource(dax_resource);
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_region_rm_resource);
> > +
> >  bool static_dev_dax(struct dev_dax *dev_dax)
> >  {
> >  	return is_static(dev_dax->region);
> > @@ -296,19 +373,44 @@ static ssize_t region_align_show(struct device *dev,
> >  static struct device_attribute dev_attr_region_align =
> >  		__ATTR(align, 0400, region_align_show, NULL);
> >  
> > +#define for_each_child_resource(extent, res) \
> > +	for (res = (extent)->child; res; res = res->sibling)
> > +
> > +resource_size_t
> > +dax_avail_size(struct resource *dax_resource)
> kdoc header
> 
> DJ
> 

[snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 11/25] cxl/mem: Expose DCD partition capabilities in sysfs
  2024-08-23  2:28     ` Ira Weiny
@ 2024-08-23 14:58       ` Dave Jiang
  2024-08-23 16:14       ` Jonathan Cameron
  1 sibling, 0 replies; 120+ messages in thread
From: Dave Jiang @ 2024-08-23 14:58 UTC (permalink / raw)
  To: Ira Weiny, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm



On 8/22/24 7:28 PM, Ira Weiny wrote:
> Dave Jiang wrote:
>>
>>
>> On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
>>> From: Navneet Singh <navneet.singh@intel.com>
>>>
>>> To properly configure CXL regions on Dynamic Capacity Devices (DCD),
>>> user space will need to know the details of the DC partitions available.
>>>
>>> Expose dynamic capacity capabilities through sysfs.
>>>
>>> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
>>> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
>>> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>>>
>>> ---
>>> Changes:
>>> [iweiny: remove review tags]
>>> [Davidlohr/Fan/Jonathan: omit 'dc' attribute directory if device is not DC]
>>> [Jonathan: update documentation for dc visibility]
>>> [Jonathan: Add a comment to DC region X attributes to ensure visibility checks work]
>>> [iweiny: push sysfs version to 6.12]
>>> ---
>>>  Documentation/ABI/testing/sysfs-bus-cxl | 12 ++++
>>>  drivers/cxl/core/memdev.c               | 97 +++++++++++++++++++++++++++++++++
>>>  2 files changed, 109 insertions(+)
>>>
>>> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
>>> index 957717264709..6227ae0ab3fc 100644
>>> --- a/Documentation/ABI/testing/sysfs-bus-cxl
>>> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
>>> @@ -54,6 +54,18 @@ Description:
>>>  		identically named field in the Identify Memory Device Output
>>>  		Payload in the CXL-2.0 specification.
>>>  
>>> +What:		/sys/bus/cxl/devices/memX/dc/region_count
>>> +		/sys/bus/cxl/devices/memX/dc/regionY_size
>>
>> Just make it into 2 separate entries?
> 
> Do you mean in the docs?

Yes. Here you are combining all the sysfs entries into 1. I'm suggesting unique block per each sysfs entry with their own description. 
> 
> Ira
> 
>>
>> DJ
>>> +Date:		August, 2024
>>> +KernelVersion:	v6.12
>>> +Contact:	linux-cxl@vger.kernel.org
>>> +Description:
>>> +		(RO) Dynamic Capacity (DC) region information.  The dc
>>> +		directory is only visible on devices which support Dynamic
>>> +		Capacity.
>>> +		The region_count is the number of Dynamic Capacity (DC)
>>> +		partitions (regions) supported on the device.
>>> +		regionY_size is the size of each of those partitions.
>>>  
>>>  What:		/sys/bus/cxl/devices/memX/pmem/qos_class
>>>  Date:		May, 2023
> 
> [snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 03/25] dax: Document dax dev range tuple
  2024-08-16 14:44 ` [PATCH v3 03/25] dax: Document dax dev range tuple Ira Weiny
  2024-08-16 20:58   ` Dave Jiang
@ 2024-08-23 15:29   ` Jonathan Cameron
  1 sibling, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-23 15:29 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:11 -0500
Ira Weiny <ira.weiny@intel.com> wrote:

> The device DAX structure is being enhanced to track additional DCD
> information.
> 
> The current range tuple was not fully documented.  Document it prior to
> adding information for DC.
> 
> Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes:
> [iweiny: move to start of series]
> ---
>  drivers/dax/dax-private.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index 446617b73aea..ccde98c3d4e2 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -58,7 +58,10 @@ struct dax_mapping {
>   * @dev - device core
>   * @pgmap - pgmap for memmap setup / lifetime (driver owned)
>   * @nr_range: size of @ranges
> - * @ranges: resource-span + pgoff tuples for the instance
> + * @ranges: range tuples of memory used
> + * @pgoff: page offset
> + * @range: resource-span
> + * @mapping: device to assist in interrogating the range layout
I think the kernel doc format for this should be
@ranges.pgoff: etc
https://docs.kernel.org/doc-guide/kernel-doc.html#nested-structs-unions

Though not quite sure what happens for pointers to structures, maybe
this is correct as it stands?

>   */
>  struct dev_dax {
>  	struct dax_region *region;
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 06/25] cxl/mem: Read dynamic capacity configuration from the device
  2024-08-16 14:44 ` [PATCH v3 06/25] cxl/mem: Read dynamic capacity configuration from the device ira.weiny
  2024-08-16 21:45   ` Dave Jiang
@ 2024-08-23 15:45   ` Jonathan Cameron
  1 sibling, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-23 15:45 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm, Li, Ming

On Fri, 16 Aug 2024 09:44:14 -0500
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> Devices which optionally support Dynamic Capacity (DC) are configured
> via mailbox commands.  CXL 3.1 requires the host to issue the Get DC
> Configuration command in order to properly configure DCDs.  Without the
> Get DC Configuration command DCD can't be supported.
> 
> Implement the DC mailbox commands as specified in CXL 3.1 section
> 8.2.9.9.9 (opcodes 48XXh) to read and store the DCD configuration
> information.  Disable DCD if DCD is not supported.  Leverage the Get DC
> Configuration command supported bit to indicate if DCD support.
> 
> Linux has no use for the trailing fields of the Get Dynamic Capacity
> Configuration Output Payload (Total number of supported extents, number
> of available extents, total number of supported tags, and number of
> available tags).  Avoid defining those fields to use the more useful
> dynamic C array.
> 
> Cc: "Li, Ming" <ming4.li@intel.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
LGTM
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
If you can get rid of the <nil> thing even better.




^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 07/25] cxl/core: Separate region mode from decoder mode
  2024-08-16 14:44 ` [PATCH v3 07/25] cxl/core: Separate region mode from decoder mode ira.weiny
  2024-08-16 22:11   ` Dave Jiang
@ 2024-08-23 15:47   ` Jonathan Cameron
  2024-09-03  6:56   ` Li, Ming4
  2 siblings, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-23 15:47 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:15 -0500
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> Until now region modes and decoder modes were equivalent in that both
> modes were either PMEM or RAM.  The addition of Dynamic
> Capacity partitions defines up to 8 DC partitions per device.
> 
> The region mode is thus no longer equivalent to the endpoint decoder
> mode.  IOW the endpoint decoders may have modes of DC0-DC7 while the
> region mode is simply DC.
> 
> Define a new region mode enumeration which applies to regions separate
> from the decoder mode.  Adjust the code to process these modes
> independently.
> 
> There is no equal to decoder mode dead in region modes.  Avoid
> constructing regions with decoders which have been flagged as dead.
> 
> Suggested-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 09/25] cxl/hdm: Add dynamic capacity size support to endpoint decoders
  2024-08-16 14:44 ` [PATCH v3 09/25] cxl/hdm: Add dynamic capacity size support to endpoint decoders ira.weiny
  2024-08-16 23:08   ` Dave Jiang
@ 2024-08-23 16:09   ` Jonathan Cameron
  1 sibling, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-23 16:09 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:17 -0500
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> To support Dynamic Capacity Devices (DCD) endpoint decoders will need to
> map DC partitions (regions).  In addition to assigning the size of the
> DC partition, the decoder must assign any skip value from the previous
> decoder.  This must be done within a contiguous DPA space.
> 
> Two complications arise with Dynamic Capacity regions which did not
> exist with Ram and PMEM partitions.  First, gaps in the DPA space can
> exist between and around the DC partitions.  Second, the Linux resource
> tree does not allow a resource to be marked across existing nodes within
> a tree.
> 
> For clarity, below is an example of an 60GB device with 10GB of RAM,
> 10GB of PMEM and 10GB for each of 2 DC partitions.  The desired CXL
> mapping is 5GB of RAM, 5GB of PMEM, and 5GB of DC1.
> 
>      DPA RANGE
>      (dpa_res)
> 0GB        10GB       20GB       30GB       40GB       50GB       60GB
> |----------|----------|----------|----------|----------|----------|
> 
> RAM         PMEM                  DC0                   DC1
>  (ram_res)  (pmem_res)            (dc_res[0])           (dc_res[1])
> |----------|----------|   <gap>  |----------|   <gap>  |----------|
> 
>  RAM        PMEM                                        DC1
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|
> 0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB       60GB
> 
> The previous skip resource between RAM and PMEM was always a child of
> the RAM resource and fit nicely [see (S) below].  Because of this
> simplicity this skip resource reference was not stored in any CXL state.
> On release the skip range could be calculated based on the endpoint
> decoders stored values.
> 
> Now when DC1 is being mapped 4 skip resources must be created as
> children.  One for the PMEM resource (A), two of the parent DPA resource
> (B,D), and one more child of the DC0 resource (C).
> 
> 0GB        10GB       20GB       30GB       40GB       50GB       60GB
> |----------|----------|----------|----------|----------|----------|
>                            |                     |
> |----------|----------|    |     |----------|    |     |----------|
>         |          |       |          |          |
>        (S)        (A)     (B)        (C)        (D)
> 	v          v       v          v          v
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|
>        skip       skip  skip        skip      skip
> 
> Expand the calculation of DPA free space and enhance the logic to
> support this more complex skipping.  To track the potential of multiple
> skip resources an xarray is attached to the endpoint decoder.  The
> existing algorithm between RAM and PMEM is consolidated within the new
> one to streamline the code even though the result is the storage of a
> single skip resource in the xarray.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
One query below + request to add a comment on it for when I've
again completely forgotten how this works.

Also a grumpy reviewer comment.

> +static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
> +				resource_size_t base, resource_size_t skipped)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_port *port = cxled_to_port(cxled);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	resource_size_t skip_base = base - skipped;
> +	struct device *dev = &port->dev;
> +	resource_size_t skip_len = 0;
> +	int rc, index;
> +

> +	index = dc_mode_to_region_index(cxled->mode);
> +	for (int i = 0; i <= index; i++) {

I'm not sure why this is <= so maybe a comment?

> +		struct resource *dcr = &cxlds->dc_res[i];
> +
> +		if (skip_base < dcr->start) {
> +			skip_len = dcr->start - skip_base;
> +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> +			if (rc)
> +				return rc;
> +			skip_base += skip_len;
> +		}
> +
> +		if (skip_base == base) {
> +			dev_dbg(dev, "skip done DC region %d!\n", i);
> +			break;
> +		}
> +
> +		if (resource_size(dcr) && skip_base <= dcr->end) {
> +			if (skip_base > base) {
> +				dev_err(dev, "Skip error DC region %d; skip_base %pa; base %pa\n",
> +					i, &skip_base, &base);
> +				return -ENXIO;
> +			}
> +
> +			skip_len = dcr->end - skip_base + 1;
> +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> +			if (rc)
> +				return rc;
> +			skip_base += skip_len;
> +		}
> +	}
> +
> +	return 0;
> +}

> @@ -466,8 +588,8 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  
>  int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  {
> -	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
>  	resource_size_t free_ram_start, free_pmem_start;
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);

Patch noise.  Put it back where it was! (assuming I haven't failed to spot the difference)

>  	struct cxl_port *port = cxled_to_port(cxled);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct device *dev = &cxled->cxld.dev;

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/25] cxl/port: Add endpoint decoder DC mode support to sysfs
  2024-08-16 14:44 ` [PATCH v3 10/25] cxl/port: Add endpoint decoder DC mode support to sysfs ira.weiny
  2024-08-16 23:17   ` Dave Jiang
@ 2024-08-23 16:12   ` Jonathan Cameron
  1 sibling, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-23 16:12 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:18 -0500
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> Endpoint decoder mode is used to represent the partition the decoder
> points to such as ram or pmem.
> 
> Expand the mode to allow a decoder to point to a specific DC partition
> (Region).
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 11/25] cxl/mem: Expose DCD partition capabilities in sysfs
  2024-08-23  2:28     ` Ira Weiny
  2024-08-23 14:58       ` Dave Jiang
@ 2024-08-23 16:14       ` Jonathan Cameron
  1 sibling, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-23 16:14 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Thu, 22 Aug 2024 21:28:41 -0500
Ira Weiny <ira.weiny@intel.com> wrote:

> Dave Jiang wrote:
> > 
> > 
> > On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:  
> > > From: Navneet Singh <navneet.singh@intel.com>
> > > 
> > > To properly configure CXL regions on Dynamic Capacity Devices (DCD),
> > > user space will need to know the details of the DC partitions available.
> > > 
> > > Expose dynamic capacity capabilities through sysfs.
> > > 
> > > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > ---
> > > Changes:
> > > [iweiny: remove review tags]
> > > [Davidlohr/Fan/Jonathan: omit 'dc' attribute directory if device is not DC]
> > > [Jonathan: update documentation for dc visibility]
> > > [Jonathan: Add a comment to DC region X attributes to ensure visibility checks work]
> > > [iweiny: push sysfs version to 6.12]
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-cxl | 12 ++++
> > >  drivers/cxl/core/memdev.c               | 97 +++++++++++++++++++++++++++++++++
> > >  2 files changed, 109 insertions(+)
> > > 
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > index 957717264709..6227ae0ab3fc 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > @@ -54,6 +54,18 @@ Description:
> > >  		identically named field in the Identify Memory Device Output
> > >  		Payload in the CXL-2.0 specification.
> > >  
> > > +What:		/sys/bus/cxl/devices/memX/dc/region_count
> > > +		/sys/bus/cxl/devices/memX/dc/regionY_size  
> > 
> > Just make it into 2 separate entries?  
> 
> Do you mean in the docs?

Assuming yes, then I think it would be cleaner as two separate entries
+ Maybe even one for the directory which can then have
the visibility statement.

> 
> Ira
> 
> > 
> > DJ  
> > > +Date:		August, 2024
> > > +KernelVersion:	v6.12
> > > +Contact:	linux-cxl@vger.kernel.org
> > > +Description:
> > > +		(RO) Dynamic Capacity (DC) region information.  The dc
> > > +		directory is only visible on devices which support Dynamic
> > > +		Capacity.
> > > +		The region_count is the number of Dynamic Capacity (DC)
> > > +		partitions (regions) supported on the device.
> > > +		regionY_size is the size of each of those partitions.
> > >  
> > >  What:		/sys/bus/cxl/devices/memX/pmem/qos_class
> > >  Date:		May, 2023  
> 
> [snip]
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 12/25] cxl/region: Refactor common create region code
  2024-08-16 14:44 ` [PATCH v3 12/25] cxl/region: Refactor common create region code Ira Weiny
  2024-08-16 23:43   ` Dave Jiang
  2024-08-22 18:51   ` Fan Ni
@ 2024-08-23 16:17   ` Jonathan Cameron
  2024-09-03  7:04   ` Li, Ming4
  3 siblings, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-23 16:17 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:20 -0500
Ira Weiny <ira.weiny@intel.com> wrote:

> create_pmem_region_store() and create_ram_region_store() are identical
> with the exception of the region mode.  With the addition of DC region
> mode this would end up being 3 copies of the same code.
> 
> Refactor create_pmem_region_store() and create_ram_region_store() to use
> a single common function to be used in subsequent DC code.
> 
> Suggested-by: Fan Ni <fan.ni@samsung.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Make sense

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


> ---
>  drivers/cxl/core/region.c | 28 +++++++++++-----------------
>  1 file changed, 11 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 650fe33f2ed4..f85b26b39b2f 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2553,9 +2553,8 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
>  	return devm_cxl_add_region(cxlrd, id, mode, CXL_DECODER_HOSTONLYMEM);
>  }
>  
> -static ssize_t create_pmem_region_store(struct device *dev,
> -					struct device_attribute *attr,
> -					const char *buf, size_t len)
> +static ssize_t create_region_store(struct device *dev, const char *buf,
> +				   size_t len, enum cxl_region_mode mode)
>  {
>  	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
>  	struct cxl_region *cxlr;
> @@ -2565,31 +2564,26 @@ static ssize_t create_pmem_region_store(struct device *dev,
>  	if (rc != 1)
>  		return -EINVAL;
>  
> -	cxlr = __create_region(cxlrd, CXL_REGION_PMEM, id);
> +	cxlr = __create_region(cxlrd, mode, id);
>  	if (IS_ERR(cxlr))
>  		return PTR_ERR(cxlr);
>  
>  	return len;
>  }
> +
> +static ssize_t create_pmem_region_store(struct device *dev,
> +					struct device_attribute *attr,
> +					const char *buf, size_t len)
> +{
> +	return create_region_store(dev, buf, len, CXL_REGION_PMEM);
> +}
>  DEVICE_ATTR_RW(create_pmem_region);
>  
>  static ssize_t create_ram_region_store(struct device *dev,
>  				       struct device_attribute *attr,
>  				       const char *buf, size_t len)
>  {
> -	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
> -	struct cxl_region *cxlr;
> -	int rc, id;
> -
> -	rc = sscanf(buf, "region%d\n", &id);
> -	if (rc != 1)
> -		return -EINVAL;
> -
> -	cxlr = __create_region(cxlrd, CXL_REGION_RAM, id);
> -	if (IS_ERR(cxlr))
> -		return PTR_ERR(cxlr);
> -
> -	return len;
> +	return create_region_store(dev, buf, len, CXL_REGION_RAM);
>  }
>  DEVICE_ATTR_RW(create_ram_region);
>  
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 13/25] cxl/region: Add sparse DAX region support
  2024-08-16 14:44 ` [PATCH v3 13/25] cxl/region: Add sparse DAX region support ira.weiny
  2024-08-16 23:51   ` Dave Jiang
  2024-08-22 18:50   ` Fan Ni
@ 2024-08-23 16:59   ` Jonathan Cameron
  2024-09-03  2:15   ` Li, Ming4
  3 siblings, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-23 16:59 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:21 -0500
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> Dynamic Capacity CXL regions must allow memory to be added or removed
> dynamically.  In addition to the quantity of memory available the
> location of the memory within a DC partition is dynamic based on the
> extents offered by a device.  CXL DAX regions must accommodate the
> sparseness of this memory in the management of DAX regions and devices.
> 
> Introduce the concept of a sparse DAX region.  Add a create_dc_region()
> sysfs entry to create such regions.  Special case DC capable regions to
> create a 0 sized seed DAX device to maintain compatibility which
> requires a default DAX device to hold a region reference.
> 
> Indicate 0 byte available capacity until such time that capacity is
> added.
> 
> Sparse regions complicate the range mapping of dax devices.  There is no
> known use case for range mapping on sparse regions.  Avoid the
> complication by preventing range mapping of dax devices on sparse
> regions.
> 
> Interleaving is deferred for now.  Add checks.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 14/25] cxl/events: Split event msgnum configuration from irq setup
  2024-08-16 14:44 ` [PATCH v3 14/25] cxl/events: Split event msgnum configuration from irq setup Ira Weiny
  2024-08-16 23:57   ` Dave Jiang
  2024-08-22 21:39   ` Fan Ni
@ 2024-08-23 17:01   ` Jonathan Cameron
  2024-09-03  7:06   ` Li, Ming4
  3 siblings, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-23 17:01 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:22 -0500
Ira Weiny <ira.weiny@intel.com> wrote:

> Dynamic Capacity Devices (DCD) require event interrupts to process
> memory addition or removal.  BIOS may have control over non-DCD event
> processing.  DCD interrupt configuration needs to be separate from
> memory event interrupt configuration.
> 
> Split cxl_event_config_msgnums() from irq setup in preparation for
> separate DCD interrupts configuration.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 16/25] cxl/mem: Configure dynamic capacity interrupts
  2024-08-16 14:44 ` [PATCH v3 16/25] cxl/mem: Configure dynamic capacity interrupts ira.weiny
  2024-08-17  0:02   ` Dave Jiang
@ 2024-08-23 17:08   ` Jonathan Cameron
  2024-09-03  7:09   ` Li, Ming4
  2 siblings, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-23 17:08 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:24 -0500
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> Dynamic Capacity Devices (DCD) support extent change notifications
> through the event log mechanism.  The interrupt mailbox commands were
> extended in CXL 3.1 to support these notifications.  Firmware can't
> configure DCD events to be FW controlled but can retain control of
> memory events.
> 
> Configure DCD event log interrupts on devices supporting dynamic
> capacity.  Disable DCD if interrupts are not supported.
> 
> Care is taken to preserve the interrupt policy set by the FW if FW first
> has been selected by the BIOS.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Minor thing on naming inline.  Either way
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> 
> ---
> Changes:
> [iweiny: update commit message]
> [iweiny: rebase to upstream irq code]
> [iweiny: disable DCD if irqs not supported]
> [Jonathan: formatting fix]
> [Fan: add text to debug print]
> [djiang: make dcd helpers inline]
> ---
>  drivers/cxl/cxlmem.h |  2 ++
>  drivers/cxl/pci.c    | 72 +++++++++++++++++++++++++++++++++++++++++++---------
>  2 files changed, 62 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index b4eb8164d05d..d41bec5433db 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -225,7 +225,9 @@ struct cxl_event_interrupt_policy {
>  	u8 warn_settings;
>  	u8 failure_settings;
>  	u8 fatal_settings;
> +	u8 dcd_settings;
>  } __packed;
> +#define CXL_EVENT_INT_POLICY_BASE_SIZE 4 /* info, warn, failure, fatal */
>  
>  /**
>   * struct cxl_event_state - Event log driver state
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 370c74eae323..e5430c4e3a3b 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -669,22 +669,33 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
>  }
>  
>  static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
> -				    struct cxl_event_interrupt_policy *policy)
> +				    struct cxl_event_interrupt_policy *policy,
> +				    bool native_cxl)
Maybe carry through the native_cxl_error naming?

>  {
> +	size_t size_in = CXL_EVENT_INT_POLICY_BASE_SIZE;
>  	struct cxl_mbox_cmd mbox_cmd;
>  	int rc;
>  
> -	*policy = (struct cxl_event_interrupt_policy) {
> -		.info_settings = CXL_INT_MSI_MSIX,
> -		.warn_settings = CXL_INT_MSI_MSIX,
> -		.failure_settings = CXL_INT_MSI_MSIX,
> -		.fatal_settings = CXL_INT_MSI_MSIX,
> -	};
> +	/* memory event policy is left if FW has control */
> +	if (native_cxl) {
> +		*policy = (struct cxl_event_interrupt_policy) {
> +			.info_settings = CXL_INT_MSI_MSIX,
> +			.warn_settings = CXL_INT_MSI_MSIX,
> +			.failure_settings = CXL_INT_MSI_MSIX,
> +			.fatal_settings = CXL_INT_MSI_MSIX,
> +			.dcd_settings = 0,
> +		};
> +	}
> +
> +	if (cxl_dcd_supported(mds)) {
> +		policy->dcd_settings = CXL_INT_MSI_MSIX;
> +		size_in += sizeof(policy->dcd_settings);
> +	}
>  
>  	mbox_cmd = (struct cxl_mbox_cmd) {
>  		.opcode = CXL_MBOX_OP_SET_EVT_INT_POLICY,
>  		.payload_in = policy,
> -		.size_in = sizeof(*policy),
> +		.size_in = size_in,
>  	};
>  
>  	rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> @@ -731,6 +742,31 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
>  	return 0;
>  }

> +
>  static bool cxl_event_int_is_fw(u8 setting)
>  {
>  	u8 mode = FIELD_GET(CXLDEV_EVENT_INT_MODE_MASK, setting);
> @@ -757,17 +793,25 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  			    struct cxl_memdev_state *mds, bool irq_avail)
>  {
>  	struct cxl_event_interrupt_policy policy = { 0 };
> +	bool native_cxl = host_bridge->native_cxl_error;

Maybe keep the native_cxl_error naming for the local variable as well?


>  	int rc;
>  
>  	/*
>  	 * When BIOS maintains CXL error reporting control, it will process
>  	 * event records.  Only one agent can do so.
> +	 *
> +	 * If BIOS has control of events and DCD is not supported skip event
> +	 * configuration.
>  	 */
> -	if (!host_bridge->native_cxl_error)
> +	if (!native_cxl && !cxl_dcd_supported(mds))
>  		return 0;
>  
>  	if (!irq_avail) {
>  		dev_info(mds->cxlds.dev, "No interrupt support, disable event processing.\n");
> +		if (cxl_dcd_supported(mds)) {
> +			dev_info(mds->cxlds.dev, "DCD requires interrupts, disable DCD\n");
> +			cxl_disable_dcd(mds);
> +		}
>  		return 0;
>  	}
>  
> @@ -775,10 +819,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  	if (rc)
>  		return rc;
>  
> -	if (!cxl_event_validate_mem_policy(mds, &policy))
> +	if (native_cxl && !cxl_event_validate_mem_policy(mds, &policy))
>  		return -EBUSY;
>  
> -	rc = cxl_event_config_msgnums(mds, &policy);
> +	rc = cxl_event_config_msgnums(mds, &policy, native_cxl);
>  	if (rc)
>  		return rc;
>  
> @@ -786,12 +830,16 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  	if (rc)
>  		return rc;
>  
> -	rc = cxl_event_irqsetup(mds, &policy);
> +	rc = cxl_irqsetup(mds, &policy, native_cxl);
>  	if (rc)
>  		return rc;
>  
>  	cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
>  
> +	dev_dbg(mds->cxlds.dev, "Event config : %s DCD %s\n",
> +		native_cxl ? "OS" : "BIOS",
> +		cxl_dcd_supported(mds) ? "supported" : "not supported");
> +
>  	return 0;
>  }
>  
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 17/25] cxl/core: Return endpoint decoder information from region search
  2024-08-16 14:44 ` [PATCH v3 17/25] cxl/core: Return endpoint decoder information from region search Ira Weiny
  2024-08-19 16:35   ` Dave Jiang
@ 2024-08-23 17:12   ` Jonathan Cameron
  2024-09-03  7:10   ` Li, Ming4
  2 siblings, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-23 17:12 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:25 -0500
Ira Weiny <ira.weiny@intel.com> wrote:

> cxl_dpa_to_region() finds the region from a <DPA, device> tuple.
> The search involves finding the device endpoint decoder as well.
> 
> Dynamic capacity extent processing uses the endpoint decoder HPA
> information to calculate the HPA offset.  In addition, well behaved
> extents should be contained within an endpoint decoder.
> 
> Return the endpoint decoder found to be used in subsequent DCD code.
Maybe make this
Optionally return the ...

> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 19/25] cxl/region/extent: Expose region extent information in sysfs
  2024-08-23  2:58     ` Ira Weiny
@ 2024-08-23 17:17       ` Jonathan Cameron
  0 siblings, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-23 17:17 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Thu, 22 Aug 2024 21:58:02 -0500
Ira Weiny <ira.weiny@intel.com> wrote:

> Dave Jiang wrote:
> > 
> > 
> > On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:  
> > > From: Navneet Singh <navneet.singh@intel.com>
> > > 
> > > Extent information can be helpful to the user to coordinate memory usage
> > > with the external orchestrator and FM.
> > > 
> > > Expose the details of region extents by creating the following
> > > sysfs entries.
> > > 
> > >         /sys/bus/cxl/devices/dax_regionX/extentX.Y
> > >         /sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
> > >         /sys/bus/cxl/devices/dax_regionX/extentX.Y/length
> > >         /sys/bus/cxl/devices/dax_regionX/extentX.Y/tag
> > > 
> > > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > ---
> > > Changes:
> > > [iweiny: split this out]
> > > [Jonathan: add documentation for extent sysfs]
> > > [Jonathan/djbw: s/label/tag]
> > > [Jonathan/djbw: treat tag as uuid]
> > > [djbw: use __ATTRIBUTE_GROUPS]
> > > [djbw: make tag invisible if it is empty]
> > > [djbw/iweiny: use conventional id names for extents; extentX.Y]
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-cxl | 13 ++++++++
> > >  drivers/cxl/core/extent.c               | 58 +++++++++++++++++++++++++++++++++
> > >  2 files changed, 71 insertions(+)
> > > 
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > index 3a5ee88e551b..e97e6a73c960 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > @@ -599,3 +599,16 @@ Description:
> > >  		See Documentation/ABI/stable/sysfs-devices-node. access0 provides
> > >  		the number to the closest initiator and access1 provides the
> > >  		number to the closest CPU.
> > > +
> > > +What:		/sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
> > > +		/sys/bus/cxl/devices/dax_regionX/extentX.Y/length
> > > +		/sys/bus/cxl/devices/dax_regionX/extentX.Y/tag  
> > 
> > I wonder consider an entry for each with their own descriptions, which seems to be the standard practice.  
> 
> :-/  Except kind of for the access'.
> 
> What:           /sys/bus/cxl/devices/regionZ/accessY/read_bandwidth
>                 /sys/bus/cxl/devices/regionZ/accessY/write_banwidth
> 
> What:           /sys/bus/cxl/devices/regionZ/accessY/read_latency
>                 /sys/bus/cxl/devices/regionZ/accessY/write_latency
> 
> But I think you have a point.

It's a balance between complexity and repetition.

E.g. https://elixir.bootlin.com/linux/v6.11-rc4/source/Documentation/ABI/testing/sysfs-bus-iio#L427
is one of these files I know far too well. That would be a lot
of very boring repetition and that doc is long enough without breaking them up.

Here there are only 3 and a good bit of description differs so
probably good to split up.

Less so for bandwidth and latency cases.

Jonathan

> 
> Ira
> 
> > 
> > DJ
> >   
> 
> [snip]
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 19/25] cxl/region/extent: Expose region extent information in sysfs
  2024-08-16 14:44 ` [PATCH v3 19/25] cxl/region/extent: Expose region extent information in sysfs ira.weiny
  2024-08-19 19:05   ` Dave Jiang
@ 2024-08-23 17:19   ` Jonathan Cameron
  2024-08-28 17:44   ` Fan Ni
  2 siblings, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-23 17:19 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:27 -0500
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> Extent information can be helpful to the user to coordinate memory usage
> with the external orchestrator and FM.
> 
> Expose the details of region extents by creating the following
> sysfs entries.
> 
>         /sys/bus/cxl/devices/dax_regionX/extentX.Y
>         /sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
>         /sys/bus/cxl/devices/dax_regionX/extentX.Y/length
>         /sys/bus/cxl/devices/dax_regionX/extentX.Y/tag
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
LGTM with or without the docs split.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 22/25] cxl/region: Read existing extents on region creation
  2024-08-20  0:06   ` Dave Jiang
@ 2024-08-23 21:31     ` Ira Weiny
  0 siblings, 0 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-23 21:31 UTC (permalink / raw)
  To: Dave Jiang, ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

Dave Jiang wrote:
> 
> 
> On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > Dynamic capacity device extents may be left in an accepted state on a
> > device due to an unexpected host crash.  In this case it is expected
> > that the creation of a new region on top of a DC partition can read
> > those extents and surface them for continued use.
> > 
> > Once all endpoint decoders are part of a region and the region is being
> > realized a read of the devices extent list can reveal these previously
> > accepted extents.
> 
> Once all endpoint decoders are part of a region and the region is being
> realized, a read of the 'devices extend list' can reveal these previously
> accepted extents.
> 

Thanks, done.

[snip]

> > +
> > +/**
> > + * cxl_read_extent_list() - Read existing extents
> > + * @cxled: Endpoint decoder which is part of a region
> > + *
> > + * Issue the Get Dynamic Capacity Extent List command to the device
> > + * and add existing extents if found.
> > + */
> > +void cxl_read_extent_list(struct cxl_endpoint_decoder *cxled)
> 
> cxl_process_extend_list()? It seems to do read+validate+add. 

yea maybe.  The name of this function actually changed in my mind many
times.  In the end I went for the higher level meaning which was to read
the existing extent list.

I'll change it because I'm not convinced of any particular name.

> 
> > +{
> > +	int retry = 10;
> 
> arbitrary retry number? maybe define it?

Sure.  But it is still an arbitrary value of 10.

I'll document my justification thusly.

/*
   ...
 * A retry of 10 is somewhat arbitrary, however, extent changes should be      
 * relatively rare while bringing up a region.  So 10 should be plenty.        
 */                                                                            
#define CXL_READ_EXTENT_LIST_RETRY 10

Ira


[snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents
  2024-08-16 14:44 ` [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents ira.weiny
  2024-08-19 18:51   ` Dave Jiang
@ 2024-08-23 21:32   ` Fan Ni
  2024-08-27 12:08     ` Jonathan Cameron
  2024-08-27 13:18   ` Jonathan Cameron
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 120+ messages in thread
From: Fan Ni @ 2024-08-23 21:32 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm

On Fri, Aug 16, 2024 at 09:44:26AM -0500, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> A dynamic capacity device (DCD) sends events to signal the host for
> changes in the availability of Dynamic Capacity (DC) memory.  These
> events contain extents describing a DPA range and meta data for memory
> to be added or removed.  Events may be sent from the device at any time.
> 
> Three types of events can be signaled, Add, Release, and Force Release.
> 
> On add, the host may accept or reject the memory being offered.  If no
> region exists, or the extent is invalid, the extent should be rejected.
> Add extent events may be grouped by a 'more' bit which indicates those
> extents should be processed as a group.
> 
> On remove, the host can delay the response until the host is safely not
> using the memory.  If no region exists the release can be sent
> immediately.  The host may also release extents (or partial extents) at
> any time.  Thus the 'more' bit grouping of release events is of less
> value and can be ignored in favor of sending multiple release capacity
> responses for groups of release events.
> 
> Force removal is intended as a mechanism between the FM and the device
> and intended only when the host is unresponsive, out of sync, or
> otherwise broken.  Purposely ignore force removal events.
> 
> Regions are made up of one or more devices which may be surfacing memory
> to the host.  Once all devices in a region have surfaced an extent the
> region can expose a corresponding extent for the user to consume.
> Without interleaving a device extent forms a 1:1 relationship with the
> region extent.  Immediately surface a region extent upon getting a
> device extent.
> 
> Per the specification the device is allowed to offer or remove extents
> at any time.  However, anticipated use cases can expect extents to be
> offered, accepted, and removed in well defined chunks.
> 
> Simplify extent tracking with the following restrictions.
> 
> 	1) Flag for removal any extent which overlaps a requested
> 	   release range.
> 	2) Refuse the offer of extents which overlap already accepted
> 	   memory ranges.
> 	3) Accept again a range which has already been accepted by the
> 	   host.  (It is likely the device has an error because it
> 	   should already know that this range was accepted.  But from
> 	   the host point of view it is safe to acknowledge that
> 	   acceptance again.)
> 
> Management of the region extent devices must be synchronized with
> potential uses of the memory within the DAX layer.  Create region extent
> devices as children of the cxl_dax_region device such that the DAX
> region driver can co-drive them and synchronize with the DAX layer.
> Synchronization and management is handled in a subsequent patch.
> 
> Process DCD events and create region devices.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 

One minor change inline.

> ---
> Changes:
> [iweiny: combine this with the extent surface patches to better show the
>          lifetime extent objects in review]
> [iweiny: clean up commit message.]
> [iweiny: move extent verification of the 'read extents on region
>          creation' to this patch]
> [iweiny: Provide for a common path for extent realization between an add
> 	 event and adding existing extents.]
> [iweiny: Persist a check that an extent is within an endpoint decoder]
> [iweiny: reduce exported and non-static calls]
> [iweiny: use %par]
> 
> 	<Combined comments from the old patches which were addressed>
> 
> [Jonathan: implement the more bit with a simple algorithm which accepts
> 	   all extents it can.
> 	   Also include the response more bit to prevent payload
> 	   overflow]
> [Fan: Do not error if a contained extent is added.]
> [Jonathan: allocate ida after kzalloc]
> [iweiny: fix ida resource leak]
> [fan/djiang: remove unneeded memset]
> [djiang: fix indentation]
> [Jonathan: Fix indentation]
> [Jonathan/djbw: make tag a uuid]
> [djbw: create helper calc_hpa_range() straight away]
> [djbw: Allow for multiple cxled_extents per region_extent]
> [djbw: s/cxl_ed/cxled]
> [djbw: s/cxl_release_ed_extent/cxled_release_extent/]
> [djbw: s/reg_ext/region_extent/]
> [djbw: s/dc_extent/extent/]
> [Gregory/djbw: reject shared extents]
> [iweiny: predicate extent.c compile on CONFIG_CXL_REGION]
> ---
>  drivers/cxl/core/Makefile |   2 +-
>  drivers/cxl/core/core.h   |  13 ++
>  drivers/cxl/core/extent.c | 345 ++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/mbox.c   | 268 ++++++++++++++++++++++++++++++++++-
>  drivers/cxl/core/region.c |   6 +
>  drivers/cxl/cxl.h         |  52 ++++++-
>  drivers/cxl/cxlmem.h      |  26 ++++
>  include/linux/cxl-event.h |  32 +++++
>  tools/testing/cxl/Kbuild  |   3 +-
>  9 files changed, 743 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 9259bcc6773c..3b812515e725 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -15,4 +15,4 @@ cxl_core-y += hdm.o
>  cxl_core-y += pmu.o
>  cxl_core-y += cdat.o
>  cxl_core-$(CONFIG_TRACING) += trace.o
> -cxl_core-$(CONFIG_CXL_REGION) += region.o
> +cxl_core-$(CONFIG_CXL_REGION) += region.o extent.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 76c4153a9b2c..8dfc97b2e0a4 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -44,12 +44,24 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
>  u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
>  		   u64 dpa);
>  
> +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
> +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
>  #else
>  static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
>  				 const struct cxl_memdev *cxlmd, u64 dpa)
>  {
>  	return ULLONG_MAX;
>  }
> +static inline int cxl_add_extent(struct cxl_memdev_state *mds,
> +				   struct cxl_extent *extent)
> +{
> +	return 0;
> +}
> +static inline int cxl_rm_extent(struct cxl_memdev_state *mds,
> +				struct cxl_extent *extent)
> +{
> +	return 0;
> +}
>  static inline
>  struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
>  				     struct cxl_endpoint_decoder **cxled)
> @@ -121,5 +133,6 @@ long cxl_pci_get_latency(struct pci_dev *pdev);
>  int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
>  				       enum access_coordinate_class access);
>  bool cxl_need_node_perf_attrs_update(int nid);
> +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range);
>  
>  #endif /* __CXL_CORE_H__ */
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> new file mode 100644
> index 000000000000..34456594cdc3
> --- /dev/null
> +++ b/drivers/cxl/core/extent.c
> @@ -0,0 +1,345 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*  Copyright(c) 2024 Intel Corporation. All rights reserved. */
> +
> +#include <linux/device.h>
> +#include <cxl.h>
> +
> +#include "core.h"
> +
> +static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
> +				 struct cxled_extent *ed_extent)
> +{
> +	struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> +	struct device *dev = &cxled->cxld.dev;
> +
> +	dev_dbg(dev, "Remove extent %par (%*phC)\n", &ed_extent->dpa_range,
> +		CXL_EXTENT_TAG_LEN, ed_extent->tag);
> +	memdev_release_extent(mds, &ed_extent->dpa_range);
> +	kfree(ed_extent);
> +}
> +
> +static void free_region_extent(struct region_extent *region_extent)
> +{
> +	struct cxled_extent *ed_extent;
> +	unsigned long index;
> +
> +	/*
> +	 * Remove from each endpoint decoder the extent which backs this region
> +	 * extent
> +	 */
> +	xa_for_each(&region_extent->decoder_extents, index, ed_extent)
> +		cxled_release_extent(ed_extent->cxled, ed_extent);
> +	xa_destroy(&region_extent->decoder_extents);
> +	ida_free(&region_extent->cxlr_dax->extent_ida, region_extent->dev.id);
> +	kfree(region_extent);
> +}
> +
> +static void region_extent_release(struct device *dev)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +
> +	free_region_extent(region_extent);
> +}
> +
> +static const struct device_type region_extent_type = {
> +	.name = "extent",
> +	.release = region_extent_release,
> +};
> +
> +bool is_region_extent(struct device *dev)
> +{
> +	return dev->type == &region_extent_type;
> +}
> +EXPORT_SYMBOL_NS_GPL(is_region_extent, CXL);
> +
> +static void region_extent_unregister(void *ext)
> +{
> +	struct region_extent *region_extent = ext;
> +
> +	dev_dbg(&region_extent->dev, "DAX region rm extent HPA %par\n",
> +		&region_extent->hpa_range);
> +	device_unregister(&region_extent->dev);
> +}
> +
> +static void region_rm_extent(struct region_extent *region_extent)
> +{
> +	struct device *region_dev = region_extent->dev.parent;
> +
> +	devm_release_action(region_dev, region_extent_unregister, region_extent);
> +}
> +
> +static struct region_extent *
> +alloc_region_extent(struct cxl_dax_region *cxlr_dax, struct range *hpa_range, u8 *tag)
> +{
> +	int id;
> +
> +	struct region_extent *region_extent __free(kfree) =
> +				kzalloc(sizeof(*region_extent), GFP_KERNEL);
> +	if (!region_extent)
> +		return ERR_PTR(-ENOMEM);
> +
> +	id = ida_alloc(&cxlr_dax->extent_ida, GFP_KERNEL);
> +	if (id < 0)
> +		return ERR_PTR(-ENOMEM);
> +
> +	region_extent->hpa_range = *hpa_range;
> +	region_extent->cxlr_dax = cxlr_dax;
> +	import_uuid(&region_extent->tag, tag);
> +	region_extent->dev.id = id;
> +	xa_init(&region_extent->decoder_extents);
> +	return no_free_ptr(region_extent);
> +}
> +
> +static int online_region_extent(struct region_extent *region_extent)
> +{
> +	struct cxl_dax_region *cxlr_dax = region_extent->cxlr_dax;
> +	struct device *dev;
> +	int rc;
> +
> +	dev = &region_extent->dev;
> +	device_initialize(dev);
> +	device_set_pm_not_required(dev);
> +	dev->parent = &cxlr_dax->dev;
> +	dev->type = &region_extent_type;
> +	rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id, dev->id);
> +	if (rc)
> +		goto err;
> +
> +	rc = device_add(dev);
> +	if (rc)
> +		goto err;
> +
> +	dev_dbg(dev, "region extent HPA %par\n", &region_extent->hpa_range);
> +	return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
> +					region_extent);
> +
> +err:
> +	dev_err(&cxlr_dax->dev, "Failed to initialize region extent HPA %par\n",
> +		&region_extent->hpa_range);
> +
> +	put_device(dev);
> +	return rc;
> +}
> +
> +struct match_data {
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range *new_range;
> +};
> +
> +static int match_contains(struct device *dev, void *data)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	struct match_data *md = data;
> +	struct cxled_extent *entry;
> +	unsigned long index;
> +
> +	if (!region_extent)
> +		return 0;
> +
> +	xa_for_each(&region_extent->decoder_extents, index, entry) {
> +		if (md->cxled == entry->cxled &&
> +		    range_contains(&entry->dpa_range, md->new_range))
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static bool extents_contain(struct cxl_dax_region *cxlr_dax,
> +			    struct cxl_endpoint_decoder *cxled,
> +			    struct range *new_range)
> +{
> +	struct device *extent_device;
> +	struct match_data md = {
> +		.cxled = cxled,
> +		.new_range = new_range,
> +	};
> +
> +	extent_device = device_find_child(&cxlr_dax->dev, &md, match_contains);
> +	if (!extent_device)
> +		return false;
> +
> +	put_device(extent_device);
> +	return true;
> +}
> +
> +static int match_overlaps(struct device *dev, void *data)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	struct match_data *md = data;
> +	struct cxled_extent *entry;
> +	unsigned long index;
> +
> +	if (!region_extent)
> +		return 0;
> +
> +	xa_for_each(&region_extent->decoder_extents, index, entry) {
> +		if (md->cxled == entry->cxled &&
> +		    range_overlaps(&entry->dpa_range, md->new_range))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static bool extents_overlap(struct cxl_dax_region *cxlr_dax,
> +			    struct cxl_endpoint_decoder *cxled,
> +			    struct range *new_range)
> +{
> +	struct device *extent_device;
> +	struct match_data md = {
> +		.cxled = cxled,
> +		.new_range = new_range,
> +	};
> +
> +	extent_device = device_find_child(&cxlr_dax->dev, &md, match_overlaps);
> +	if (!extent_device)
> +		return false;
> +
> +	put_device(extent_device);
> +	return true;
> +}
> +
> +static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
> +			   struct cxl_dax_region *cxlr_dax,
> +			   struct range *dpa_range,
> +			   struct range *hpa_range)
> +{
> +	resource_size_t dpa_offset, hpa;
> +
> +	dpa_offset = dpa_range->start - cxled->dpa_res->start;
> +	hpa = cxled->cxld.hpa_range.start + dpa_offset;
> +
> +	hpa_range->start = hpa - cxlr_dax->hpa_range.start;
> +	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
> +}
> +
> +static int cxlr_rm_extent(struct device *dev, void *data)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	struct range *region_hpa_range = data;
> +
> +	if (!region_extent)
> +		return 0;
> +
> +	/*
> +	 * Any extent which 'touches' the released range is removed.
> +	 */
> +	if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) {
> +		dev_dbg(dev, "Remove region extent HPA %par\n",
> +			&region_extent->hpa_range);
> +		region_rm_extent(region_extent);
> +	}
> +	return 0;
> +}
> +
> +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
> +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range hpa_range, dpa_range;
> +	struct cxl_region *cxlr;
> +
> +	dpa_range = (struct range) {
> +		.start = start_dpa,
> +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> +	};
> +
> +	guard(rwsem_read)(&cxl_region_rwsem);
> +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> +	if (!cxlr) {
> +		memdev_release_extent(mds, &dpa_range);
> +		return -ENXIO;
> +	}
> +
> +	calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
> +
> +	/* Remove region extents which overlap */
> +	return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
> +				     cxlr_rm_extent);
> +}
> +
> +static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
> +			   struct cxl_endpoint_decoder *cxled,
> +			   struct cxled_extent *ed_extent)
> +{
> +	struct region_extent *region_extent;
> +	struct range hpa_range;
> +	int rc;
> +
> +	calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range);
> +
> +	region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag);
> +	if (IS_ERR(region_extent))
> +		return PTR_ERR(region_extent);
> +
> +	rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent, ed_extent,
> +		       GFP_KERNEL);
> +	if (rc) {
> +		free_region_extent(region_extent);
> +		return rc;
> +	}
> +
> +	/* device model handles freeing region_extent */
> +	return online_region_extent(region_extent);
> +}
> +
> +/* Callers are expected to ensure cxled has been attached to a region */
> +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
> +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range ed_range, ext_range;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct cxled_extent *ed_extent;
> +	struct cxl_region *cxlr;
> +	struct device *dev;
> +
> +	ext_range = (struct range) {
> +		.start = start_dpa,
> +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> +	};
> +
> +	guard(rwsem_read)(&cxl_region_rwsem);
> +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> +	if (!cxlr)
> +		return -ENXIO;
> +
> +	cxlr_dax = cxled->cxld.region->cxlr_dax;
> +	dev = &cxled->cxld.dev;
> +	ed_range = (struct range) {
> +		.start = cxled->dpa_res->start,
> +		.end = cxled->dpa_res->end,
> +	};
> +
> +	dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n",
> +		cxled->dpa_res, &ext_range);
> +
> +	if (!range_contains(&ed_range, &ext_range)) {
> +		dev_err_ratelimited(dev,
> +				    "DC extent DPA %par (%*phC) is not fully in ED %par\n",
> +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> +				    extent->tag, &ed_range);
> +		return -ENXIO;
> +	}
> +
> +	if (extents_contain(cxlr_dax, cxled, &ext_range))
> +		return 0;
> +
> +	if (extents_overlap(cxlr_dax, cxled, &ext_range))
> +		return -ENXIO;
> +
> +	ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL);
> +	if (!ed_extent)
> +		return -ENOMEM;
> +
> +	ed_extent->cxled = cxled;
> +	ed_extent->dpa_range = ext_range;
> +	memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN);
> +
> +	dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range,
> +		CXL_EXTENT_TAG_LEN, ed_extent->tag);
> +
> +	return cxlr_add_extent(cxlr_dax, cxled, ed_extent);
> +}
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 01a447aaa1b1..f629ad7488ac 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>  
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> +			       struct cxl_extent *extent)
> +{
> +	u64 start = le64_to_cpu(extent->start_dpa);
> +	u64 length = le64_to_cpu(extent->length);
> +	struct device *dev = mds->cxlds.dev;
> +
> +	struct range ext_range = (struct range){
> +		.start = start,
> +		.end = start + length - 1,
> +	};
> +
> +	if (le16_to_cpu(extent->shared_extn_seq) != 0) {
> +		dev_err_ratelimited(dev,
> +				    "DC extent DPA %par (%*phC) can not be shared\n",
> +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> +				    extent->tag);
> +		return -ENXIO;
> +	}
> +
> +	/* Extents must not cross DC region boundary's */
> +	for (int i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +		struct range region_range = (struct range) {
> +			.start = dcr->base,
> +			.end = dcr->base + dcr->decode_len - 1,
> +		};
> +
> +		if (range_contains(&region_range, &ext_range)) {
> +			dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n",
> +				&ext_range, i, start - dcr->base,
> +				CXL_EXTENT_TAG_LEN, extent->tag);
> +			return 0;
> +		}
> +	}
> +
> +	dev_err_ratelimited(dev,
> +			    "DC extent DPA %par (%*phC) is not in any DC region\n",
> +			    &ext_range, CXL_EXTENT_TAG_LEN, extent->tag);
> +	return -ENXIO;
> +}
> +
>  void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  			    enum cxl_event_log_type type,
>  			    enum cxl_event_type event_type,
> @@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
>  	return rc;
>  }
>  
> +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> +				struct xarray *extent_array, int cnt)
> +{
> +	struct cxl_mbox_dc_response *p;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	struct cxl_extent *extent;
> +	unsigned long index;
> +	u32 pl_index;
> +	int rc = 0;
> +
> +	size_t pl_size = struct_size(p, extent_list, cnt);
> +	u32 max_extents = cnt;
> +
> +	/* May have to use more bit on response. */
> +	if (pl_size > mds->payload_size) {
> +		max_extents = (mds->payload_size - sizeof(*p)) /
> +			      sizeof(struct updated_extent_list);
> +		pl_size = struct_size(p, extent_list, max_extents);
> +	}
> +
> +	struct cxl_mbox_dc_response *response __free(kfree) =
> +						kzalloc(pl_size, GFP_KERNEL);
> +	if (!response)
> +		return -ENOMEM;
> +
> +	pl_index = 0;
> +	xa_for_each(extent_array, index, extent) {
> +
> +		response->extent_list[pl_index].dpa_start = extent->start_dpa;
> +		response->extent_list[pl_index].length = extent->length;
> +		pl_index++;
> +		response->extent_list_size = cpu_to_le32(pl_index);
> +
> +		if (pl_index == max_extents) {
> +			mbox_cmd = (struct cxl_mbox_cmd) {
> +				.opcode = opcode,
> +				.size_in = struct_size(response, extent_list,
> +						       pl_index),
> +				.payload_in = response,
> +			};
> +
> +			response->flags = 0;
> +			if (pl_index < cnt)
> +				response->flags &= CXL_DCD_EVENT_MORE;
> +
> +			rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +			if (rc)
> +				return rc;
> +			pl_index = 0;
> +		}
> +	}
> +
> +	if (pl_index) {
> +		mbox_cmd = (struct cxl_mbox_cmd) {
> +			.opcode = opcode,
> +			.size_in = struct_size(response, extent_list,
> +					       pl_index),
> +			.payload_in = response,
> +		};
> +
> +		response->flags = 0;
> +		rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	}
> +
> +	return rc;
> +}
> +
> +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct xarray extent_list;
> +
> +	struct cxl_extent extent = {
> +		.start_dpa = cpu_to_le64(range->start),
> +		.length = cpu_to_le64(range_len(range)),
> +	};
> +
> +	dev_dbg(dev, "Release response dpa %par\n", range);
> +
> +	xa_init(&extent_list);
> +	if (xa_insert(&extent_list, 0, &extent, GFP_KERNEL)) {
> +		dev_dbg(dev, "Failed to release %par\n", range);
> +		goto destroy;
> +	}
> +
> +	if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))
> +		dev_dbg(dev, "Failed to release %par\n", range);
> +
> +destroy:
> +	xa_destroy(&extent_list);
> +}
> +
> +static int validate_add_extent(struct cxl_memdev_state *mds,
> +			       struct cxl_extent *extent)
> +{
> +	int rc;
> +
> +	rc = cxl_validate_extent(mds, extent);
> +	if (rc)
> +		return rc;
> +
> +	return cxl_add_extent(mds, extent);
> +}
> +
> +static int cxl_add_pending(struct cxl_memdev_state *mds)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_extent *extent;
> +	unsigned long index;
> +	unsigned long cnt = 0;
> +	int rc;
> +
> +	xa_for_each(&mds->pending_extents, index, extent) {
> +		if (validate_add_extent(mds, extent)) {
> +			dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n",
> +				le64_to_cpu(extent->start_dpa),
> +				le64_to_cpu(extent->length));
> +			xa_erase(&mds->pending_extents, index);
> +			kfree(extent);
> +			continue;
> +		}
> +		cnt++;
> +	}
> +	rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> +				  &mds->pending_extents, cnt);
> +	xa_for_each(&mds->pending_extents, index, extent) {
> +		xa_erase(&mds->pending_extents, index);
> +		kfree(extent);
> +	}
> +	return rc;
> +}
> +
> +static int handle_add_event(struct cxl_memdev_state *mds,
> +			    struct cxl_event_dcd *event)
> +{
> +	struct cxl_extent *tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
> +	struct device *dev = mds->cxlds.dev;
> +
> +	if (!tmp)
> +		return -ENOMEM;
> +
> +	memcpy(tmp, &event->extent, sizeof(*tmp));
> +	if (xa_insert(&mds->pending_extents, (unsigned long)tmp, tmp,
> +		      GFP_KERNEL)) {
> +		kfree(tmp);
> +		return -ENOMEM;
> +	}
> +
> +	if (event->flags & CXL_DCD_EVENT_MORE) {
> +		dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
> +		return 0;
> +	}
> +
> +	/* extents are removed and free'ed in cxl_add_pending() */
> +	return cxl_add_pending(mds);
> +}
> +
> +static char *cxl_dcd_evt_type_str(u8 type)
> +{
> +	switch (type) {
> +	case DCD_ADD_CAPACITY:
> +		return "add";
> +	case DCD_RELEASE_CAPACITY:
> +		return "release";
> +	case DCD_FORCED_CAPACITY_RELEASE:
> +		return "force release";
> +	default:
> +		break;
> +	}
> +
> +	return "<unknown>";
> +}
> +
> +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> +					struct cxl_event_record_raw *raw_rec)
> +{
> +	struct cxl_event_dcd *event = &raw_rec->event.dcd;
> +	struct cxl_extent *extent = &event->extent;
> +	struct device *dev = mds->cxlds.dev;
> +	uuid_t *id = &raw_rec->id;
> +
> +	if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID))
> +		return -EINVAL;
> +
> +	dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n",
> +		cxl_dcd_evt_type_str(event->event_type),
> +		le64_to_cpu(extent->start_dpa), le64_to_cpu(extent->length));
> +
> +	switch (event->event_type) {
> +	case DCD_ADD_CAPACITY:
> +		return handle_add_event(mds, event);
> +	case DCD_RELEASE_CAPACITY:
> +		return cxl_rm_extent(mds, &event->extent);
> +	case DCD_FORCED_CAPACITY_RELEASE:
> +		dev_err_ratelimited(dev, "Forced release event ignored.\n");
> +		return 0;
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
>  static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>  				    enum cxl_event_log_type type)
>  {
> @@ -1044,9 +1287,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>  		if (!nr_rec)
>  			break;
>  
> -		for (i = 0; i < nr_rec; i++)
> +		for (i = 0; i < nr_rec; i++) {
>  			__cxl_event_trace_record(cxlmd, type,
>  						 &payload->records[i]);
> +			if (type == CXL_EVENT_TYPE_DCD) {
> +				rc = cxl_handle_dcd_event_records(mds,
> +								  &payload->records[i]);
> +				if (rc)
> +					dev_err_ratelimited(dev, "dcd event failed: %d\n",
> +							    rc);
> +			}
> +		}
>  
>  		if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
>  			trace_cxl_overflow(cxlmd, type, payload);
> @@ -1078,6 +1329,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
>  {
>  	dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status);
>  
> +	if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD))
> +		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
>  	if (status & CXLDEV_EVENT_STATUS_FATAL)
>  		cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_FATAL);
>  	if (status & CXLDEV_EVENT_STATUS_FAIL)
> @@ -1610,6 +1863,17 @@ int cxl_poison_state_init(struct cxl_memdev_state *mds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_poison_state_init, CXL);
>  
> +static void clear_pending_extents(void *_mds)
> +{
> +	struct cxl_memdev_state *mds = _mds;
> +	struct cxl_extent *extent;
> +	unsigned long index;
> +
> +	xa_for_each(&mds->pending_extents, index, extent)
> +		kfree(extent);
> +	xa_destroy(&mds->pending_extents);
> +}
> +
>  struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>  {
>  	struct cxl_memdev_state *mds;
> @@ -1628,6 +1892,8 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>  	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
>  	mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID;
>  	mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID;
> +	xa_init(&mds->pending_extents);
> +	devm_add_action_or_reset(dev, clear_pending_extents, mds);
>  
>  	return mds;
>  }
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 8e0884b52f84..8c9171f914fb 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3037,6 +3037,7 @@ static void cxl_dax_region_release(struct device *dev)
>  {
>  	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
>  
> +	ida_destroy(&cxlr_dax->extent_ida);
>  	kfree(cxlr_dax);
>  }
>  
> @@ -3090,6 +3091,8 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
>  
>  	dev = &cxlr_dax->dev;
>  	cxlr_dax->cxlr = cxlr;
> +	cxlr->cxlr_dax = cxlr_dax;
> +	ida_init(&cxlr_dax->extent_ida);
>  	device_initialize(dev);
>  	lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
>  	device_set_pm_not_required(dev);
> @@ -3190,7 +3193,10 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
>  static void cxlr_dax_unregister(void *_cxlr_dax)
>  {
>  	struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> +	struct cxl_region *cxlr = cxlr_dax->cxlr;
>  
> +	cxlr->cxlr_dax = NULL;
> +	cxlr_dax->cxlr = NULL;
>  	device_unregister(&cxlr_dax->dev);
>  }
>  
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 16861c867537..c858e3957fd5 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -11,6 +11,7 @@
>  #include <linux/log2.h>
>  #include <linux/node.h>
>  #include <linux/io.h>
> +#include <linux/cxl-event.h>
>  
>  extern const struct nvdimm_security_ops *cxl_security_ops;
>  
> @@ -169,11 +170,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
>  #define CXLDEV_EVENT_STATUS_WARN		BIT(1)
>  #define CXLDEV_EVENT_STATUS_FAIL		BIT(2)
>  #define CXLDEV_EVENT_STATUS_FATAL		BIT(3)
> +#define CXLDEV_EVENT_STATUS_DCD			BIT(4)
>  
>  #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO |	\
>  				 CXLDEV_EVENT_STATUS_WARN |	\
>  				 CXLDEV_EVENT_STATUS_FAIL |	\
> -				 CXLDEV_EVENT_STATUS_FATAL)
> +				 CXLDEV_EVENT_STATUS_FATAL |	\
> +				 CXLDEV_EVENT_STATUS_DCD)
>  
>  /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
>  #define CXLDEV_EVENT_INT_MODE_MASK	GENMASK(1, 0)
> @@ -444,6 +447,18 @@ enum cxl_decoder_state {
>  	CXL_DECODER_STATE_AUTO,
>  };
>  
> +/**
> + * struct cxled_extent - Extent within an endpoint decoder
> + * @cxled: Reference to the endpoint decoder
> + * @dpa_range: DPA range this extent covers within the decoder
> + * @tag: Tag from device for this extent
> + */
> +struct cxled_extent {
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range dpa_range;
> +	u8 tag[CXL_EXTENT_TAG_LEN];
> +};
> +
>  /**
>   * struct cxl_endpoint_decoder - Endpoint  / SPA to DPA decoder
>   * @cxld: base cxl_decoder_object
> @@ -569,6 +584,7 @@ struct cxl_region_params {
>   * @type: Endpoint decoder target type
>   * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
>   * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> + * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
>   * @flags: Region state flags
>   * @params: active + config params for the region
>   * @coord: QoS access coordinates for the region
> @@ -582,6 +598,7 @@ struct cxl_region {
>  	enum cxl_decoder_type type;
>  	struct cxl_nvdimm_bridge *cxl_nvb;
>  	struct cxl_pmem_region *cxlr_pmem;
> +	struct cxl_dax_region *cxlr_dax;
>  	unsigned long flags;
>  	struct cxl_region_params params;
>  	struct access_coordinate coord[ACCESS_COORDINATE_MAX];
> @@ -622,12 +639,45 @@ struct cxl_pmem_region {
>  	struct cxl_pmem_region_mapping mapping[];
>  };
>  
> +/* See CXL 3.0 8.2.9.2.1.5 */

Update the reference to reflect CXL 3.1.

Fan

> +enum dc_event {
> +	DCD_ADD_CAPACITY,
> +	DCD_RELEASE_CAPACITY,
> +	DCD_FORCED_CAPACITY_RELEASE,
> +	DCD_REGION_CONFIGURATION_UPDATED,
> +};
> +
>  struct cxl_dax_region {
>  	struct device dev;
>  	struct cxl_region *cxlr;
>  	struct range hpa_range;
> +	struct ida extent_ida;
>  };
>  
> +/**
> + * struct region_extent - CXL DAX region extent
> + * @dev: device representing this extent
> + * @cxlr_dax: back reference to parent region device
> + * @hpa_range: HPA range of this extent
> + * @tag: tag of the extent
> + * @decoder_extents: Endpoint decoder extents which make up this region extent
> + */
> +struct region_extent {
> +	struct device dev;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct range hpa_range;
> +	uuid_t tag;
> +	struct xarray decoder_extents;
> +};
> +
> +bool is_region_extent(struct device *dev);
> +static inline struct region_extent *to_region_extent(struct device *dev)
> +{
> +	if (!is_region_extent(dev))
> +		return NULL;
> +	return container_of(dev, struct region_extent, dev);
> +}
> +
>  /**
>   * struct cxl_port - logical collection of upstream port devices and
>   *		     downstream port devices to construct a CXL memory
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index d41bec5433db..3a40fe1f0be7 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -497,6 +497,7 @@ struct cxl_dc_region_info {
>   * @pmem_perf: performance data entry matched to PMEM partition
>   * @nr_dc_region: number of DC regions implemented in the memory device
>   * @dc_region: array containing info about the DC regions
> + * @pending_extents: array of extents pending during more bit processing
>   * @event: event log driver state
>   * @poison: poison driver state info
>   * @security: security driver state info
> @@ -532,6 +533,7 @@ struct cxl_memdev_state {
>  
>  	u8 nr_dc_region;
>  	struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> +	struct xarray pending_extents;
>  
>  	struct cxl_event_state event;
>  	struct cxl_poison_state poison;
> @@ -607,6 +609,21 @@ enum cxl_opcode {
>  	UUID_INIT(0x5e1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19,     \
>  		  0x40, 0x3d, 0x86)
>  
> +/*
> + * Add Dynamic Capacity Response
> + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
> + */
> +struct cxl_mbox_dc_response {
> +	__le32 extent_list_size;
> +	u8 flags;
> +	u8 reserved[3];
> +	struct updated_extent_list {
> +		__le64 dpa_start;
> +		__le64 length;
> +		u8 reserved[8];
> +	} __packed extent_list[];
> +} __packed;
> +
>  struct cxl_mbox_get_supported_logs {
>  	__le16 entries;
>  	u8 rsvd[6];
> @@ -669,6 +686,14 @@ struct cxl_mbox_identify {
>  	UUID_INIT(0xfe927475, 0xdd59, 0x4339, 0xa5, 0x86, 0x79, 0xba, 0xb1, \
>  		  0x13, 0xb7, 0x74)
>  
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.1 section 8.2.9.2.1; Table 8-43
> + */
> +#define CXL_EVENT_DC_EVENT_UUID                                             \
> +	UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \
> +		  0x10, 0x1a, 0x2a)
> +
>  /*
>   * Get Event Records output payload
>   * CXL rev 3.0 section 8.2.9.2.2; Table 8-50
> @@ -694,6 +719,7 @@ enum cxl_event_log_type {
>  	CXL_EVENT_TYPE_WARN,
>  	CXL_EVENT_TYPE_FAIL,
>  	CXL_EVENT_TYPE_FATAL,
> +	CXL_EVENT_TYPE_DCD,
>  	CXL_EVENT_TYPE_MAX
>  };
>  
> diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h
> index 0bea1afbd747..eeda8059d81a 100644
> --- a/include/linux/cxl-event.h
> +++ b/include/linux/cxl-event.h
> @@ -96,11 +96,43 @@ struct cxl_event_mem_module {
>  	u8 reserved[0x3d];
>  } __packed;
>  
> +/*
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
> + */
> +#define CXL_EXTENT_TAG_LEN 0x10
> +struct cxl_extent {
> +	__le64 start_dpa;
> +	__le64 length;
> +	u8 tag[CXL_EXTENT_TAG_LEN];
> +	__le16 shared_extn_seq;
> +	u8 reserved[0x6];
> +} __packed;
> +
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
> + */
> +#define CXL_DCD_EVENT_MORE			BIT(0)
> +struct cxl_event_dcd {
> +	struct cxl_event_record_hdr hdr;
> +	u8 event_type;
> +	u8 validity_flags;
> +	__le16 host_id;
> +	u8 region_index;
> +	u8 flags;
> +	u8 reserved1[0x2];
> +	struct cxl_extent extent;
> +	u8 reserved2[0x18];
> +	__le32 num_avail_extents;
> +	__le32 num_avail_tags;
> +} __packed;
> +
>  union cxl_event {
>  	struct cxl_event_generic generic;
>  	struct cxl_event_gen_media gen_media;
>  	struct cxl_event_dram dram;
>  	struct cxl_event_mem_module mem_module;
> +	struct cxl_event_dcd dcd;
>  	/* dram & gen_media event header */
>  	struct cxl_event_media_hdr media_hdr;
>  } __packed;
> diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> index 030b388800f0..8238588fffdf 100644
> --- a/tools/testing/cxl/Kbuild
> +++ b/tools/testing/cxl/Kbuild
> @@ -61,7 +61,8 @@ cxl_core-y += $(CXL_CORE_SRC)/hdm.o
>  cxl_core-y += $(CXL_CORE_SRC)/pmu.o
>  cxl_core-y += $(CXL_CORE_SRC)/cdat.o
>  cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
> -cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o
> +cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o \
> +				 $(CXL_CORE_SRC)/extent.o
>  cxl_core-y += config_check.o
>  cxl_core-y += cxl_core_test.o
>  cxl_core-y += cxl_core_exports.o
> 
> -- 
> 2.45.2
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-22 17:53     ` Ira Weiny
  2024-08-22 18:10       ` Andy Shevchenko
@ 2024-08-26 13:17       ` Petr Mladek
  2024-08-26 13:24         ` Andy Shevchenko
  1 sibling, 1 reply; 120+ messages in thread
From: Petr Mladek @ 2024-08-26 13:17 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Thu 2024-08-22 12:53:32, Ira Weiny wrote:
> Petr Mladek wrote:
> > On Fri 2024-08-16 09:44:10, Ira Weiny wrote:
> > > The use of struct range in the CXL subsystem is growing.  In particular,
> > > the addition of Dynamic Capacity devices uses struct range in a number
> > > of places which are reported in debug and error messages.
> > > 
> > > To wit requiring the printing of the start/end fields in each print
> > > became cumbersome.  Dan Williams mentions in [1] that it might be time
> > > to have a print specifier for struct range similar to struct resource
> > > 
> > > A few alternatives were considered including '%pn' for 'print raNge' but
> > > %par follows that struct range is most often used to store a range of
> > > physical addresses.  So use '%par' for 'print address range'.
> > > 
> > > diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> > > index 2d71b1115916..c132178fac07 100644
> > > --- a/lib/vsprintf.c
> > > +++ b/lib/vsprintf.c
> > > @@ -1140,6 +1140,39 @@ char *resource_string(char *buf, char *end, struct resource *res,
> > >  	return string_nocheck(buf, end, sym, spec);
> > >  }
> > >  
> > > +static noinline_for_stack
> > > +char *range_string(char *buf, char *end, const struct range *range,
> > > +		      struct printf_spec spec, const char *fmt)
> > > +{
> > > +#define RANGE_PRINTK_SIZE		16
> > > +#define RANGE_DECODED_BUF_SIZE		((2 * sizeof(struct range)) + 4)
> > > +#define RANGE_PRINT_BUF_SIZE		sizeof("[range - ]")

[...]

> > > +	static const struct printf_spec range_spec = {
> > > +		.base = 16,
> > > +		.field_width = RANGE_PRINTK_SIZE,
> 
> However, my testing indicates this needs to be.
> 
>                 .field_width = 18, /* 2 (0x) + 2 * 8 (bytes) */

Makes sense. Great catch!

> ... to properly zero pad the value.  Does that make sense?
>
> > > +		.precision = -1,
> > > +		.flags = SPECIAL | SMALL | ZEROPAD,
> > > +	};
> > > +
> > > +	*p++ = '[';
> > > +	p = string_nocheck(p, pend, "range ", str_spec);
> > > +	p = number(p, pend, range->start, range_spec);
> > > +	*p++ = '-';
> > > +	p = number(p, pend, range->end, range_spec);
> > > +	*p++ = ']';
> > > +	*p = '\0';

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-22 18:10       ` Andy Shevchenko
@ 2024-08-26 13:23         ` Petr Mladek
  2024-08-26 17:23           ` Andy Shevchenko
  0 siblings, 1 reply; 120+ messages in thread
From: Petr Mladek @ 2024-08-26 13:23 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Steven Rostedt,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Thu 2024-08-22 21:10:25, Andy Shevchenko wrote:
> On Thu, Aug 22, 2024 at 12:53:32PM -0500, Ira Weiny wrote:
> > Petr Mladek wrote:
> > > On Fri 2024-08-16 09:44:10, Ira Weiny wrote:
> 
> ...
> 
> > > > +	%par	[range 0x60000000-0x6fffffff] or
> > > 
> > > It seems that it is always 64-bit. It prints:
> > > 
> > > struct range {
> > > 	u64   start;
> > > 	u64   end;
> > > };
> > 
> > Indeed.  Thanks I should not have just copied/pasted.
> 
> With that said, I'm not sure the %pa is a good placeholder for this ('a' stands
> to "address" AFAIU). Perhaps this should go somewhere under %pr/%pR?

The r/R in %pr/%pR actually stands for "resource".

But "%ra" really looks like a better choice than "%par". Both
"resource"  and "range" starts with 'r'. Also the struct resource
is printed as a range of values.

> > > > +		[range 0x0000000060000000-0x000000006fffffff]
> > > > +
> > > > +For printing struct range.  A variation of printing a physical address is to
> > > > +print the value of struct range which are often used to hold a physical address
> > > > +range.
> > > > +
> > > > +Passed by reference.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-26 13:17       ` Petr Mladek
@ 2024-08-26 13:24         ` Andy Shevchenko
  0 siblings, 0 replies; 120+ messages in thread
From: Andy Shevchenko @ 2024-08-26 13:24 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Steven Rostedt,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Mon, Aug 26, 2024 at 03:17:26PM +0200, Petr Mladek wrote:
> On Thu 2024-08-22 12:53:32, Ira Weiny wrote:
> > Petr Mladek wrote:
> > > On Fri 2024-08-16 09:44:10, Ira Weiny wrote:

[...]

> > > > +	static const struct printf_spec range_spec = {
> > > > +		.base = 16,
> > > > +		.field_width = RANGE_PRINTK_SIZE,
> > 
> > However, my testing indicates this needs to be.
> > 
> >                 .field_width = 18, /* 2 (0x) + 2 * 8 (bytes) */
> 
> Makes sense. Great catch!

Which effectively means usage of special_hex_number().
But again, consider to unite this with %pR/r implementation(s).

> > ... to properly zero pad the value.  Does that make sense?
> >
> > > > +		.precision = -1,
> > > > +		.flags = SPECIAL | SMALL | ZEROPAD,
> > > > +	};

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-26 13:23         ` Petr Mladek
@ 2024-08-26 17:23           ` Andy Shevchenko
  2024-08-26 21:17             ` Ira Weiny
  0 siblings, 1 reply; 120+ messages in thread
From: Andy Shevchenko @ 2024-08-26 17:23 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Steven Rostedt,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Mon, Aug 26, 2024 at 03:23:50PM +0200, Petr Mladek wrote:
> On Thu 2024-08-22 21:10:25, Andy Shevchenko wrote:
> > On Thu, Aug 22, 2024 at 12:53:32PM -0500, Ira Weiny wrote:
> > > Petr Mladek wrote:
> > > > On Fri 2024-08-16 09:44:10, Ira Weiny wrote:

...

> > > > > +	%par	[range 0x60000000-0x6fffffff] or
> > > > 
> > > > It seems that it is always 64-bit. It prints:
> > > > 
> > > > struct range {
> > > > 	u64   start;
> > > > 	u64   end;
> > > > };
> > > 
> > > Indeed.  Thanks I should not have just copied/pasted.
> > 
> > With that said, I'm not sure the %pa is a good placeholder for this ('a' stands
> > to "address" AFAIU). Perhaps this should go somewhere under %pr/%pR?
> 
> The r/R in %pr/%pR actually stands for "resource".
> 
> But "%ra" really looks like a better choice than "%par". Both
> "resource"  and "range" starts with 'r'. Also the struct resource
> is printed as a range of values.

Fine with me as long as it:
1) doesn't collide with %pa namespace
2) tries to deduplicate existing code as much as possible.

> > > > > +		[range 0x0000000060000000-0x000000006fffffff]
> > > > > +
> > > > > +For printing struct range.  A variation of printing a physical address is to
> > > > > +print the value of struct range which are often used to hold a physical address
> > > > > +range.
> > > > > +
> > > > > +Passed by reference.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 23/25] cxl/mem: Trace Dynamic capacity Event Record
  2024-08-20 22:54   ` Dave Jiang
@ 2024-08-26 18:02     ` Ira Weiny
  0 siblings, 0 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-26 18:02 UTC (permalink / raw)
  To: Dave Jiang, ira.weiny, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

Dave Jiang wrote:
> 
> 
> On 8/16/24 7:44 AM, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Records.
> > User space can use trace events for debugging of DC capacity changes.
> > 
> > Add DC trace points to the trace log.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> small nit below
> 
> > 
> > ---
> > Changes:
> > [Alison: Update commit message]
> > ---
> >  drivers/cxl/core/mbox.c  |  4 +++
> >  drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 69 insertions(+)
> > 
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index d43ac8eabf56..8202fc6c111d 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -977,6 +977,10 @@ static void __cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> >  		ev_type = CXL_CPER_EVENT_DRAM;
> >  	else if (uuid_equal(uuid, &CXL_EVENT_MEM_MODULE_UUID))
> >  		ev_type = CXL_CPER_EVENT_MEM_MODULE;
> > +	else if (uuid_equal(uuid, &CXL_EVENT_DC_EVENT_UUID)) {
> > +		trace_cxl_dynamic_capacity(cxlmd, type, &record->event.dcd);
> > +		return;
> > +	}
> >  
> >  	cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
> >  }
> > diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> > index 9167cfba7f59..a3a5269311ee 100644
> > --- a/drivers/cxl/core/trace.h
> > +++ b/drivers/cxl/core/trace.h
> > @@ -731,6 +731,71 @@ TRACE_EVENT(cxl_poison,
> >  	)
> >  );
> >  
> > +/*
> > + * DYNAMIC CAPACITY Event Record - DER
> > + *
> > + * CXL rev 3.0 section 8.2.9.2.1.5 Table 8-47
> 
> Should we just use 3.1 since it's the latest?

Yep done.
Ira

[snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-26 17:23           ` Andy Shevchenko
@ 2024-08-26 21:17             ` Ira Weiny
  2024-08-27  7:43               ` Petr Mladek
  2024-08-27 13:17               ` Andy Shevchenko
  0 siblings, 2 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-26 21:17 UTC (permalink / raw)
  To: Andy Shevchenko, Petr Mladek
  Cc: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Steven Rostedt,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

Andy Shevchenko wrote:
> On Mon, Aug 26, 2024 at 03:23:50PM +0200, Petr Mladek wrote:
> > On Thu 2024-08-22 21:10:25, Andy Shevchenko wrote:
> > > On Thu, Aug 22, 2024 at 12:53:32PM -0500, Ira Weiny wrote:
> > > > Petr Mladek wrote:
> > > > > On Fri 2024-08-16 09:44:10, Ira Weiny wrote:
> 
> ...
> 
> > > > > > +	%par	[range 0x60000000-0x6fffffff] or
> > > > > 
> > > > > It seems that it is always 64-bit. It prints:
> > > > > 
> > > > > struct range {
> > > > > 	u64   start;
> > > > > 	u64   end;
> > > > > };
> > > > 
> > > > Indeed.  Thanks I should not have just copied/pasted.
> > > 
> > > With that said, I'm not sure the %pa is a good placeholder for this ('a' stands
> > > to "address" AFAIU). Perhaps this should go somewhere under %pr/%pR?

I'm speaking a bit for Dan here but also the logical way I thought of
things.

1) %p does not dictate anything about the format of the data.  Rather
   indicates that what is passed is a pointer.  Because we are passing a
   pointer to a range struct %pXX makes sense.
2) %pa indicates what follows is 'address'.  This was a bit of creative
   license because, as I said in the commit message most of the time
   struct range contains an address range.  So for this narrow use case it
   also makes sense.
3) %par r for range.

%p[rR] is taken.  %pra confuses things IMO.

> > 
> > The r/R in %pr/%pR actually stands for "resource".
> > 
> > But "%ra" really looks like a better choice than "%par". Both
> > "resource"  and "range" starts with 'r'. Also the struct resource
> > is printed as a range of values.

%r could be used I think.  But this breaks with the convention of passing a
pointer and how to interpret it.  The other idea I had, mentioned in the commit
message was %pn.  Meaning passed by pointer 'raNge'.

I think that follows better than %r.  That would be another break from C99.
But we don't have to follow that.

> 
> Fine with me as long as it:
> 1) doesn't collide with %pa namespace
> 2) tries to deduplicate existing code as much as possible.

Andy, I'm not quite following how you expect to share the code between
resource_string() and range_string()?

There is very little duplicated code.  In fact with Petr's suggestions and some
more work range_string() is quite simple:

+static noinline_for_stack
+char *range_string(char *buf, char *end, const struct range *range,
+                     struct printf_spec spec, const char *fmt)
+{
+#define RANGE_DECODED_BUF_SIZE         ((2 * sizeof(struct range)) + 4)
+#define RANGE_PRINT_BUF_SIZE           sizeof("[range -]")
+       char sym[RANGE_DECODED_BUF_SIZE + RANGE_PRINT_BUF_SIZE];
+       char *p = sym, *pend = sym + sizeof(sym);
+
+       *p++ = '[';
+       p = string_nocheck(p, pend, "range ", default_str_spec);
+       p = special_hex_number(p, pend, range->start, sizeof(range->start));
+       *p++ = '-';
+       p = special_hex_number(p, pend, range->end, sizeof(range->end));
+       *p++ = ']';
+       *p = '\0';
+
+       return string_nocheck(buf, end, sym, spec);
+}


Also this is the bulk of the patch except for documentation and the new
testing code.  [new patch below]

Am I missing your point somehow?  I considered cramming a struct range into a
struct resource to let resource_string() process the data.  But that would
involve creating a new IORESOURCE_* flag (not ideal) and also does not allow
for the larger u64 data in struct range should this be a 32 bit physical
address config.

Most importantly that would not be much less code AFAICT.

Ira


[snip]
<new patch>

commit a5f0305d319eac7c6e480851378695f8bd42a3d0
Author: Ira Weiny <ira.weiny@intel.com>
Date:   Fri Jun 28 16:47:06 2024 -0500

    printk: Add print format (%par) for struct range

    The use of struct range in the CXL subsystem is growing.  In particular,
    the addition of Dynamic Capacity devices uses struct range in a number
    of places which are reported in debug and error messages.

    To wit requiring the printing of the start/end fields in each print
    became cumbersome.  Dan Williams mentions in [1] that it might be time
    to have a print specifier for struct range similar to struct resource

    A few alternatives were considered including '%pn' for 'print raNge' but
    %par follows that struct range is most often used to store a range of
    physical addresses.  So use '%par' for 'print address range'.

    To: Petr Mladek <pmladek@suse.com> (maintainer:VSPRINTF)
    To: Steven Rostedt <rostedt@goodmis.org> (maintainer:VSPRINTF)
    To: Jonathan Corbet <corbet@lwn.net> (maintainer:DOCUMENTATION)
    Cc: linux-doc@vger.kernel.org (open list:DOCUMENTATION)
    Cc: linux-kernel@vger.kernel.org (open list)
    Link: https://lore.kernel.org/all/663922b475e50_d54d72945b@dwillia2-xfh.jf.intel.com.notmuch/ [1]
    Suggested-by: "Dan Williams" <dan.j.williams@intel.com>
    Signed-off-by: Ira Weiny <ira.weiny@intel.com>

    ---
    Changes:
    [iweiny: use special_hex_number()]
    [Petr: Update documentation]
    [Petr: use 'range -']
    [Petr: fixup printf_spec specifiers]
    [Petr: add lib/test_printf test]

diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst
index 4451ef501936..1bdfcd40c81e 100644
--- a/Documentation/core-api/printk-formats.rst
+++ b/Documentation/core-api/printk-formats.rst
@@ -231,6 +231,19 @@ width of the CPU data path.

 Passed by reference.

+Struct Range
+------------
+
+::
+
+       %par    [range 0x0000000060000000-0x000000006fffffff]
+
+For printing struct range.  A variation of printing a physical address is to
+print the value of struct range which are often used to hold a physical address
+range.
+
+Passed by reference.
+
 DMA address types dma_addr_t
 ----------------------------

diff --git a/lib/test_printf.c b/lib/test_printf.c
index 965cb6f28527..2f20b0c30024 100644
--- a/lib/test_printf.c
+++ b/lib/test_printf.c
@@ -388,6 +388,25 @@ struct_resource(void)
 {
 }

+static void __init
+struct_range(void)
+{
+       struct range test_range = {
+               .start = 0xc0ffee00ba5eba11,
+               .end = 0xc0ffee00ba5eba11,
+       };
+
+       test("[range 0xc0ffee00ba5eba11-0xc0ffee00ba5eba11]",
+            "%par", &test_range);
+
+       test_range = (struct range) {
+               .start = 0xc0ffee,
+               .end = 0xba5eba11,
+       };
+       test("[range 0x0000000000c0ffee-0x00000000ba5eba11]",
+            "%par", &test_range);
+}
+
 static void __init
 addr(void)
 {
@@ -789,6 +808,7 @@ test_pointer(void)
        symbol_ptr();
        kernel_ptr();
        struct_resource();
+       struct_range();
        addr();
        escaped_str();
        hex_string();
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 2d71b1115916..a754eefef252 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -1140,6 +1140,26 @@ char *resource_string(char *buf, char *end, struct resource *res,
        return string_nocheck(buf, end, sym, spec);
 }

+static noinline_for_stack
+char *range_string(char *buf, char *end, const struct range *range,
+                     struct printf_spec spec, const char *fmt)
+{
+#define RANGE_DECODED_BUF_SIZE         ((2 * sizeof(struct range)) + 4)
+#define RANGE_PRINT_BUF_SIZE           sizeof("[range -]")
+       char sym[RANGE_DECODED_BUF_SIZE + RANGE_PRINT_BUF_SIZE];
+       char *p = sym, *pend = sym + sizeof(sym);
+
+       *p++ = '[';
+       p = string_nocheck(p, pend, "range ", default_str_spec);
+       p = special_hex_number(p, pend, range->start, sizeof(range->start));
+       *p++ = '-';
+       p = special_hex_number(p, pend, range->end, sizeof(range->end));
+       *p++ = ']';
+       *p = '\0';
+
+       return string_nocheck(buf, end, sym, spec);
+}
+
 static noinline_for_stack
 char *hex_string(char *buf, char *end, u8 *addr, struct printf_spec spec,
                 const char *fmt)
@@ -1802,6 +1822,8 @@ char *address_val(char *buf, char *end, const void *addr,
                return buf;

        switch (fmt[1]) {
+       case 'r':
+               return range_string(buf, end, addr, spec, fmt);
        case 'd':
                num = *(const dma_addr_t *)addr;
                size = sizeof(dma_addr_t);
@@ -2364,6 +2386,8 @@ char *rust_fmt_argument(char *buf, char *end, void *ptr);
  *            to use print_hex_dump() for the larger input.
  * - 'a[pd]' For address types [p] phys_addr_t, [d] dma_addr_t and derivatives
  *           (default assumed to be phys_addr_t, passed by reference)
+ * - 'ar' For decoded struct ranges (a variation of physical address which are
+ *        most often stored in struct ranges.
  * - 'd[234]' For a dentry name (optionally 2-4 last components)
  * - 'D[234]' Same as 'd' but for a struct file
  * - 'g' For block_device name (gendisk + partition number)

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-26 21:17             ` Ira Weiny
@ 2024-08-27  7:43               ` Petr Mladek
  2024-08-27 13:21                 ` Andy Shevchenko
  2024-08-27 21:44                 ` Ira Weiny
  2024-08-27 13:17               ` Andy Shevchenko
  1 sibling, 2 replies; 120+ messages in thread
From: Petr Mladek @ 2024-08-27  7:43 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Andy Shevchenko, Dave Jiang, Fan Ni, Jonathan Cameron,
	Navneet Singh, Chris Mason, Josef Bacik, David Sterba,
	Steven Rostedt, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm

On Mon 2024-08-26 16:17:52, Ira Weiny wrote:
> Andy Shevchenko wrote:
> > On Mon, Aug 26, 2024 at 03:23:50PM +0200, Petr Mladek wrote:
> > > On Thu 2024-08-22 21:10:25, Andy Shevchenko wrote:
> > > > On Thu, Aug 22, 2024 at 12:53:32PM -0500, Ira Weiny wrote:
> > > > > Petr Mladek wrote:
> > > > > > On Fri 2024-08-16 09:44:10, Ira Weiny wrote:
> > 
> > ...
> > 
> > > > > > > +	%par	[range 0x60000000-0x6fffffff] or
> > > > > > 
> > > > > > It seems that it is always 64-bit. It prints:
> > > > > > 
> > > > > > struct range {
> > > > > > 	u64   start;
> > > > > > 	u64   end;
> > > > > > };
> > > > > 
> > > > > Indeed.  Thanks I should not have just copied/pasted.
> > > > 
> > > > With that said, I'm not sure the %pa is a good placeholder for this ('a' stands
> > > > to "address" AFAIU). Perhaps this should go somewhere under %pr/%pR?
> 
> I'm speaking a bit for Dan here but also the logical way I thought of
> things.
> 
> 1) %p does not dictate anything about the format of the data.  Rather
>    indicates that what is passed is a pointer.  Because we are passing a
>    pointer to a range struct %pXX makes sense.
> 2) %pa indicates what follows is 'address'.  This was a bit of creative
>    license because, as I said in the commit message most of the time
>    struct range contains an address range.  So for this narrow use case it
>    also makes sense.
> 3) %par r for range.

Yes. I got it.

Well, is struct range really used for addresses? It rather looks like
a range of any 64-bit values.

> %p[rR] is taken.  %pra confuses things IMO.

Another variants might be %pr64 or %prange.

IMHO, there is no good solution. We are trying to find the least
bad one. The meaning should be as obvious and as least confusing
as possible.

Honestly, I do not have a strong opinion. I kind of like %prange ;-)
But I could live with all other variants, except for %pn mentioned below.

> > > The r/R in %pr/%pR actually stands for "resource".
> > > 
> > > But "%ra" really looks like a better choice than "%par". Both
> > > "resource"  and "range" starts with 'r'. Also the struct resource
> > > is printed as a range of values.
> 
> %r could be used I think.  But this breaks with the convention of passing a
> pointer and how to interpret it.

How exactly does it break the convention, please?

Do you passing a pointer to struct range instead of a pointer to
struct resource?

It should not be a big problem as long as the vsprintf() code is
able to guess the right pointer type from the %pXX modifier.

> The other idea I had, mentioned in the commit
> message was %pn.  Meaning passed by pointer 'raNge'.

This looks like the worst variant to me.

> > Fine with me as long as it:
> > 1) doesn't collide with %pa namespace
> > 2) tries to deduplicate existing code as much as possible.
> 
> Andy, I'm not quite following how you expect to share the code between
> resource_string() and range_string()?
> 
> There is very little duplicated code.  In fact with Petr's suggestions and some
> more work range_string() is quite simple:
>
> +static noinline_for_stack
> +char *range_string(char *buf, char *end, const struct range *range,
> +                     struct printf_spec spec, const char *fmt)
> +{
> +#define RANGE_DECODED_BUF_SIZE         ((2 * sizeof(struct range)) + 4)
> +#define RANGE_PRINT_BUF_SIZE           sizeof("[range -]")
> +       char sym[RANGE_DECODED_BUF_SIZE + RANGE_PRINT_BUF_SIZE];
> +       char *p = sym, *pend = sym + sizeof(sym);
> +
> +       *p++ = '[';
> +       p = string_nocheck(p, pend, "range ", default_str_spec);
> +       p = special_hex_number(p, pend, range->start, sizeof(range->start));
> +       *p++ = '-';
> +       p = special_hex_number(p, pend, range->end, sizeof(range->end));
> +       *p++ = ']';
> +       *p = '\0';
> +
> +       return string_nocheck(buf, end, sym, spec);
> +}

I agree that there is not much duplicated code in the end.

> Also this is the bulk of the patch except for documentation and the new
> testing code.  [new patch below]
> 
> Am I missing your point somehow?  I considered cramming a struct range into a
> struct resource to let resource_string() process the data.  But that would
> involve creating a new IORESOURCE_* flag (not ideal) and also does not allow
> for the larger u64 data in struct range should this be a 32 bit physical
> address config.

This would be nasty. I believe that this is not what Andy meant.

Best Regards,
Petr

PS: I have vacation until the end of the week, so my next eventual
    reaction would be delayed.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents
  2024-08-23 21:32   ` Fan Ni
@ 2024-08-27 12:08     ` Jonathan Cameron
  2024-08-27 16:02       ` Fan Ni
  0 siblings, 1 reply; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-27 12:08 UTC (permalink / raw)
  To: Fan Ni
  Cc: ira.weiny, Dave Jiang, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 23 Aug 2024 14:32:32 -0700
Fan Ni <nifan.cxl@gmail.com> wrote:

> On Fri, Aug 16, 2024 at 09:44:26AM -0500, ira.weiny@intel.com wrote:
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > A dynamic capacity device (DCD) sends events to signal the host for
> > changes in the availability of Dynamic Capacity (DC) memory.  These
> > events contain extents describing a DPA range and meta data for memory
> > to be added or removed.  Events may be sent from the device at any time.
> > 
> > Three types of events can be signaled, Add, Release, and Force Release.
> > 
> > On add, the host may accept or reject the memory being offered.  If no
> > region exists, or the extent is invalid, the extent should be rejected.
> > Add extent events may be grouped by a 'more' bit which indicates those
> > extents should be processed as a group.
> > 
> > On remove, the host can delay the response until the host is safely not
> > using the memory.  If no region exists the release can be sent
> > immediately.  The host may also release extents (or partial extents) at
> > any time.  Thus the 'more' bit grouping of release events is of less
> > value and can be ignored in favor of sending multiple release capacity
> > responses for groups of release events.
> > 
> > Force removal is intended as a mechanism between the FM and the device
> > and intended only when the host is unresponsive, out of sync, or
> > otherwise broken.  Purposely ignore force removal events.
> > 
> > Regions are made up of one or more devices which may be surfacing memory
> > to the host.  Once all devices in a region have surfaced an extent the
> > region can expose a corresponding extent for the user to consume.
> > Without interleaving a device extent forms a 1:1 relationship with the
> > region extent.  Immediately surface a region extent upon getting a
> > device extent.
> > 
> > Per the specification the device is allowed to offer or remove extents
> > at any time.  However, anticipated use cases can expect extents to be
> > offered, accepted, and removed in well defined chunks.
> > 
> > Simplify extent tracking with the following restrictions.
> > 
> > 	1) Flag for removal any extent which overlaps a requested
> > 	   release range.
> > 	2) Refuse the offer of extents which overlap already accepted
> > 	   memory ranges.
> > 	3) Accept again a range which has already been accepted by the
> > 	   host.  (It is likely the device has an error because it
> > 	   should already know that this range was accepted.  But from
> > 	   the host point of view it is safe to acknowledge that
> > 	   acceptance again.)
> > 
> > Management of the region extent devices must be synchronized with
> > potential uses of the memory within the DAX layer.  Create region extent
> > devices as children of the cxl_dax_region device such that the DAX
> > region driver can co-drive them and synchronize with the DAX layer.
> > Synchronization and management is handled in a subsequent patch.
> > 
> > Process DCD events and create region devices.
> > 
> > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >   
> 
> One minor change inline.
Hi Fan,

Crop please.  I scanned past it 3 times when scrolling without noticing
what you'd actually commented on.

> > +/* See CXL 3.0 8.2.9.2.1.5 */  
> 
> Update the reference to reflect CXL 3.1.
> 
> Fan
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-26 21:17             ` Ira Weiny
  2024-08-27  7:43               ` Petr Mladek
@ 2024-08-27 13:17               ` Andy Shevchenko
  2024-08-28  4:12                 ` Ira Weiny
  1 sibling, 1 reply; 120+ messages in thread
From: Andy Shevchenko @ 2024-08-27 13:17 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Petr Mladek, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Steven Rostedt,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Mon, Aug 26, 2024 at 04:17:52PM -0500, Ira Weiny wrote:
> Andy Shevchenko wrote:
> > On Mon, Aug 26, 2024 at 03:23:50PM +0200, Petr Mladek wrote:
> > > On Thu 2024-08-22 21:10:25, Andy Shevchenko wrote:
> > > > On Thu, Aug 22, 2024 at 12:53:32PM -0500, Ira Weiny wrote:
> > > > > Petr Mladek wrote:
> > > > > > On Fri 2024-08-16 09:44:10, Ira Weiny wrote:

...

> > > > > > > +	%par	[range 0x60000000-0x6fffffff] or
> > > > > > 
> > > > > > It seems that it is always 64-bit. It prints:
> > > > > > 
> > > > > > struct range {
> > > > > > 	u64   start;
> > > > > > 	u64   end;
> > > > > > };
> > > > > 
> > > > > Indeed.  Thanks I should not have just copied/pasted.
> > > > 
> > > > With that said, I'm not sure the %pa is a good placeholder for this ('a' stands
> > > > to "address" AFAIU). Perhaps this should go somewhere under %pr/%pR?
> 
> I'm speaking a bit for Dan here but also the logical way I thought of
> things.
> 
> 1) %p does not dictate anything about the format of the data.  Rather
>    indicates that what is passed is a pointer.  Because we are passing a
>    pointer to a range struct %pXX makes sense.

There is no objection to that.

> 2) %pa indicates what follows is 'address'.  This was a bit of creative
>    license because, as I said in the commit message most of the time
>    struct range contains an address range.  So for this narrow use case it
>    also makes sense.

As in the discussion it was pointed out that struct range is always 64-bit,
limiting it to the "address" is a wrong assumption as we are talking generic
printing routine here. We don't know what users will be in the future on 32-bit
platforms, or what data (semantically) is being held by this structure.

> 3) %par r for range.

I understand, but again struct range != address.

> %p[rR] is taken.
> %pra confuses things IMO.

It doesn't confuse me. :-) But I believe Petr also has a rationale behind this
proposal as he described earlier.

> > > The r/R in %pr/%pR actually stands for "resource".
> > > 
> > > But "%ra" really looks like a better choice than "%par". Both
> > > "resource"  and "range" starts with 'r'. Also the struct resource
> > > is printed as a range of values.
> 
> %r could be used I think.  But this breaks with the convention of passing a
> pointer and how to interpret it.  The other idea I had, mentioned in the commit
> message was %pn.  Meaning passed by pointer 'raNge'.

No, we can't use %r or anything else that is documented for the standard
printf() format specifiers, otherwise you will get a compiler warning and
basically it means no go.

> I think that follows better than %r.  That would be another break from C99.
> But we don't have to follow that.
> 
> > Fine with me as long as it:
> > 1) doesn't collide with %pa namespace
> > 2) tries to deduplicate existing code as much as possible.
> 
> Andy, I'm not quite following how you expect to share the code between
> resource_string() and range_string()?
> 
> There is very little duplicated code.  In fact with Petr's suggestions and some
> more work range_string() is quite simple:
> 
> +static noinline_for_stack
> +char *range_string(char *buf, char *end, const struct range *range,
> +                     struct printf_spec spec, const char *fmt)
> +{
> +#define RANGE_DECODED_BUF_SIZE         ((2 * sizeof(struct range)) + 4)
> +#define RANGE_PRINT_BUF_SIZE           sizeof("[range -]")
> +       char sym[RANGE_DECODED_BUF_SIZE + RANGE_PRINT_BUF_SIZE];
> +       char *p = sym, *pend = sym + sizeof(sym);


Missing check for pointer, but it's not that I wanted to tell.

> +       *p++ = '[';
> +       p = string_nocheck(p, pend, "range ", default_str_spec);

Hmm... %pr uses str_spec, what the difference can be here?

> +       p = special_hex_number(p, pend, range->start, sizeof(range->start));
> +       *p++ = '-';
> +       p = special_hex_number(p, pend, range->end, sizeof(range->end));

This is basically the copy of %pr implementation.

	p = number(p, pend, res->start, *specp);
	if (res->start != res->end) {
		*p++ = '-';
		p = number(p, pend, res->end, *specp);
	}

Would it be possible to unify? I think so, but it requires a bit of thinking.

That's why testing is very important in this kind of generic code.

> +       *p++ = ']';
> +       *p = '\0';
> +
> +       return string_nocheck(buf, end, sym, spec);
> +}
> 
> Also this is the bulk of the patch except for documentation and the new
> testing code.  [new patch below]
> 
> Am I missing your point somehow?

See above.

> I considered cramming a struct range into a
> struct resource to let resource_string() process the data.  But that would
> involve creating a new IORESOURCE_* flag (not ideal) and also does not allow
> for the larger u64 data in struct range should this be a 32 bit physical
> address config.

No, that's not what I was expecting.

> Most importantly that would not be much less code AFAICT.

...

> +       %par    [range 0x0000000060000000-0x000000006fffffff]

I still think this is not okay to use %pa namespace.

...

> +static void __init
> +struct_range(void)
> +{
> +       struct range test_range = {
> +               .start = 0xc0ffee00ba5eba11,
> +               .end = 0xc0ffee00ba5eba11,
> +       };
> +
> +       test("[range 0xc0ffee00ba5eba11-0xc0ffee00ba5eba11]",
> +            "%par", &test_range);
> +
> +       test_range = (struct range) {
> +               .start = 0xc0ffee,
> +               .end = 0xba5eba11,
> +       };
> +       test("[range 0x0000000000c0ffee-0x00000000ba5eba11]",
> +            "%par", &test_range);

Case when start == end?
Case when end < start?

> +}

...

> +       *p++ = '[';
> +       p = string_nocheck(p, pend, "range ", default_str_spec);
> +       p = special_hex_number(p, pend, range->start, sizeof(range->start));
> +       *p++ = '-';
> +       p = special_hex_number(p, pend, range->end, sizeof(range->end));
> +       *p++ = ']';
> +       *p = '\0';

As per above comments.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents
  2024-08-16 14:44 ` [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents ira.weiny
  2024-08-19 18:51   ` Dave Jiang
  2024-08-23 21:32   ` Fan Ni
@ 2024-08-27 13:18   ` Jonathan Cameron
  2024-08-29 21:16     ` Ira Weiny
  2024-09-03  6:37   ` Li, Ming4
  2024-09-05 19:30   ` Fan Ni
  4 siblings, 1 reply; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-27 13:18 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:26 -0500
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> A dynamic capacity device (DCD) sends events to signal the host for
> changes in the availability of Dynamic Capacity (DC) memory.  These
> events contain extents describing a DPA range and meta data for memory
> to be added or removed.  Events may be sent from the device at any time.
> 
> Three types of events can be signaled, Add, Release, and Force Release.
> 
> On add, the host may accept or reject the memory being offered.  If no
> region exists, or the extent is invalid, the extent should be rejected.
> Add extent events may be grouped by a 'more' bit which indicates those
> extents should be processed as a group.
> 
> On remove, the host can delay the response until the host is safely not
> using the memory.  If no region exists the release can be sent
> immediately.  The host may also release extents (or partial extents) at
> any time.  Thus the 'more' bit grouping of release events is of less
> value and can be ignored in favor of sending multiple release capacity
> responses for groups of release events.
> 
> Force removal is intended as a mechanism between the FM and the device
> and intended only when the host is unresponsive, out of sync, or
> otherwise broken.  Purposely ignore force removal events.
> 
> Regions are made up of one or more devices which may be surfacing memory
> to the host.  Once all devices in a region have surfaced an extent the
> region can expose a corresponding extent for the user to consume.
> Without interleaving a device extent forms a 1:1 relationship with the
> region extent.  Immediately surface a region extent upon getting a
> device extent.
> 
> Per the specification the device is allowed to offer or remove extents
> at any time.  However, anticipated use cases can expect extents to be
> offered, accepted, and removed in well defined chunks.
> 
> Simplify extent tracking with the following restrictions.
> 
> 	1) Flag for removal any extent which overlaps a requested
> 	   release range.
> 	2) Refuse the offer of extents which overlap already accepted
> 	   memory ranges.
> 	3) Accept again a range which has already been accepted by the
> 	   host.  (It is likely the device has an error because it
> 	   should already know that this range was accepted.  But from
> 	   the host point of view it is safe to acknowledge that
> 	   acceptance again.)
> 
> Management of the region extent devices must be synchronized with
> potential uses of the memory within the DAX layer.  Create region extent
> devices as children of the cxl_dax_region device such that the DAX
> region driver can co-drive them and synchronize with the DAX layer.
> Synchronization and management is handled in a subsequent patch.
> 
> Process DCD events and create region devices.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 

A few minor bits and pieces inline.

Jonathan

> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> new file mode 100644
> index 000000000000..34456594cdc3
> --- /dev/null
> +++ b/drivers/cxl/core/extent.c



> +static int match_contains(struct device *dev, void *data)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	struct match_data *md = data;
> +	struct cxled_extent *entry;
> +	unsigned long index;
> +
> +	if (!region_extent)
> +		return 0;
> +
> +	xa_for_each(&region_extent->decoder_extents, index, entry) {
> +		if (md->cxled == entry->cxled &&
> +		    range_contains(&entry->dpa_range, md->new_range))
> +			return true;
As below, this returns int, so shouldn't be true or false.

> +	}
> +	return false;
> +}

> +static int match_overlaps(struct device *dev, void *data)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	struct match_data *md = data;
> +	struct cxled_extent *entry;
> +	unsigned long index;
> +
> +	if (!region_extent)
> +		return 0;
> +
> +	xa_for_each(&region_extent->decoder_extents, index, entry) {
> +		if (md->cxled == entry->cxled &&
> +		    range_overlaps(&entry->dpa_range, md->new_range))
> +			return true;

returns int, so returning true or false is odd.

> +	}
> +
> +	return false;
> +}


> +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
> +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range hpa_range, dpa_range;
> +	struct cxl_region *cxlr;
> +
> +	dpa_range = (struct range) {
> +		.start = start_dpa,
> +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> +	};
> +
> +	guard(rwsem_read)(&cxl_region_rwsem);
> +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> +	if (!cxlr) {
> +		memdev_release_extent(mds, &dpa_range);

How does this condition happen?  Perhaps a comment needed.

> +		return -ENXIO;
> +	}
> +
> +	calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
> +
> +	/* Remove region extents which overlap */
> +	return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
> +				     cxlr_rm_extent);
> +}
> +
> +static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
> +			   struct cxl_endpoint_decoder *cxled,
> +			   struct cxled_extent *ed_extent)
> +{
> +	struct region_extent *region_extent;
> +	struct range hpa_range;
> +	int rc;
> +
> +	calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range);
> +
> +	region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag);
> +	if (IS_ERR(region_extent))
> +		return PTR_ERR(region_extent);
> +
> +	rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent, ed_extent,

I'd wrap that earlier to keep the line a bit shorter.

> +		       GFP_KERNEL);
> +	if (rc) {
> +		free_region_extent(region_extent);
> +		return rc;
> +	}
> +
> +	/* device model handles freeing region_extent */
> +	return online_region_extent(region_extent);
> +}
> +
> +/* Callers are expected to ensure cxled has been attached to a region */
> +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
> +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range ed_range, ext_range;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct cxled_extent *ed_extent;
> +	struct cxl_region *cxlr;
> +	struct device *dev;
> +
> +	ext_range = (struct range) {
> +		.start = start_dpa,
> +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> +	};
> +
> +	guard(rwsem_read)(&cxl_region_rwsem);
> +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> +	if (!cxlr)
> +		return -ENXIO;
> +
> +	cxlr_dax = cxled->cxld.region->cxlr_dax;
> +	dev = &cxled->cxld.dev;
> +	ed_range = (struct range) {
> +		.start = cxled->dpa_res->start,
> +		.end = cxled->dpa_res->end,
> +	};
> +
> +	dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n",
> +		cxled->dpa_res, &ext_range);
> +
> +	if (!range_contains(&ed_range, &ext_range)) {
> +		dev_err_ratelimited(dev,
> +				    "DC extent DPA %par (%*phC) is not fully in ED %par\n",
> +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> +				    extent->tag, &ed_range);
> +		return -ENXIO;
> +	}
> +
> +	if (extents_contain(cxlr_dax, cxled, &ext_range))

This case confuses me. If the extents are already there I think we should
error out or at least print something as that's very wrong.

> +		return 0;
> +
> +	if (extents_overlap(cxlr_dax, cxled, &ext_range))
> +		return -ENXIO;
> +
> +	ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL);
> +	if (!ed_extent)
> +		return -ENOMEM;
> +
> +	ed_extent->cxled = cxled;
> +	ed_extent->dpa_range = ext_range;
> +	memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN);
> +
> +	dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range,
> +		CXL_EXTENT_TAG_LEN, ed_extent->tag);
> +
> +	return cxlr_add_extent(cxlr_dax, cxled, ed_extent);
> +}
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 01a447aaa1b1..f629ad7488ac 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>  
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> +			       struct cxl_extent *extent)
> +{
> +	u64 start = le64_to_cpu(extent->start_dpa);
> +	u64 length = le64_to_cpu(extent->length);
> +	struct device *dev = mds->cxlds.dev;
> +
> +	struct range ext_range = (struct range){
> +		.start = start,
> +		.end = start + length - 1,
> +	};
> +
> +	if (le16_to_cpu(extent->shared_extn_seq) != 0) {

That's not the 'main' way to tell if an extent is shared because
we could have a single extent (so seq == 0).
Should verify it's not in a DCD region that
is shareable to make this decision.

I've lost track on the region handling so maybe you already do
this by not including those regions at all?

> +		dev_err_ratelimited(dev,
> +				    "DC extent DPA %par (%*phC) can not be shared\n",
> +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> +				    extent->tag);
> +		return -ENXIO;
> +	}
> +
> +	/* Extents must not cross DC region boundary's */
> +	for (int i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +		struct range region_range = (struct range) {
> +			.start = dcr->base,
> +			.end = dcr->base + dcr->decode_len - 1,
> +		};
> +
> +		if (range_contains(&region_range, &ext_range)) {
> +			dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n",
> +				&ext_range, i, start - dcr->base,
> +				CXL_EXTENT_TAG_LEN, extent->tag);
> +			return 0;
> +		}
> +	}
> +
> +	dev_err_ratelimited(dev,
> +			    "DC extent DPA %par (%*phC) is not in any DC region\n",
> +			    &ext_range, CXL_EXTENT_TAG_LEN, extent->tag);
> +	return -ENXIO;
> +}
> +
>  void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  			    enum cxl_event_log_type type,
>  			    enum cxl_event_type event_type,
> @@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
>  	return rc;
>  }
>  
> +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> +				struct xarray *extent_array, int cnt)
> +{
> +	struct cxl_mbox_dc_response *p;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	struct cxl_extent *extent;
> +	unsigned long index;
> +	u32 pl_index;
> +	int rc = 0;
> +
> +	size_t pl_size = struct_size(p, extent_list, cnt);
> +	u32 max_extents = cnt;
> +
What is cnt is zero? All extents rejected so none in the
extent_array. Need to send a zero extent response to reject
them all IIRC.

> +	/* May have to use more bit on response. */
> +	if (pl_size > mds->payload_size) {
> +		max_extents = (mds->payload_size - sizeof(*p)) /
> +			      sizeof(struct updated_extent_list);
> +		pl_size = struct_size(p, extent_list, max_extents);
> +	}
> +
> +	struct cxl_mbox_dc_response *response __free(kfree) =
> +						kzalloc(pl_size, GFP_KERNEL);
> +	if (!response)
> +		return -ENOMEM;
> +
> +	pl_index = 0;
> +	xa_for_each(extent_array, index, extent) {
> +
> +		response->extent_list[pl_index].dpa_start = extent->start_dpa;
> +		response->extent_list[pl_index].length = extent->length;
> +		pl_index++;
> +		response->extent_list_size = cpu_to_le32(pl_index);
> +
> +		if (pl_index == max_extents) {
> +			mbox_cmd = (struct cxl_mbox_cmd) {
> +				.opcode = opcode,
> +				.size_in = struct_size(response, extent_list,
> +						       pl_index),
> +				.payload_in = response,
> +			};
> +
> +			response->flags = 0;
> +			if (pl_index < cnt)
> +				response->flags &= CXL_DCD_EVENT_MORE;
> +
> +			rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +			if (rc)
> +				return rc;
> +			pl_index = 0;
> +		}
> +	}
> +
> +	if (pl_index) {
|| !cnt 

I think so we send a nothing accepted message.

> +		mbox_cmd = (struct cxl_mbox_cmd) {
> +			.opcode = opcode,
> +			.size_in = struct_size(response, extent_list,
> +					       pl_index),
> +			.payload_in = response,
> +		};
> +
> +		response->flags = 0;
> +		rc = cxl_internal_send_cmd(mds, &mbox_cmd);
		if (rc)
			return rc;
> +	}
> +

return 0;  So that reader doesn't have to check what rc was in !pl_index
case and avoids assigning rc right at the top.


> +	return rc;
> +}


> +static int cxl_add_pending(struct cxl_memdev_state *mds)
> +{
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_extent *extent;
> +	unsigned long index;
> +	unsigned long cnt = 0;
> +	int rc;
> +
> +	xa_for_each(&mds->pending_extents, index, extent) {
> +		if (validate_add_extent(mds, extent)) {


Add a comment here that not accepting an extent but
accepting some or none means this one was rejected (I'd forgotten how
that bit worked)

> +			dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n",
> +				le64_to_cpu(extent->start_dpa),
> +				le64_to_cpu(extent->length));
> +			xa_erase(&mds->pending_extents, index);
> +			kfree(extent);
> +			continue;
> +		}
> +		cnt++;
> +	}
> +	rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> +				  &mds->pending_extents, cnt);
> +	xa_for_each(&mds->pending_extents, index, extent) {
> +		xa_erase(&mds->pending_extents, index);
> +		kfree(extent);
> +	}
> +	return rc;
> +}
> +
> +static int handle_add_event(struct cxl_memdev_state *mds,
> +			    struct cxl_event_dcd *event)
> +{
> +	struct cxl_extent *tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
> +	struct device *dev = mds->cxlds.dev;
> +
> +	if (!tmp)
> +		return -ENOMEM;
> +
> +	memcpy(tmp, &event->extent, sizeof(*tmp));

kmemdup?

> +	if (xa_insert(&mds->pending_extents, (unsigned long)tmp, tmp,
> +		      GFP_KERNEL)) {
> +		kfree(tmp);
> +		return -ENOMEM;
> +	}
> +
> +	if (event->flags & CXL_DCD_EVENT_MORE) {
> +		dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
> +		return 0;
> +	}
> +
> +	/* extents are removed and free'ed in cxl_add_pending() */
> +	return cxl_add_pending(mds);
> +}

>  static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>  				    enum cxl_event_log_type type)
>  {
> @@ -1044,9 +1287,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>  		if (!nr_rec)
>  			break;
>  
> -		for (i = 0; i < nr_rec; i++)
> +		for (i = 0; i < nr_rec; i++) {
>  			__cxl_event_trace_record(cxlmd, type,
>  						 &payload->records[i]);
> +			if (type == CXL_EVENT_TYPE_DCD) {
Bit of a deep indent so maybe flip logic?

Logic wise it's a bit dubious as we might want to match other
types in future though so up to you.

			if (type != CXL_EVENT_TYPE_DCD)
				continue;

			rc = 

> +				rc = cxl_handle_dcd_event_records(mds,
> +								  &payload->records[i]);
> +				if (rc)
> +					dev_err_ratelimited(dev, "dcd event failed: %d\n",
> +							    rc);
> +			}
> +		}
>  

>  struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>  {
>  	struct cxl_memdev_state *mds;
> @@ -1628,6 +1892,8 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>  	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
>  	mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID;
>  	mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID;
> +	xa_init(&mds->pending_extents);
> +	devm_add_action_or_reset(dev, clear_pending_extents, mds);

Why don't you need to check if this failed? Definitely seems unlikely
to leave things in a good state. Unlikely to fail of course, but you never know.

>  
>  	return mds;
>  }

> @@ -3090,6 +3091,8 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
>  
>  	dev = &cxlr_dax->dev;
>  	cxlr_dax->cxlr = cxlr;
> +	cxlr->cxlr_dax = cxlr_dax;
> +	ida_init(&cxlr_dax->extent_ida);
>  	device_initialize(dev);
>  	lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
>  	device_set_pm_not_required(dev);
> @@ -3190,7 +3193,10 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
>  static void cxlr_dax_unregister(void *_cxlr_dax)
>  {
>  	struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> +	struct cxl_region *cxlr = cxlr_dax->cxlr;
>  
> +	cxlr->cxlr_dax = NULL;
> +	cxlr_dax->cxlr = NULL;

cxlr_dax->cxlr was assigned before this patch. 

I'm not seeing any new checks on these being non null so why
are the needed?  If there is a good reason for this then
a comment would be useful.

>  	device_unregister(&cxlr_dax->dev);
>  }
>  



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-27  7:43               ` Petr Mladek
@ 2024-08-27 13:21                 ` Andy Shevchenko
  2024-08-27 21:44                 ` Ira Weiny
  1 sibling, 0 replies; 120+ messages in thread
From: Andy Shevchenko @ 2024-08-27 13:21 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Steven Rostedt,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Tue, Aug 27, 2024 at 09:43:32AM +0200, Petr Mladek wrote:
> On Mon 2024-08-26 16:17:52, Ira Weiny wrote:
> > Andy Shevchenko wrote:

...

> But I could live with all other variants, except for %pn mentioned below.

I believe %r is also no go as we most likely get a complier warning.

...

> > Am I missing your point somehow?  I considered cramming a struct range into a
> > struct resource to let resource_string() process the data.  But that would
> > involve creating a new IORESOURCE_* flag (not ideal) and also does not allow
> > for the larger u64 data in struct range should this be a 32 bit physical
> > address config.
> 
> This would be nasty. I believe that this is not what Andy meant.

You are right, this is not what I meant.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 20/25] dax/bus: Factor out dev dax resize logic
  2024-08-16 14:44 ` [PATCH v3 20/25] dax/bus: Factor out dev dax resize logic Ira Weiny
  2024-08-19 22:35   ` Dave Jiang
@ 2024-08-27 13:26   ` Jonathan Cameron
  2024-08-29 21:36     ` Ira Weiny
  1 sibling, 1 reply; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-27 13:26 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:28 -0500
Ira Weiny <ira.weiny@intel.com> wrote:

> Dynamic Capacity regions must limit dev dax resources to those areas
> which have extents backing real memory.  Such DAX regions are dubbed
> 'sparse' regions.  In order to manage where memory is available four
> alternatives were considered:
> 
> 1) Create a single region resource child on region creation which
>    reserves the entire region.  Then as extents are added punch holes in
>    this reservation.  This requires new resource manipulation to punch
>    the holes and still requires an additional iteration over the extent
>    areas which may already have existing dev dax resources used.
> 
> 2) Maintain an ordered xarray of extents which can be queried while
>    processing the resize logic.  The issue is that existing region->res
>    children may artificially limit the allocation size sent to
>    alloc_dev_dax_range().  IE the resource children can't be directly
>    used in the resize logic to find where space in the region is.  This
>    also poses a problem of managing the available size in 2 places.
> 
> 3) Maintain a separate resource tree with extents.  This option is the
>    same as 2) but with the different data structure.  Most ideally there
>    should be a unified representation of the resource tree not two places
>    to look for space.
> 
> 4) Create region resource children for each extent.  Manage the dax dev
>    resize logic in the same way as before but use a region child
>    (extent) resource as the parents to find space within each extent.
> 
> Option 4 can leverage the existing resize algorithm to find space within
> the extents.  It manages the available space in a singular resource tree
> which is less complicated for finding space.
> 
> In preparation for this change, factor out the dev_dax_resize logic.
> For static regions use dax_region->res as the parent to find space for
> the dax ranges.  Future patches will use the same algorithm with
> individual extent resources as the parent.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
I'm not 100% confident on this one, so will probably take another look
before giving a tag.
One trivial comment below.



> +static ssize_t dev_dax_resize(struct dax_region *dax_region,
> +		struct dev_dax *dev_dax, resource_size_t size)
> +{
> +	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
> +	resource_size_t dev_size = dev_dax_size(dev_dax);
> +	struct device *dev = &dev_dax->dev;
> +	resource_size_t alloc = 0;

No path in which this is not set before use.

> +
> +	if (dev->driver)
> +		return -EBUSY;
> +	if (size == dev_size)
> +		return 0;
> +	if (size > dev_size && size - dev_size > avail)
> +		return -ENOSPC;
> +	if (size < dev_size)
> +		return dev_dax_shrink(dev_dax, size);
> +
> +	to_alloc = size - dev_size;
> +	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
> +			"resize of %pa misaligned\n", &to_alloc))
> +		return -ENXIO;
> +
> +retry:
> +	alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
> +	if (alloc <= 0)
> +		return alloc;
>  	to_alloc -= alloc;
>  	if (to_alloc)
>  		goto retry;



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 21/25] dax/region: Create resources on sparse DAX regions
  2024-08-16 14:44 ` [PATCH v3 21/25] dax/region: Create resources on sparse DAX regions ira.weiny
  2024-08-18 11:38   ` Markus Elfring
  2024-08-19 23:30   ` Dave Jiang
@ 2024-08-27 14:12   ` Jonathan Cameron
  2024-08-29 21:54     ` Ira Weiny
  2 siblings, 1 reply; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-27 14:12 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:29 -0500
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> DAX regions which map dynamic capacity partitions require that memory be
> allowed to come and go.  Recall sparse regions were created for this
> purpose.  Now that extents can be realized within DAX regions the DAX
> region driver can start tracking sub-resource information.
> 
> The tight relationship between DAX region operations and extent
> operations require memory changes to be controlled synchronously with
> the user of the region.  Synchronize through the dax_region_rwsem and by
> having the region driver drive both the region device as well as the
> extent sub-devices.
> 
> Recall requests to remove extents can happen at any time and that a host
> is not obligated to release the memory until it is not being used.  If
> an extent is not used allow a release response.
> 
> The DAX layer has no need for the details of the CXL memory extent
> devices.  Expose extents to the DAX layer as device children of the DAX
> region device.  A single callback from the driver aids the DAX layer to
> determine if the child device is an extent.  The DAX layer also
> registers a devres function to automatically clean up when the device is
> removed from the region.
> 
> There is a race between extents being surfaced and the dax_cxl driver
> being loaded.  The driver must therefore scan for any existing extents
> while still under the device lock.
> 
> Respond to extent notifications.  Manage the DAX region resource tree
> based on the extents lifetime.  Return the status of remove
> notifications to lower layers such that it can manage the hardware
> appropriately.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
A few minor comments inline.

Jonathan

> 
> ---
> Changes:
> [iweiny: patch reorder]
> [iweiny: move hunks from other patches to clarify code changes and
>          add/release flows WRT dax regions]
> [iweiny: use %par]
> [iweiny: clean up variable names]
> [iweiny: Simplify sparse_ops]
> [Fan: avoid open coding range_len()]
> [djbw: s/reg_ext/region_extent]
> ---
>  drivers/cxl/core/extent.c |  76 +++++++++++++--
>  drivers/cxl/cxl.h         |   6 ++
>  drivers/dax/bus.c         | 243 +++++++++++++++++++++++++++++++++++++++++-----
>  drivers/dax/bus.h         |   3 +-
>  drivers/dax/cxl.c         |  63 +++++++++++-
>  drivers/dax/dax-private.h |  34 +++++++
>  drivers/dax/hmem/hmem.c   |   2 +-
>  drivers/dax/pmem.c        |   2 +-
>  8 files changed, 391 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index d7d526a51e2b..103b0bec3a4a 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -271,20 +271,67 @@ static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
>  	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
>  }
>  
> +static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> +			      struct region_extent *region_extent)
> +{
> +	struct cxl_dax_region *cxlr_dax;
> +	struct device *dev;
> +	int rc = 0;
> +
> +	cxlr_dax = cxlr->cxlr_dax;
> +	dev = &cxlr_dax->dev;
> +	dev_dbg(dev, "Trying notify: type %d HPA %par\n",
> +		event, &region_extent->hpa_range);
> +
> +	/*
> +	 * NOTE the lack of a driver indicates a notification has failed.  No
> +	 * user space coordiantion was possible.
> +	 */
> +	device_lock(dev);

I'd use guard() for this as then can just return the notify result
and drop local variable rc.


> +	if (dev->driver) {
> +		struct cxl_driver *driver = to_cxl_drv(dev->driver);
> +		struct cxl_notify_data notify_data = (struct cxl_notify_data) {
> +			.event = event,
> +			.region_extent = region_extent,
> +		};
> +
> +		if (driver->notify) {
> +			dev_dbg(dev, "Notify: type %d HPA %par\n",
> +				event, &region_extent->hpa_range);
> +			rc = driver->notify(dev, &notify_data);
> +		}
> +	}
> +	device_unlock(dev);
> +	return rc;
> +}
>
> @@ -338,8 +390,20 @@ static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
>  		return rc;
>  	}
>  
> -	/* device model handles freeing region_extent */
> -	return online_region_extent(region_extent);
> +	rc = online_region_extent(region_extent);
> +	/* device model handled freeing region_extent */
> +	if (rc)
> +		return rc;
> +
> +	rc = cxlr_notify_extent(cxlr_dax->cxlr, DCD_ADD_CAPACITY, region_extent);
> +	/*
> +	 * The region device was breifly live but DAX layer ensures it was not

briefly

> +	 * used
> +	 */
> +	if (rc)
> +		region_rm_extent(region_extent);	
> +
> +	return rc;
>  }

> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 975860371d9f..f14b0cfa7edd 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c

> +EXPORT_SYMBOL_GPL(dax_region_add_resource);
> +
> +int dax_region_rm_resource(struct dax_region *dax_region,
> +			   struct device *dev)
> +{
> +	struct dax_resource *dax_resource;
> +
> +	guard(rwsem_write)(&dax_region_rwsem);
> +
> +	dax_resource = dev_get_drvdata(dev);
> +	if (!dax_resource)
> +		return 0;
> +
> +	if (dax_resource->use_cnt)
> +		return -EBUSY;
> +
> +	/* avoid races with users trying to use the extent */

Not obvious to me from local code, why does releasing the resource
here avoid a race?  Perhaps the comment needs expanding.

> +	__dax_release_resource(dax_resource);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(dax_region_rm_resource);
> +


> +static ssize_t dev_dax_resize_sparse(struct dax_region *dax_region,
> +				     struct dev_dax *dev_dax,
> +				     resource_size_t to_alloc)
> +{
> +	struct dax_resource *dax_resource;
> +	resource_size_t available_size;
> +	struct device *extent_dev;
> +	ssize_t alloc;
> +
> +	extent_dev = device_find_child(dax_region->dev, dax_region,
> +				       find_free_extent);

There is a __free for put device and it will tidy this up a tiny bit.

> +	if (!extent_dev)
> +		return 0;
> +
> +	dax_resource = dev_get_drvdata(extent_dev);
> +	if (!dax_resource)
> +		return 0;
> +
> +	available_size = dax_avail_size(dax_resource->res);
> +	to_alloc = min(available_size, to_alloc);
I'd put those two inline and skip the local variables unless
they have more use in later patches.

	alloc = __dev_dax_resize(dax_resources->res, dev_dax,
				 min(dax_avail_size(dax_resources->res), to_alloc),
				 dax_resource);
	
				
> +	alloc = __dev_dax_resize(dax_resource->res, dev_dax, to_alloc, dax_resource);
> +	if (alloc > 0)
> +		dax_resource->use_cnt++;
> +	put_device(extent_dev);
> +	return alloc;
> +}
> +

> @@ -1494,8 +1679,14 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
>  	device_initialize(dev);
>  	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
>  
> +	if (is_sparse(dax_region) && data->size) {
> +		dev_err(parent, "Sparse DAX region devices are created initially with 0 size");
must be created initially with 0 size.

Otherwise this error message says that they are, so why is it an error?

> +		rc = -EINVAL;
> +		goto err_id;
> +	}
> +
>  	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
> -				 data->size);
> +				 data->size, NULL);
>  	if (rc)
>  		goto err_range;
>  

> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 367e86b1c22a..bf3b82b0120d 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -5,6 +5,60 @@

...

> +static int cxl_dax_region_notify(struct device *dev,
> +				 struct cxl_notify_data *notify_data)
> +{
> +	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
> +	struct dax_region *dax_region = dev_get_drvdata(dev);
> +	struct region_extent *region_extent = notify_data->region_extent;
> +
> +	switch (notify_data->event) {
> +	case DCD_ADD_CAPACITY:
> +		return __cxl_dax_add_resource(dax_region, region_extent);
> +	case DCD_RELEASE_CAPACITY:
> +		return dax_region_rm_resource(dax_region, &region_extent->dev);
> +	case DCD_FORCED_CAPACITY_RELEASE:
> +	default:
> +		dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
> +			notify_data->event);
> +		break;
Might as well return here and not below.
Makes it really really obvious this is the error path and currently the only
one that hits the return statement.
> +	}
> +
> +	return -ENXIO;
> +}

>  static int cxl_dax_region_probe(struct device *dev)
>  {
> @@ -24,14 +78,16 @@ static int cxl_dax_region_probe(struct device *dev)
>  		flags |= IORESOURCE_DAX_SPARSE_CAP;
>  
>  	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> -				      PMD_SIZE, flags);
> +				      PMD_SIZE, flags, &sparse_ops);
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> -	if (cxlr->mode == CXL_REGION_DC)
> +	if (cxlr->mode == CXL_REGION_DC) {
> +		device_for_each_child(&cxlr_dax->dev, dax_region,
> +				      cxl_dax_add_resource);
>  		/* Add empty seed dax device */
>  		dev_size = 0;
> -	else
> +	} else

Coding style says that you need brackets for all branches if
one needs them (as multiline).  Just above:
https://www.kernel.org/doc/html/v4.10/process/coding-style.html#spaces


>  		dev_size = range_len(&cxlr_dax->hpa_range);
>  
>  	data = (struct dev_dax_data) {


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 22/25] cxl/region: Read existing extents on region creation
  2024-08-16 14:44 ` [PATCH v3 22/25] cxl/region: Read existing extents on region creation ira.weiny
  2024-08-20  0:06   ` Dave Jiang
@ 2024-08-27 14:19   ` Jonathan Cameron
  2024-09-05 19:35   ` Fan Ni
  2 siblings, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-27 14:19 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:30 -0500
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> Dynamic capacity device extents may be left in an accepted state on a
> device due to an unexpected host crash.  In this case it is expected
> that the creation of a new region on top of a DC partition can read
> those extents and surface them for continued use.
> 
> Once all endpoint decoders are part of a region and the region is being
> realized a read of the devices extent list can reveal these previously
> accepted extents.
> 
> CXL r3.1 specifies the mailbox call Get Dynamic Capacity Extent List for
> this purpose.  The call returns all the extents for all dynamic capacity
> partitions.  If the fabric manager is adding extents to any DCD
> partition, the extent list for the recovered region may change.  In this
> case the query must retry.  Upon retry the query could encounter extents
> which were accepted on a previous list query.  Adding such extents is
> ignored without error because they are entirely within a previous
> accepted extent.
> 
> The scan for existing extents races with the dax_cxl driver.  This is
> synchronized through the region device lock.  Extents which are found
> after the driver has loaded will surface through the normal notification
> path while extents seen prior to the driver are read during driver load.

Ah. So the earlier code to just eat duplicates was to handle this race.
Add a comment there perhaps so people like me get less confused :)

Jonathan

> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

LGTM
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 23/25] cxl/mem: Trace Dynamic capacity Event Record
  2024-08-16 14:44 ` [PATCH v3 23/25] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
  2024-08-20 22:54   ` Dave Jiang
@ 2024-08-27 14:20   ` Jonathan Cameron
  2024-09-05 19:38   ` Fan Ni
  2 siblings, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-27 14:20 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:31 -0500
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Records.
> User space can use trace events for debugging of DC capacity changes.
> 
> Add DC trace points to the trace log.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

LGTM
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 24/25] tools/testing/cxl: Make event logs dynamic
  2024-08-16 14:44 ` [PATCH v3 24/25] tools/testing/cxl: Make event logs dynamic Ira Weiny
  2024-08-20 23:30   ` Dave Jiang
@ 2024-08-27 14:32   ` Jonathan Cameron
  2024-09-09 13:57     ` Ira Weiny
  1 sibling, 1 reply; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-27 14:32 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:32 -0500
Ira Weiny <ira.weiny@intel.com> wrote:

> The test event logs were created as static arrays as an easy way to mock
> events.  Dynamic Capacity Device (DCD) test support requires events be
> generated dynamically when extents are created or destroyed.
> 
> Modify the event log storage to be dynamically allocated.  Reuse the
> static event data to create the dynamic events in the new logs without
> inventing complex event injection for the previous tests.  Simplify the
> processing of the logs by using the event log array index as the handle.
> Add a lock to manage concurrency required when user space is allowed to
> control DCD extents
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Probably make sense to spinkle some guard() magic in here
to avoid all the places where you goto end of function to release the lock
> 
> ---
> Changes:
> [iweiny: rebase]
> ---
>  tools/testing/cxl/test/mem.c | 278 ++++++++++++++++++++++++++-----------------
>  1 file changed, 171 insertions(+), 107 deletions(-)
> 
> diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> index 129f179b0ac5..674fc7f086cd 100644
> --- a/tools/testing/cxl/test/mem.c
> +++ b/tools/testing/cxl/test/mem.c
> @@ -125,18 +125,27 @@ static struct {
>  
>  #define PASS_TRY_LIMIT 3
>  
> -#define CXL_TEST_EVENT_CNT_MAX 15
> +#define CXL_TEST_EVENT_CNT_MAX 17

Seems you added a couple more. Don't do that in a patch
just changing allocation approach.

I could find 1 but not sure where other one came from!



> -static void mes_add_event(struct mock_event_store *mes,
> +/* Add the event or free it on 'overflow' */
> +static void mes_add_event(struct cxl_mockmem_data *mdata,
>  			  enum cxl_event_log_type log_type,
>  			  struct cxl_event_record_raw *event)
>  {
> +	struct device *dev = mdata->mds->cxlds.dev;
>  	struct mock_event_log *log;
> +	u16 handle;
>  
>  	if (WARN_ON(log_type >= CXL_EVENT_TYPE_MAX))
>  		return;
>  
> -	log = &mes->mock_logs[log_type];
> +	log = &mdata->mes.mock_logs[log_type];
>  
> -	if ((log->nr_events + 1) > CXL_TEST_EVENT_CNT_MAX) {
> +	write_lock(&log->lock);
> +
> +	handle = log->next_handle;
> +	if ((handle + 1) == log->cur_handle) {
>  		log->nr_overflow++;
> -		log->overflow_reset = log->nr_overflow;
> -		return;
> +		dev_dbg(dev, "Overflowing %d\n", log_type);
> +		devm_kfree(dev, event);
> +		goto unlock;
>  	}
>  
> -	log->events[log->nr_events] = event;
> +	dev_dbg(dev, "Log %d; handle %u\n", log_type, handle);
> +	event->event.generic.hdr.handle = cpu_to_le16(handle);
> +	log->events[handle] = event;
> +	event_inc_handle(&log->next_handle);
>  	log->nr_events++;
> +
> +unlock:
> +	write_unlock(&log->lock);
> +}
> +

>  
>  /*
> @@ -233,8 +254,8 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  {
>  	struct cxl_get_event_payload *pl;
>  	struct mock_event_log *log;
> -	u16 nr_overflow;
>  	u8 log_type;
> +	u16 handle;
>  	int i;
>  
>  	if (cmd->size_in != sizeof(log_type))
> @@ -254,29 +275,39 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  	memset(cmd->payload_out, 0, struct_size(pl, records, 0));
>  
>  	log = event_find_log(dev, log_type);
> -	if (!log || event_log_empty(log))
> +	if (!log)
>  		return 0;
>  
>  	pl = cmd->payload_out;
>  
> -	for (i = 0; i < ret_limit && !event_log_empty(log); i++) {
> -		memcpy(&pl->records[i], event_get_current(log),
> -		       sizeof(pl->records[i]));
> -		pl->records[i].event.generic.hdr.handle =
> -				event_get_cur_event_handle(log);
> -		log->cur_idx++;
> +	read_lock(&log->lock);
> +
> +	handle = log->cur_handle;
> +	dev_dbg(dev, "Get log %d handle %u next %u\n",
> +		log_type, handle, log->next_handle);
> +	for (i = 0;
> +	     i < ret_limit && handle != log->next_handle;
As below, maybe combine 2 lines above into 1.


> +	     i++, event_inc_handle(&handle)) {
> +		struct cxl_event_record_raw *cur;
> +
> +		cur = log->events[handle];
> +		dev_dbg(dev, "Sending event log %d handle %d idx %u\n",
> +			log_type, le16_to_cpu(cur->event.generic.hdr.handle),
> +			handle);
> +		memcpy(&pl->records[i], cur, sizeof(pl->records[i]));
> +		pl->records[i].event.generic.hdr.handle = cpu_to_le16(handle);
>  	}
>  
>  	cmd->size_out = struct_size(pl, records, i);
>  	pl->record_count = cpu_to_le16(i);
> -	if (!event_log_empty(log))
> +	if (log->nr_events > i)
>  		pl->flags |= CXL_GET_EVENT_FLAG_MORE_RECORDS;
>  
>  	if (log->nr_overflow) {
>  		u64 ns;
>  
>  		pl->flags |= CXL_GET_EVENT_FLAG_OVERFLOW;
> -		pl->overflow_err_count = cpu_to_le16(nr_overflow);
> +		pl->overflow_err_count = cpu_to_le16(log->nr_overflow);
>  		ns = ktime_get_real_ns();
>  		ns -= 5000000000; /* 5s ago */
>  		pl->first_overflow_timestamp = cpu_to_le64(ns);
> @@ -285,16 +316,17 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  		pl->last_overflow_timestamp = cpu_to_le64(ns);
>  	}
>  
> +	read_unlock(&log->lock);
Another one maybe for guard()

>  	return 0;
>  }
>  
>  static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  {
>  	struct cxl_mbox_clear_event_payload *pl = cmd->payload_in;
> -	struct mock_event_log *log;
>  	u8 log_type = pl->event_log;
> +	struct mock_event_log *log;
> +	int nr, rc = 0;
>  	u16 handle;
> -	int nr;
>  
>  	if (log_type >= CXL_EVENT_TYPE_MAX)
>  		return -EINVAL;
> @@ -303,24 +335,23 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  	if (!log)
>  		return 0; /* No mock data in this log */
>  
> -	/*
> -	 * This check is technically not invalid per the specification AFAICS.
> -	 * (The host could 'guess' handles and clear them in order).
> -	 * However, this is not good behavior for the host so test it.
> -	 */
> -	if (log->clear_idx + pl->nr_recs > log->cur_idx) {
> -		dev_err(dev,
> -			"Attempting to clear more events than returned!\n");
> -		return -EINVAL;
> -	}
> +	write_lock(&log->lock);
Use a guard()?
>  
>  	/* Check handle order prior to clearing events */
> -	for (nr = 0, handle = event_get_clear_handle(log);
> -	     nr < pl->nr_recs;
> -	     nr++, handle++) {
> +	handle = log->cur_handle;
> +	for (nr = 0;
> +	     nr < pl->nr_recs && handle != log->next_handle;

I'd combine the two lines above.

> +	     nr++, event_inc_handle(&handle)) {
> +
> +		dev_dbg(dev, "Checking clear of %d handle %u plhandle %u\n",
> +			log_type, handle,
> +			le16_to_cpu(pl->handles[nr]));
> +
>  		if (handle != le16_to_cpu(pl->handles[nr])) {
> -			dev_err(dev, "Clearing events out of order\n");
> -			return -EINVAL;
> +			dev_err(dev, "Clearing events out of order %u %u\n",
> +				handle, le16_to_cpu(pl->handles[nr]));
> +			rc = -EINVAL;
> +			goto unlock;
>  		}
>  	}
>  
> @@ -328,25 +359,12 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
>  		log->nr_overflow = 0;
>  
>  	/* Clear events */
> -	log->clear_idx += pl->nr_recs;
> -	return 0;
> -}

>  
>  struct cxl_event_record_raw maint_needed = {
> @@ -475,8 +493,27 @@ static int mock_set_timestamp(struct cxl_dev_state *cxlds,
>  	return 0;
>  }
>  

> +static void cxl_mock_add_event_logs(struct cxl_mockmem_data *mdata)
>  {
> +	struct mock_event_store *mes = &mdata->mes;
> +	struct device *dev = mdata->mds->cxlds.dev;
> +
>  	put_unaligned_le16(CXL_GMER_VALID_CHANNEL | CXL_GMER_VALID_RANK,
>  			   &gen_media.rec.media_hdr.validity_flags);
>  
> @@ -484,43 +521,60 @@ static void cxl_mock_add_event_logs(struct mock_event_store *mes)
>  			   CXL_DER_VALID_BANK | CXL_DER_VALID_COLUMN,
>  			   &dram.rec.media_hdr.validity_flags);
>  
> -	mes_add_event(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
> -	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
> +	dev_dbg(dev, "Generating fake event logs %d\n",
> +		CXL_EVENT_TYPE_INFO);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_INFO, &maint_needed);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
>  		      (struct cxl_event_record_raw *)&gen_media);
> -	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
>  		      (struct cxl_event_record_raw *)&mem_module);
>  	mes->ev_status |= CXLDEV_EVENT_STATUS_INFO;
>  
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> +	dev_dbg(dev, "Generating fake event logs %d\n",
> +		CXL_EVENT_TYPE_FAIL);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &maint_needed);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
> +		      (struct cxl_event_record_raw *)&mem_module);

So this one is new?  I can't spot the other one...


> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
>  		      (struct cxl_event_record_raw *)&dram);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
>  		      (struct cxl_event_record_raw *)&gen_media);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
>  		      (struct cxl_event_record_raw *)&mem_module);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
>  		      (struct cxl_event_record_raw *)&dram);
>  	/* Overflow this log */
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
>  	mes->ev_status |= CXLDEV_EVENT_STATUS_FAIL;
>  
> -	mes_add_event(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
> -	mes_add_event(mes, CXL_EVENT_TYPE_FATAL,
> +	dev_dbg(dev, "Generating fake event logs %d\n",
> +		CXL_EVENT_TYPE_FATAL);
The dev_dbg() fine but not really part of making it dynamic, so adds
a bit of noise. Maybe not worth splitting out though.
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL, &hardware_replace);
> +	add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL,
>  		      (struct cxl_event_record_raw *)&dram);
>  	mes->ev_status |= CXLDEV_EVENT_STATUS_FATAL;
>  }



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 25/25] tools/testing/cxl: Add DC Regions to mock mem data
  2024-08-16 14:44 ` [PATCH v3 25/25] tools/testing/cxl: Add DC Regions to mock mem data Ira Weiny
@ 2024-08-27 14:39   ` Jonathan Cameron
  2024-09-09 14:08     ` Ira Weiny
  0 siblings, 1 reply; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-27 14:39 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Fri, 16 Aug 2024 09:44:33 -0500
Ira Weiny <ira.weiny@intel.com> wrote:

> cxl_test provides a good way to ensure quick smoke and regression
> testing.  The complexity of Dynamic Capacity (DC) extent processing as
> well as the complexity of the new sparse DAX regions can mostly be
> tested through cxl_test.  This includes management of sparse regions and
> DAX devices on those regions; the management of extent device lifetimes;
> and the processing of DCD events.
> 
> The only missing functionality from this test is actual interrupt
> processing.
> 
> Mock memory devices can easily mock DC information and manage fake
> extent data.
> 
> Define mock_dc_region information within the mock memory data.  Add
> sysfs entries on the mock device to inject and delete extents.
> 
> The inject format is <start>:<length>:<tag>:<more_flag>
> The delete format is <start>:<length>
> 
> Directly call the event irq callback to simulate irqs to process the
> test extents.
> 
> Add DC mailbox commands to the CEL and implement those commands.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Minor stuff inline.

Thanks,

Jonathan

> +static int mock_get_dc_config(struct device *dev,
> +			      struct cxl_mbox_cmd *cmd)
> +{
> +	struct cxl_mbox_get_dc_config_in *dc_config = cmd->payload_in;
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> +	u8 region_requested, region_start_idx, region_ret_cnt;
> +	struct cxl_mbox_get_dc_config_out *resp;
> +	int i;
> +
> +	region_requested = dc_config->region_count;
> +	if (region_requested > NUM_MOCK_DC_REGIONS)
> +		region_requested = NUM_MOCK_DC_REGIONS;

	region_requested = min(...)

> +
> +	if (cmd->size_out < struct_size(resp, region, region_requested))
> +		return -EINVAL;
> +
> +	memset(cmd->payload_out, 0, cmd->size_out);
> +	resp = cmd->payload_out;
> +
> +	region_start_idx = dc_config->start_region_index;
> +	region_ret_cnt = 0;
> +	for (i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
> +		if (i >= region_start_idx) {
> +			memcpy(&resp->region[region_ret_cnt],
> +				&mdata->dc_regions[i],
> +				sizeof(resp->region[region_ret_cnt]));
> +			region_ret_cnt++;
> +		}
> +	}
> +	resp->avail_region_count = NUM_MOCK_DC_REGIONS;
> +	resp->regions_returned = i;
> +
> +	dev_dbg(dev, "Returning %d dc regions\n", region_ret_cnt);
> +	return 0;
> +}



> +static void cxl_mock_mem_remove(struct platform_device *pdev)
> +{
> +	struct cxl_mockmem_data *mdata = dev_get_drvdata(&pdev->dev);
> +	struct cxl_memdev_state *mds = mdata->mds;
> +
> +	dev_dbg(mds->cxlds.dev, "Removing extents\n");

Clean this up as it doesn't do anything!

> +}
> +

> @@ -1689,14 +2142,261 @@ static ssize_t sanitize_timeout_store(struct device *dev,
>  
>  	return count;
>  }
> -
Grump ;)  No whitespace changes in a patch doing anything 'useful'.
>  static DEVICE_ATTR_RW(sanitize_timeout);
>  

> +static int log_dc_event(struct cxl_mockmem_data *mdata, enum dc_event type,
> +			u64 start, u64 length, const char *tag_str, bool more)
> +{
> +	struct device *dev = mdata->mds->cxlds.dev;
> +	struct cxl_test_dcd *dcd_event;
> +
> +	dev_dbg(dev, "mock device log event %d\n", type);
> +
> +	dcd_event = devm_kmemdup(dev, &dcd_event_rec_template,
> +				     sizeof(*dcd_event), GFP_KERNEL);
> +	if (!dcd_event)
> +		return -ENOMEM;
> +
> +	dcd_event->rec.flags = 0;
> +	if (more)
> +		dcd_event->rec.flags |= CXL_DCD_EVENT_MORE;
> +	dcd_event->rec.event_type = type;
> +	dcd_event->rec.extent.start_dpa = cpu_to_le64(start);
> +	dcd_event->rec.extent.length = cpu_to_le64(length);
> +	memcpy(dcd_event->rec.extent.tag, tag_str,
> +	       min(sizeof(dcd_event->rec.extent.tag),
> +		   strlen(tag_str)));
> +
> +	mes_add_event(mdata, CXL_EVENT_TYPE_DCD,
> +		      (struct cxl_event_record_raw *)dcd_event);
I guess this is where the missing event in previous patch come from.

Increment the number here, not back in that patch.

Jonathan


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents
  2024-08-27 12:08     ` Jonathan Cameron
@ 2024-08-27 16:02       ` Fan Ni
  0 siblings, 0 replies; 120+ messages in thread
From: Fan Ni @ 2024-08-27 16:02 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Fan Ni, ira.weiny, Dave Jiang, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm

On Tue, Aug 27, 2024 at 01:08:29PM +0100, Jonathan Cameron wrote:
> On Fri, 23 Aug 2024 14:32:32 -0700
> Fan Ni <nifan.cxl@gmail.com> wrote:
> 
> > On Fri, Aug 16, 2024 at 09:44:26AM -0500, ira.weiny@intel.com wrote:
> > > From: Navneet Singh <navneet.singh@intel.com>
> > > 
> > > A dynamic capacity device (DCD) sends events to signal the host for
> > > changes in the availability of Dynamic Capacity (DC) memory.  These
> > > events contain extents describing a DPA range and meta data for memory
> > > to be added or removed.  Events may be sent from the device at any time.
> > > 
> > > Three types of events can be signaled, Add, Release, and Force Release.
> > > 
> > > On add, the host may accept or reject the memory being offered.  If no
> > > region exists, or the extent is invalid, the extent should be rejected.
> > > Add extent events may be grouped by a 'more' bit which indicates those
> > > extents should be processed as a group.
> > > 
> > > On remove, the host can delay the response until the host is safely not
> > > using the memory.  If no region exists the release can be sent
> > > immediately.  The host may also release extents (or partial extents) at
> > > any time.  Thus the 'more' bit grouping of release events is of less
> > > value and can be ignored in favor of sending multiple release capacity
> > > responses for groups of release events.
> > > 
> > > Force removal is intended as a mechanism between the FM and the device
> > > and intended only when the host is unresponsive, out of sync, or
> > > otherwise broken.  Purposely ignore force removal events.
> > > 
> > > Regions are made up of one or more devices which may be surfacing memory
> > > to the host.  Once all devices in a region have surfaced an extent the
> > > region can expose a corresponding extent for the user to consume.
> > > Without interleaving a device extent forms a 1:1 relationship with the
> > > region extent.  Immediately surface a region extent upon getting a
> > > device extent.
> > > 
> > > Per the specification the device is allowed to offer or remove extents
> > > at any time.  However, anticipated use cases can expect extents to be
> > > offered, accepted, and removed in well defined chunks.
> > > 
> > > Simplify extent tracking with the following restrictions.
> > > 
> > > 	1) Flag for removal any extent which overlaps a requested
> > > 	   release range.
> > > 	2) Refuse the offer of extents which overlap already accepted
> > > 	   memory ranges.
> > > 	3) Accept again a range which has already been accepted by the
> > > 	   host.  (It is likely the device has an error because it
> > > 	   should already know that this range was accepted.  But from
> > > 	   the host point of view it is safe to acknowledge that
> > > 	   acceptance again.)
> > > 
> > > Management of the region extent devices must be synchronized with
> > > potential uses of the memory within the DAX layer.  Create region extent
> > > devices as children of the cxl_dax_region device such that the DAX
> > > region driver can co-drive them and synchronize with the DAX layer.
> > > Synchronization and management is handled in a subsequent patch.
> > > 
> > > Process DCD events and create region devices.
> > > 
> > > Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> > > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > >   
> > 
> > One minor change inline.
> Hi Fan,
> 
> Crop please.  I scanned past it 3 times when scrolling without noticing
> what you'd actually commented on.

Sure. I will crop in the future.
Thanks for the tips, Jonathan.

Fan

> 
> > > +/* See CXL 3.0 8.2.9.2.1.5 */  
> > 
> > Update the reference to reflect CXL 3.1.
> > 
> > Fan
> > 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-27  7:43               ` Petr Mladek
  2024-08-27 13:21                 ` Andy Shevchenko
@ 2024-08-27 21:44                 ` Ira Weiny
  1 sibling, 0 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-27 21:44 UTC (permalink / raw)
  To: Petr Mladek, Ira Weiny
  Cc: Andy Shevchenko, Dave Jiang, Fan Ni, Jonathan Cameron,
	Navneet Singh, Chris Mason, Josef Bacik, David Sterba,
	Steven Rostedt, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm

Petr Mladek wrote:
> On Mon 2024-08-26 16:17:52, Ira Weiny wrote:
> > Andy Shevchenko wrote:
> > > On Mon, Aug 26, 2024 at 03:23:50PM +0200, Petr Mladek wrote:
> > > > On Thu 2024-08-22 21:10:25, Andy Shevchenko wrote:
> > > > > On Thu, Aug 22, 2024 at 12:53:32PM -0500, Ira Weiny wrote:
> > > > > > Petr Mladek wrote:
> > > > > > > On Fri 2024-08-16 09:44:10, Ira Weiny wrote:
> > > 
> > > ...
> > > 
> > > > > > > > +	%par	[range 0x60000000-0x6fffffff] or
> > > > > > > 
> > > > > > > It seems that it is always 64-bit. It prints:
> > > > > > > 
> > > > > > > struct range {
> > > > > > > 	u64   start;
> > > > > > > 	u64   end;
> > > > > > > };
> > > > > > 
> > > > > > Indeed.  Thanks I should not have just copied/pasted.
> > > > > 
> > > > > With that said, I'm not sure the %pa is a good placeholder for this ('a' stands
> > > > > to "address" AFAIU). Perhaps this should go somewhere under %pr/%pR?
> > 
> > I'm speaking a bit for Dan here but also the logical way I thought of
> > things.
> > 
> > 1) %p does not dictate anything about the format of the data.  Rather
> >    indicates that what is passed is a pointer.  Because we are passing a
> >    pointer to a range struct %pXX makes sense.
> > 2) %pa indicates what follows is 'address'.  This was a bit of creative
> >    license because, as I said in the commit message most of the time
> >    struct range contains an address range.  So for this narrow use case it
> >    also makes sense.
> > 3) %par r for range.
> 
> Yes. I got it.
> 
> Well, is struct range really used for addresses?

Commonly yes.  But I agree with Andy that it is not always.

> It rather looks like
> a range of any 64-bit values.
> 
> > %p[rR] is taken.  %pra confuses things IMO.
> 
> Another variants might be %pr64 or %prange.
> 
> IMHO, there is no good solution. We are trying to find the least
> bad one. The meaning should be as obvious and as least confusing
> as possible.

Yep.

> 
> Honestly, I do not have a strong opinion. I kind of like %prange ;-)
> But I could live with all other variants, except for %pn mentioned below.
> 
> > > > The r/R in %pr/%pR actually stands for "resource".
> > > > 
> > > > But "%ra" really looks like a better choice than "%par". Both
> > > > "resource"  and "range" starts with 'r'. Also the struct resource
> > > > is printed as a range of values.
> > 
> > %r could be used I think.  But this breaks with the convention of passing a
> > pointer and how to interpret it.
> 
> How exactly does it break the convention, please?
> 
> Do you passing a pointer to struct range instead of a pointer to
> struct resource?

Yes a pointer is passed as the parameter.  This is what %p means AFAIU.
Then the modifier is applied to know what we are pointing to.

> 
> It should not be a big problem as long as the vsprintf() code is
> able to guess the right pointer type from the %pXX modifier.
> 
> > The other idea I had, mentioned in the commit
> > message was %pn.  Meaning passed by pointer 'raNge'.
> 
> This looks like the worst variant to me.

Fair enough.

> 
> > > Fine with me as long as it:
> > > 1) doesn't collide with %pa namespace
> > > 2) tries to deduplicate existing code as much as possible.
> > 
> > Andy, I'm not quite following how you expect to share the code between
> > resource_string() and range_string()?
> > 
> > There is very little duplicated code.  In fact with Petr's suggestions and some
> > more work range_string() is quite simple:
> >
> > +static noinline_for_stack
> > +char *range_string(char *buf, char *end, const struct range *range,
> > +                     struct printf_spec spec, const char *fmt)
> > +{
> > +#define RANGE_DECODED_BUF_SIZE         ((2 * sizeof(struct range)) + 4)
> > +#define RANGE_PRINT_BUF_SIZE           sizeof("[range -]")
> > +       char sym[RANGE_DECODED_BUF_SIZE + RANGE_PRINT_BUF_SIZE];
> > +       char *p = sym, *pend = sym + sizeof(sym);
> > +
> > +       *p++ = '[';
> > +       p = string_nocheck(p, pend, "range ", default_str_spec);
> > +       p = special_hex_number(p, pend, range->start, sizeof(range->start));
> > +       *p++ = '-';
> > +       p = special_hex_number(p, pend, range->end, sizeof(range->end));
> > +       *p++ = ']';
> > +       *p = '\0';
> > +
> > +       return string_nocheck(buf, end, sym, spec);
> > +}
> 
> I agree that there is not much duplicated code in the end.
> 
> > Also this is the bulk of the patch except for documentation and the new
> > testing code.  [new patch below]
> > 
> > Am I missing your point somehow?  I considered cramming a struct range into a
> > struct resource to let resource_string() process the data.  But that would
> > involve creating a new IORESOURCE_* flag (not ideal) and also does not allow
> > for the larger u64 data in struct range should this be a 32 bit physical
> > address config.
> 
> This would be nasty. I believe that this is not what Andy meant.

Nope.

> 
> Best Regards,
> Petr
> 
> PS: I have vacation until the end of the week, so my next eventual
>     reaction would be delayed.

No hurry.  I'm still mucking around with it,
Ira

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-27 13:17               ` Andy Shevchenko
@ 2024-08-28  4:12                 ` Ira Weiny
  2024-08-28 13:50                   ` Andy Shevchenko
  0 siblings, 1 reply; 120+ messages in thread
From: Ira Weiny @ 2024-08-28  4:12 UTC (permalink / raw)
  To: Andy Shevchenko, Ira Weiny
  Cc: Petr Mladek, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Steven Rostedt,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

Andy Shevchenko wrote:
> On Mon, Aug 26, 2024 at 04:17:52PM -0500, Ira Weiny wrote:
> > Andy Shevchenko wrote:
> > > On Mon, Aug 26, 2024 at 03:23:50PM +0200, Petr Mladek wrote:
> > > > On Thu 2024-08-22 21:10:25, Andy Shevchenko wrote:
> > > > > On Thu, Aug 22, 2024 at 12:53:32PM -0500, Ira Weiny wrote:
> > > > > > Petr Mladek wrote:
> > > > > > > On Fri 2024-08-16 09:44:10, Ira Weiny wrote:
> 

[snip]

> > > > > 
> > > > > With that said, I'm not sure the %pa is a good placeholder for this ('a' stands
> > > > > to "address" AFAIU). Perhaps this should go somewhere under %pr/%pR?
> > 
> > I'm speaking a bit for Dan here but also the logical way I thought of
> > things.
> > 
> > 1) %p does not dictate anything about the format of the data.  Rather
> >    indicates that what is passed is a pointer.  Because we are passing a
> >    pointer to a range struct %pXX makes sense.
> 
> There is no objection to that.
> 
> > 2) %pa indicates what follows is 'address'.  This was a bit of creative
> >    license because, as I said in the commit message most of the time
> >    struct range contains an address range.  So for this narrow use case it
> >    also makes sense.
> 
> As in the discussion it was pointed out that struct range is always 64-bit,
> limiting it to the "address" is a wrong assumption as we are talking generic
> printing routine here. We don't know what users will be in the future on 32-bit
> platforms, or what data (semantically) is being held by this structure.
> 
> > 3) %par r for range.
> 
> I understand, but again struct range != address.

Agreed.

> 
> > %p[rR] is taken.
> > %pra confuses things IMO.
> 
> It doesn't confuse me. :-) But I believe Petr also has a rationale behind this
> proposal as he described earlier.

%pra it is then.

> 
> > > > The r/R in %pr/%pR actually stands for "resource".
> > > > 
> > > > But "%ra" really looks like a better choice than "%par". Both
> > > > "resource"  and "range" starts with 'r'. Also the struct resource
> > > > is printed as a range of values.
> > 
> > %r could be used I think.  But this breaks with the convention of passing a
> > pointer and how to interpret it.  The other idea I had, mentioned in the commit
> > message was %pn.  Meaning passed by pointer 'raNge'.
> 
> No, we can't use %r or anything else that is documented for the standard
> printf() format specifiers, otherwise you will get a compiler warning and
> basically it means no go.

I was not thrilled with %r anyway.

> 
> > I think that follows better than %r.  That would be another break from C99.
> > But we don't have to follow that.
> > 
> > > Fine with me as long as it:
> > > 1) doesn't collide with %pa namespace
> > > 2) tries to deduplicate existing code as much as possible.
> > 
> > Andy, I'm not quite following how you expect to share the code between
> > resource_string() and range_string()?
> > 
> > There is very little duplicated code.  In fact with Petr's suggestions and some
> > more work range_string() is quite simple:
> > 
> > +static noinline_for_stack
> > +char *range_string(char *buf, char *end, const struct range *range,
> > +                     struct printf_spec spec, const char *fmt)
> > +{
> > +#define RANGE_DECODED_BUF_SIZE         ((2 * sizeof(struct range)) + 4)
> > +#define RANGE_PRINT_BUF_SIZE           sizeof("[range -]")
> > +       char sym[RANGE_DECODED_BUF_SIZE + RANGE_PRINT_BUF_SIZE];
> > +       char *p = sym, *pend = sym + sizeof(sym);
> 
> 
> Missing check for pointer, but it's not that I wanted to tell.

No it was not missing.  It was checked in address_val() already.  However, with
%pra I'll have to add it in.

> 
> > +       *p++ = '[';
> > +       p = string_nocheck(p, pend, "range ", default_str_spec);
> 
> Hmm... %pr uses str_spec, what the difference can be here?

str_spec is designed for variable length strings which are used based on the
struct resource flags.  Struct range does not vary so default_str_spec works.

> 
> > +       p = special_hex_number(p, pend, range->start, sizeof(range->start));
> > +       *p++ = '-';
> > +       p = special_hex_number(p, pend, range->end, sizeof(range->end));
> 
> This is basically the copy of %pr implementation.

Only at a very basic level.  struct resource has a variable spec while struct
range does not.  This causes complexity to make the code the same.

> 
> 	p = number(p, pend, res->start, *specp);
> 	if (res->start != res->end) {
> 		*p++ = '-';
> 		p = number(p, pend, res->end, *specp);
> 	}
> 
> Would it be possible to unify? I think so, but it requires a bit of thinking.

Not much thinking.  But the issue is that they are not close enough to justify
the extra complexity IMHO.

Making the outputs match with a common function takes 13 lines of code[1]
including the declaration of a print specification which, as this thread
already showed, is non-trivial to understand.

__Also__ this is currently crashing on me and I can't figure out why.

$ git diff --stat
 lib/vsprintf.c | 32 ++++++++++++++++++++++++--------
 1 file changed, 24 insertions(+), 8 deletions(-)


OTOH to force a unified output, only takes 2 lines of duplicated code.[2]  This
is a very minor expense of duplicate code which is much easier to follow.

$ git diff --stat
 lib/vsprintf.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)


> 
> That's why testing is very important in this kind of generic code.

Yep.  But the struct resource test was stubbed out.  I've added some basic
ones.  But there are many more variations of struct resource prints.  I'm not
sure I've not broken them.

> 
> > +       *p++ = ']';
> > +       *p = '\0';
> > +
> > +       return string_nocheck(buf, end, sym, spec);
> > +}
> > 
> > Also this is the bulk of the patch except for documentation and the new
> > testing code.  [new patch below]
> > 
> > Am I missing your point somehow?
> 
> See above.
> 
> > I considered cramming a struct range into a
> > struct resource to let resource_string() process the data.  But that would
> > involve creating a new IORESOURCE_* flag (not ideal) and also does not allow
> > for the larger u64 data in struct range should this be a 32 bit physical
> > address config.
> 
> No, that's not what I was expecting.

Good.

> 
> > Most importantly that would not be much less code AFAICT.
> 
> ...
> 
> > +       %par    [range 0x0000000060000000-0x000000006fffffff]
> 
> I still think this is not okay to use %pa namespace.

Agreed.  Lets go with %pra

> 
> ...
> 
> > +static void __init
> > +struct_range(void)
> > +{
> > +       struct range test_range = {
> > +               .start = 0xc0ffee00ba5eba11,
> > +               .end = 0xc0ffee00ba5eba11,
> > +       };
> > +
> > +       test("[range 0xc0ffee00ba5eba11-0xc0ffee00ba5eba11]",
> > +            "%par", &test_range);
> > +
> > +       test_range = (struct range) {
> > +               .start = 0xc0ffee,
> > +               .end = 0xba5eba11,
> > +       };
> > +       test("[range 0x0000000000c0ffee-0x00000000ba5eba11]",
> > +            "%par", &test_range);
> 
> Case when start == end?

Yes, that is the 1st case.

> Case when end < start?

I had no intention of having the output dictated by the values.

	test("[range 0x0000000000c0ffee-0x0000000000c0ffee]",
and
	test("[range 0x00000000ba5eba11-0x0000000000c0ffee]",

... are acceptable to me.

> 
> > +}
> 
> ...
> 
> > +       *p++ = '[';
> > +       p = string_nocheck(p, pend, "range ", default_str_spec);
> > +       p = special_hex_number(p, pend, range->start, sizeof(range->start));
> > +       *p++ = '-';
> > +       p = special_hex_number(p, pend, range->end, sizeof(range->end));
> > +       *p++ = ']';
> > +       *p = '\0';
> 
> As per above comments.


Thanks for the review,
Ira

[1] sample diff

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 6be1ca13790c..84757e75e047 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -1039,6 +1039,18 @@ static const struct printf_spec default_dec04_spec = {
        .flags = ZEROPAD,
 };
 
+static noinline_for_stack
+char *hex_range(char *buf, char *end, u64 start_val, u64 end_val,
+               struct printf_spec spec)
+{
+       buf = number(buf, end, start_val, spec);
+       if (start_val != end_val) {
+               *buf++ = '-';
+               buf = number(buf, end, end_val, spec);
+       }
+       return buf;
+}
+
 static noinline_for_stack
 char *resource_string(char *buf, char *end, struct resource *res,
                      struct printf_spec spec, const char *fmt)
@@ -1115,11 +1127,7 @@ char *resource_string(char *buf, char *end, struct resource *res,
                p = string_nocheck(p, pend, "size ", str_spec);
                p = number(p, pend, resource_size(res), *specp);
        } else {
-               p = number(p, pend, res->start, *specp);
-               if (res->start != res->end) {
-                       *p++ = '-';
-                       p = number(p, pend, res->end, *specp);
-               }
+               p = hex_range(p, pend, res->start, res->end, *specp);
        }
        if (decode) {
                if (res->flags & IORESOURCE_MEM_64)
@@ -1149,11 +1157,19 @@ char *range_string(char *buf, char *end, const struct range *range,
        char sym[RANGE_DECODED_BUF_SIZE + RANGE_PRINT_BUF_SIZE];
        char *p = sym, *pend = sym + sizeof(sym);
 
+       struct printf_spec range_spec = {
+               spec.field_width = 2 + 2 * sizeof(range->start), /* 0x + 2 * u64 */
+               spec.flags = SPECIAL | SMALL | ZEROPAD,
+               spec.base = 16,
+               spec.precision = -1,
+       };
+
+       if (check_pointer(&buf, end, range, spec))
+               return buf;
+
        *p++ = '[';
        p = string_nocheck(p, pend, "range ", default_str_spec);
-       p = special_hex_number(p, pend, range->start, sizeof(range->start));
-       *p++ = '-';
-       p = special_hex_number(p, pend, range->end, sizeof(range->end));
+       p = hex_range(p, pend, range->start, range->end, range_spec);
        *p++ = ']';
        *p = '\0';
 



[2] sample diff

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index a754eefef252..e6870eb703a4 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -1149,11 +1149,16 @@ char *range_string(char *buf, char *end, const struct range *range,
        char sym[RANGE_DECODED_BUF_SIZE + RANGE_PRINT_BUF_SIZE];
        char *p = sym, *pend = sym + sizeof(sym);

+       if (check_pointer(&buf, end, range, spec))
+               return buf;
+
        *p++ = '[';
        p = string_nocheck(p, pend, "range ", default_str_spec);
        p = special_hex_number(p, pend, range->start, sizeof(range->start));
-       *p++ = '-';
-       p = special_hex_number(p, pend, range->end, sizeof(range->end));
+       if (range->start != range->end) {
+               *p++ = '-';
+               p = special_hex_number(p, pend, range->end, sizeof(range->end));
+       }
        *p++ = ']';
        *p = '\0';

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/25] printk: Add print format (%par) for struct range
  2024-08-28  4:12                 ` Ira Weiny
@ 2024-08-28 13:50                   ` Andy Shevchenko
  0 siblings, 0 replies; 120+ messages in thread
From: Andy Shevchenko @ 2024-08-28 13:50 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Petr Mladek, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Steven Rostedt,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

On Tue, Aug 27, 2024 at 11:12:47PM -0500, Ira Weiny wrote:
> Andy Shevchenko wrote:
> > On Mon, Aug 26, 2024 at 04:17:52PM -0500, Ira Weiny wrote:
> > > Andy Shevchenko wrote:
> > > > On Mon, Aug 26, 2024 at 03:23:50PM +0200, Petr Mladek wrote:
> > > > > On Thu 2024-08-22 21:10:25, Andy Shevchenko wrote:
> > > > > > On Thu, Aug 22, 2024 at 12:53:32PM -0500, Ira Weiny wrote:
> > > > > > > Petr Mladek wrote:
> > > > > > > > On Fri 2024-08-16 09:44:10, Ira Weiny wrote:

[snip]

> > > +char *range_string(char *buf, char *end, const struct range *range,
> > > +                     struct printf_spec spec, const char *fmt)
> > > +{
> > > +#define RANGE_DECODED_BUF_SIZE         ((2 * sizeof(struct range)) + 4)
> > > +#define RANGE_PRINT_BUF_SIZE           sizeof("[range -]")
> > > +       char sym[RANGE_DECODED_BUF_SIZE + RANGE_PRINT_BUF_SIZE];
> > > +       char *p = sym, *pend = sym + sizeof(sym);
> > 
> > Missing check for pointer, but it's not that I wanted to tell.
> 
> No it was not missing.  It was checked in address_val() already.  However, with
> %pra I'll have to add it in.

Ah, I haven't noticed the address_val() implementation details, thanks for
elaborating!

> > > +       *p++ = '[';
> > > +       p = string_nocheck(p, pend, "range ", default_str_spec);
> > 
> > Hmm... %pr uses str_spec, what the difference can be here?
> 
> str_spec is designed for variable length strings which are used based on the
> struct resource flags.  Struct range does not vary so default_str_spec works.

Okay, makes sense.

> > > +       p = special_hex_number(p, pend, range->start, sizeof(range->start));
> > > +       *p++ = '-';
> > > +       p = special_hex_number(p, pend, range->end, sizeof(range->end));
> > 
> > This is basically the copy of %pr implementation.
> 
> Only at a very basic level.  struct resource has a variable spec while struct
> range does not.  This causes complexity to make the code the same.

Fair enough, that's why I said "as much as possible to deduplicate". If you
think this is not worth it, let's do without an additional complications then.

> > 	p = number(p, pend, res->start, *specp);
> > 	if (res->start != res->end) {
> > 		*p++ = '-';
> > 		p = number(p, pend, res->end, *specp);
> > 	}
> > 
> > Would it be possible to unify? I think so, but it requires a bit of thinking.
> 
> Not much thinking.  But the issue is that they are not close enough to justify
> the extra complexity IMHO.

Okay!

> Making the outputs match with a common function takes 13 lines of code[1]
> including the declaration of a print specification which, as this thread
> already showed, is non-trivial to understand.

> __Also__ this is currently crashing on me and I can't figure out why.
> 
> $ git diff --stat
>  lib/vsprintf.c | 32 ++++++++++++++++++++++++--------
>  1 file changed, 24 insertions(+), 8 deletions(-)
> 
> OTOH to force a unified output, only takes 2 lines of duplicated code.[2]  This
> is a very minor expense of duplicate code which is much easier to follow.
> 
> $ git diff --stat
>  lib/vsprintf.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)

Yep, got it.

> > That's why testing is very important in this kind of generic code.
> 
> Yep.  But the struct resource test was stubbed out.  I've added some basic
> ones.  But there are many more variations of struct resource prints.  I'm not
> sure I've not broken them.

Yeah, so make it then separated branches for %pr and %pra. You will take the
correct argument type in each of them. There are existing examples there.

Probably an initial 'r'/'R' parsing should be moved to pointer().

> > > +       *p++ = ']';
> > > +       *p = '\0';
> > > +
> > > +       return string_nocheck(buf, end, sym, spec);
> > > +}

...

> > > +       struct range test_range = {
> > > +               .start = 0xc0ffee00ba5eba11,
> > > +               .end = 0xc0ffee00ba5eba11,
> > > +       };
> > > +
> > > +       test("[range 0xc0ffee00ba5eba11-0xc0ffee00ba5eba11]",
> > > +            "%par", &test_range);
> > > +
> > > +       test_range = (struct range) {
> > > +               .start = 0xc0ffee,
> > > +               .end = 0xba5eba11,
> > > +       };
> > > +       test("[range 0x0000000000c0ffee-0x00000000ba5eba11]",
> > > +            "%par", &test_range);
> > 
> > Case when start == end?
> 
> Yes, that is the 1st case.

Thumb up!

> > Case when end < start?
> 
> I had no intention of having the output dictated by the values.
> 
> 	test("[range 0x0000000000c0ffee-0x0000000000c0ffee]",
> and
> 	test("[range 0x00000000ba5eba11-0x0000000000c0ffee]",
> 
> ... are acceptable to me.

But it seems the %pr in the first case doesn't do range, just a single value,
which makes sense to me (and this thread proved it) to avoid needless pedantic
checking of each value. It means that at a glance you may tell start == end.
Not sure about end < start case, but the point is just let's make it mimicing
%pr behaviour.

...

> +static noinline_for_stack
> +char *hex_range(char *buf, char *end, u64 start_val, u64 end_val,
> +               struct printf_spec spec)
> +{
> +       buf = number(buf, end, start_val, spec);
> +       if (start_val != end_val) {
> +               *buf++ = '-';
> +               buf = number(buf, end, end_val, spec);
> +       }
> +       return buf;
> +}
> +
>  static noinline_for_stack
>  char *resource_string(char *buf, char *end, struct resource *res,
>                       struct printf_spec spec, const char *fmt)
> @@ -1115,11 +1127,7 @@ char *resource_string(char *buf, char *end, struct resource *res,
>                 p = string_nocheck(p, pend, "size ", str_spec);
>                 p = number(p, pend, resource_size(res), *specp);
>         } else {
> -               p = number(p, pend, res->start, *specp);
> -               if (res->start != res->end) {
> -                       *p++ = '-';
> -                       p = number(p, pend, res->end, *specp);
> -               }
> +               p = hex_range(p, pend, res->start, res->end, *specp);
>         }
>         if (decode) {
>                 if (res->flags & IORESOURCE_MEM_64)
> @@ -1149,11 +1157,19 @@ char *range_string(char *buf, char *end, const struct range *range,
>         char sym[RANGE_DECODED_BUF_SIZE + RANGE_PRINT_BUF_SIZE];
>         char *p = sym, *pend = sym + sizeof(sym);
>  
> +       struct printf_spec range_spec = {
> +               spec.field_width = 2 + 2 * sizeof(range->start), /* 0x + 2 * u64 */
> +               spec.flags = SPECIAL | SMALL | ZEROPAD,
> +               spec.base = 16,
> +               spec.precision = -1,
> +       };

But this can be deduplicated from special_hex_number(), no?
Something like

fill_special_hex_number_spec()
{
}

special_hex_number()
{
	fill_special_hex_number_spec();
}

special_hex_range()
{
	fill_special_hex_number_spec();
}

Would it be better?

> +       if (check_pointer(&buf, end, range, spec))
> +               return buf;
> +
>         *p++ = '[';
>         p = string_nocheck(p, pend, "range ", default_str_spec);
> -       p = special_hex_number(p, pend, range->start, sizeof(range->start));
> -       *p++ = '-';
> -       p = special_hex_number(p, pend, range->end, sizeof(range->end));
> +       p = hex_range(p, pend, range->start, range->end, range_spec);
>         *p++ = ']';
>         *p = '\0';

so, can you check if with the above implemented we can actually enforce unified
format for %pr and %pra?

...

> [2] sample diff

>         p = special_hex_number(p, pend, range->start, sizeof(range->start));
> -       *p++ = '-';
> -       p = special_hex_number(p, pend, range->end, sizeof(range->end));
> +       if (range->start != range->end) {
> +               *p++ = '-';
> +               p = special_hex_number(p, pend, range->end, sizeof(range->end));
> +       }

There is a possibility to supply a callback, but it seems to me much
overcomplicated approach.

...

If we go the second way (the latter one here) can you add a comment in both
%pr/%pra code excerpts to point to each other that the format is unified
between them? It might help in the future to optimise the code if needed at
all.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 19/25] cxl/region/extent: Expose region extent information in sysfs
  2024-08-16 14:44 ` [PATCH v3 19/25] cxl/region/extent: Expose region extent information in sysfs ira.weiny
  2024-08-19 19:05   ` Dave Jiang
  2024-08-23 17:19   ` Jonathan Cameron
@ 2024-08-28 17:44   ` Fan Ni
  2 siblings, 0 replies; 120+ messages in thread
From: Fan Ni @ 2024-08-28 17:44 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm

On Fri, Aug 16, 2024 at 09:44:27AM -0500, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Extent information can be helpful to the user to coordinate memory usage
> with the external orchestrator and FM.
> 
> Expose the details of region extents by creating the following
> sysfs entries.
> 
>         /sys/bus/cxl/devices/dax_regionX/extentX.Y
>         /sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
>         /sys/bus/cxl/devices/dax_regionX/extentX.Y/length
>         /sys/bus/cxl/devices/dax_regionX/extentX.Y/tag
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Fan Ni <fan.ni@samsung.com>

> 
> ---
> Changes:
> [iweiny: split this out]
> [Jonathan: add documentation for extent sysfs]
> [Jonathan/djbw: s/label/tag]
> [Jonathan/djbw: treat tag as uuid]
> [djbw: use __ATTRIBUTE_GROUPS]
> [djbw: make tag invisible if it is empty]
> [djbw/iweiny: use conventional id names for extents; extentX.Y]
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl | 13 ++++++++
>  drivers/cxl/core/extent.c               | 58 +++++++++++++++++++++++++++++++++
>  2 files changed, 71 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 3a5ee88e551b..e97e6a73c960 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -599,3 +599,16 @@ Description:
>  		See Documentation/ABI/stable/sysfs-devices-node. access0 provides
>  		the number to the closest initiator and access1 provides the
>  		number to the closest CPU.
> +
> +What:		/sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
> +		/sys/bus/cxl/devices/dax_regionX/extentX.Y/length
> +		/sys/bus/cxl/devices/dax_regionX/extentX.Y/tag
> +Date:		October, 2024
> +KernelVersion:	v6.12
> +Contact:	linux-cxl@vger.kernel.org
> +Description:
> +		(RO) [For Dynamic Capacity regions only]  Extent offset and
> +		length within the region.  Users can use the extent information
> +		to create DAX devices on specific extents.  This is done by
> +		creating and destroying DAX devices in specific sequences and
> +		looking at the mappings created.
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> index 34456594cdc3..d7d526a51e2b 100644
> --- a/drivers/cxl/core/extent.c
> +++ b/drivers/cxl/core/extent.c
> @@ -6,6 +6,63 @@
>  
>  #include "core.h"
>  
> +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> +			   char *buf)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +
> +	return sysfs_emit(buf, "%#llx\n", region_extent->hpa_range.start);
> +}
> +static DEVICE_ATTR_RO(offset);
> +
> +static ssize_t length_show(struct device *dev, struct device_attribute *attr,
> +			   char *buf)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	u64 length = range_len(&region_extent->hpa_range);
> +
> +	return sysfs_emit(buf, "%#llx\n", length);
> +}
> +static DEVICE_ATTR_RO(length);
> +
> +static ssize_t tag_show(struct device *dev, struct device_attribute *attr,
> +			char *buf)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +
> +	return sysfs_emit(buf, "%pUb\n", &region_extent->tag);
> +}
> +static DEVICE_ATTR_RO(tag);
> +
> +static struct attribute *region_extent_attrs[] = {
> +	&dev_attr_offset.attr,
> +	&dev_attr_length.attr,
> +	&dev_attr_tag.attr,
> +	NULL,
> +};
> +
> +static uuid_t empty_tag = { 0 };
> +
> +static umode_t region_extent_visible(struct kobject *kobj,
> +				     struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct region_extent *region_extent = to_region_extent(dev);
> +
> +	if (a == &dev_attr_tag.attr &&
> +	    uuid_equal(&region_extent->tag, &empty_tag))
> +		return 0;
> +
> +	return a->mode;
> +}
> +
> +static const struct attribute_group region_extent_attribute_group = {
> +	.attrs = region_extent_attrs,
> +	.is_visible = region_extent_visible,
> +};
> +
> +__ATTRIBUTE_GROUPS(region_extent_attribute);
> +
>  static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
>  				 struct cxled_extent *ed_extent)
>  {
> @@ -44,6 +101,7 @@ static void region_extent_release(struct device *dev)
>  static const struct device_type region_extent_type = {
>  	.name = "extent",
>  	.release = region_extent_release,
> +	.groups = region_extent_attribute_groups,
>  };
>  
>  bool is_region_extent(struct device *dev)
> 
> -- 
> 2.45.2
> 

-- 
Fan Ni

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents
  2024-08-27 13:18   ` Jonathan Cameron
@ 2024-08-29 21:16     ` Ira Weiny
  2024-08-30  9:21       ` Jonathan Cameron
  0 siblings, 1 reply; 120+ messages in thread
From: Ira Weiny @ 2024-08-29 21:16 UTC (permalink / raw)
  To: Jonathan Cameron, ira.weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

Jonathan Cameron wrote:
> On Fri, 16 Aug 2024 09:44:26 -0500
> ira.weiny@intel.com wrote:
> 
> > From: Navneet Singh <navneet.singh@intel.com>
> > 

[snip]

> > +static int match_contains(struct device *dev, void *data)
> > +{
> > +	struct region_extent *region_extent = to_region_extent(dev);
> > +	struct match_data *md = data;
> > +	struct cxled_extent *entry;
> > +	unsigned long index;
> > +
> > +	if (!region_extent)
> > +		return 0;
> > +
> > +	xa_for_each(&region_extent->decoder_extents, index, entry) {
> > +		if (md->cxled == entry->cxled &&
> > +		    range_contains(&entry->dpa_range, md->new_range))
> > +			return true;
> As below, this returns int, so shouldn't be true or false.

Yep.  Thanks.

> 
> > +	}
> > +	return false;
> > +}
> 
> > +static int match_overlaps(struct device *dev, void *data)
> > +{
> > +	struct region_extent *region_extent = to_region_extent(dev);
> > +	struct match_data *md = data;
> > +	struct cxled_extent *entry;
> > +	unsigned long index;
> > +
> > +	if (!region_extent)
> > +		return 0;
> > +
> > +	xa_for_each(&region_extent->decoder_extents, index, entry) {
> > +		if (md->cxled == entry->cxled &&
> > +		    range_overlaps(&entry->dpa_range, md->new_range))
> > +			return true;
> 
> returns int, so returning true or false is odd.

Yep.

> 
> > +	}
> > +
> > +	return false;
> > +}
> 
> 
> > +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> > +{
> > +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> > +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> > +	struct cxl_endpoint_decoder *cxled;
> > +	struct range hpa_range, dpa_range;
> > +	struct cxl_region *cxlr;
> > +
> > +	dpa_range = (struct range) {
> > +		.start = start_dpa,
> > +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> > +	};
> > +
> > +	guard(rwsem_read)(&cxl_region_rwsem);
> > +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> > +	if (!cxlr) {
> > +		memdev_release_extent(mds, &dpa_range);
> 
> How does this condition happen?  Perhaps a comment needed.

Fair enough.  Proposed comment.

	/*
	 * No region can happen here for a few reasons:
	 *
	 * 1) Extents were accepted and the host crashed/rebooted
	 *    leaving them in an accepted state.  On reboot the host
	 *    has not yet created a region to own them.
	 *
	 * 2) Region destruction won the race with the device releasing
	 *    all the extents.  Here the release will be a duplicate of
	 *    the one sent via region destruction.
	 *
	 * 3) The device is confused and releasing extents for which no
	 *    region ever existed.
	 *
	 * In all these cases make sure the device knows we are not
	 * using this extent.
	 */

Item 2 is AFAICS ok with the spec.

> 
> > +		return -ENXIO;
> > +	}
> > +
> > +	calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
> > +
> > +	/* Remove region extents which overlap */
> > +	return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
> > +				     cxlr_rm_extent);
> > +}
> > +
> > +static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
> > +			   struct cxl_endpoint_decoder *cxled,
> > +			   struct cxled_extent *ed_extent)
> > +{
> > +	struct region_extent *region_extent;
> > +	struct range hpa_range;
> > +	int rc;
> > +
> > +	calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range);
> > +
> > +	region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag);
> > +	if (IS_ERR(region_extent))
> > +		return PTR_ERR(region_extent);
> > +
> > +	rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent, ed_extent,
> 
> I'd wrap that earlier to keep the line a bit shorter.

Done.

> 
> > +		       GFP_KERNEL);
> > +	if (rc) {
> > +		free_region_extent(region_extent);
> > +		return rc;
> > +	}
> > +
> > +	/* device model handles freeing region_extent */
> > +	return online_region_extent(region_extent);
> > +}
> > +
> > +/* Callers are expected to ensure cxled has been attached to a region */
> > +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> > +{
> > +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> > +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> > +	struct cxl_endpoint_decoder *cxled;
> > +	struct range ed_range, ext_range;
> > +	struct cxl_dax_region *cxlr_dax;
> > +	struct cxled_extent *ed_extent;
> > +	struct cxl_region *cxlr;
> > +	struct device *dev;
> > +
> > +	ext_range = (struct range) {
> > +		.start = start_dpa,
> > +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> > +	};
> > +
> > +	guard(rwsem_read)(&cxl_region_rwsem);
> > +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> > +	if (!cxlr)
> > +		return -ENXIO;
> > +
> > +	cxlr_dax = cxled->cxld.region->cxlr_dax;
> > +	dev = &cxled->cxld.dev;
> > +	ed_range = (struct range) {
> > +		.start = cxled->dpa_res->start,
> > +		.end = cxled->dpa_res->end,
> > +	};
> > +
> > +	dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n",
> > +		cxled->dpa_res, &ext_range);
> > +
> > +	if (!range_contains(&ed_range, &ext_range)) {
> > +		dev_err_ratelimited(dev,
> > +				    "DC extent DPA %par (%*phC) is not fully in ED %par\n",
> > +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> > +				    extent->tag, &ed_range);
> > +		return -ENXIO;
> > +	}
> > +
> > +	if (extents_contain(cxlr_dax, cxled, &ext_range))
> 
> This case confuses me. If the extents are already there I think we should
> error out or at least print something as that's very wrong.

I thought we discussed this in one of the community meetings that it would be
ok to accept these.  We could certainly print a warning here.

In all honestly I'm wondering if these restrictions are really needed anymore.
But at the same time I really, really, really don't think anyone has a good use
case to have to support these cases.  So I'm keeping the code simple for now.

> 
> > +		return 0;
> > +
> > +	if (extents_overlap(cxlr_dax, cxled, &ext_range))
> > +		return -ENXIO;
> > +
> > +	ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL);
> > +	if (!ed_extent)
> > +		return -ENOMEM;
> > +
> > +	ed_extent->cxled = cxled;
> > +	ed_extent->dpa_range = ext_range;
> > +	memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN);
> > +
> > +	dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range,
> > +		CXL_EXTENT_TAG_LEN, ed_extent->tag);
> > +
> > +	return cxlr_add_extent(cxlr_dax, cxled, ed_extent);
> > +}
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 01a447aaa1b1..f629ad7488ac 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> >  }
> >  EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
> >  
> > +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> > +			       struct cxl_extent *extent)
> > +{
> > +	u64 start = le64_to_cpu(extent->start_dpa);
> > +	u64 length = le64_to_cpu(extent->length);
> > +	struct device *dev = mds->cxlds.dev;
> > +
> > +	struct range ext_range = (struct range){
> > +		.start = start,
> > +		.end = start + length - 1,
> > +	};
> > +
> > +	if (le16_to_cpu(extent->shared_extn_seq) != 0) {
> 
> That's not the 'main' way to tell if an extent is shared because
> we could have a single extent (so seq == 0).
> Should verify it's not in a DCD region that
> is shareable to make this decision.

Ah...  :-/

> 
> I've lost track on the region handling so maybe you already do
> this by not including those regions at all?

I don't think so.

I'll add the region check.  I see now why I glossed over this though.  The
shared nature of a DCD partition is defined in the DSMAS.

Is that correct?  Or am I missing something in the spec?

> 
> > +		dev_err_ratelimited(dev,
> > +				    "DC extent DPA %par (%*phC) can not be shared\n",
> > +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> > +				    extent->tag);
> > +		return -ENXIO;
> > +	}
> > +
> > +	/* Extents must not cross DC region boundary's */
> > +	for (int i = 0; i < mds->nr_dc_region; i++) {
> > +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> > +		struct range region_range = (struct range) {
> > +			.start = dcr->base,
> > +			.end = dcr->base + dcr->decode_len - 1,
> > +		};
> > +
> > +		if (range_contains(&region_range, &ext_range)) {
> > +			dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n",
> > +				&ext_range, i, start - dcr->base,
> > +				CXL_EXTENT_TAG_LEN, extent->tag);
> > +			return 0;
> > +		}
> > +	}
> > +
> > +	dev_err_ratelimited(dev,
> > +			    "DC extent DPA %par (%*phC) is not in any DC region\n",
> > +			    &ext_range, CXL_EXTENT_TAG_LEN, extent->tag);
> > +	return -ENXIO;
> > +}
> > +
> >  void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> >  			    enum cxl_event_log_type type,
> >  			    enum cxl_event_type event_type,
> > @@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> >  	return rc;
> >  }
> >  
> > +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> > +				struct xarray *extent_array, int cnt)
> > +{
> > +	struct cxl_mbox_dc_response *p;
> > +	struct cxl_mbox_cmd mbox_cmd;
> > +	struct cxl_extent *extent;
> > +	unsigned long index;
> > +	u32 pl_index;
> > +	int rc = 0;
> > +
> > +	size_t pl_size = struct_size(p, extent_list, cnt);
> > +	u32 max_extents = cnt;
> > +
> What is cnt is zero? All extents rejected so none in the
> extent_array. Need to send a zero extent response to reject
> them all IIRC.

yes.  I missed that thanks.

> 
> > +	/* May have to use more bit on response. */
> > +	if (pl_size > mds->payload_size) {
> > +		max_extents = (mds->payload_size - sizeof(*p)) /
> > +			      sizeof(struct updated_extent_list);
> > +		pl_size = struct_size(p, extent_list, max_extents);
> > +	}
> > +
> > +	struct cxl_mbox_dc_response *response __free(kfree) =
> > +						kzalloc(pl_size, GFP_KERNEL);
> > +	if (!response)
> > +		return -ENOMEM;
> > +
> > +	pl_index = 0;
> > +	xa_for_each(extent_array, index, extent) {
> > +
> > +		response->extent_list[pl_index].dpa_start = extent->start_dpa;
> > +		response->extent_list[pl_index].length = extent->length;
> > +		pl_index++;
> > +		response->extent_list_size = cpu_to_le32(pl_index);
> > +
> > +		if (pl_index == max_extents) {
> > +			mbox_cmd = (struct cxl_mbox_cmd) {
> > +				.opcode = opcode,
> > +				.size_in = struct_size(response, extent_list,
> > +						       pl_index),
> > +				.payload_in = response,
> > +			};
> > +
> > +			response->flags = 0;
> > +			if (pl_index < cnt)
> > +				response->flags &= CXL_DCD_EVENT_MORE;
> > +
> > +			rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > +			if (rc)
> > +				return rc;
> > +			pl_index = 0;
> > +		}
> > +	}
> > +
> > +	if (pl_index) {
> || !cnt 
> 
> I think so we send a nothing accepted message.

Yep.

> 
> > +		mbox_cmd = (struct cxl_mbox_cmd) {
> > +			.opcode = opcode,
> > +			.size_in = struct_size(response, extent_list,
> > +					       pl_index),
> > +			.payload_in = response,
> > +		};
> > +
> > +		response->flags = 0;
> > +		rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> 		if (rc)
> 			return rc;
> > +	}
> > +
> 
> return 0;  So that reader doesn't have to check what rc was in !pl_index
> case and avoids assigning rc right at the top.

Ah thanks.  That might have been left over from something previous.

> 
> 
> > +	return rc;
> > +}
> 
> 
> > +static int cxl_add_pending(struct cxl_memdev_state *mds)
> > +{
> > +	struct device *dev = mds->cxlds.dev;
> > +	struct cxl_extent *extent;
> > +	unsigned long index;
> > +	unsigned long cnt = 0;
> > +	int rc;
> > +
> > +	xa_for_each(&mds->pending_extents, index, extent) {
> > +		if (validate_add_extent(mds, extent)) {
> 
> 
> Add a comment here that not accepting an extent but
> accepting some or none means this one was rejected (I'd forgotten how
> that bit worked)

Ok yeah that may not be clear without reading the spec closely.

	/*
	 * Any extents which are to be rejected are omitted from
	 * the response.  An empty response means all are
	 * rejected.
	 */

> 
> > +			dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n",
> > +				le64_to_cpu(extent->start_dpa),
> > +				le64_to_cpu(extent->length));
> > +			xa_erase(&mds->pending_extents, index);
> > +			kfree(extent);
> > +			continue;
> > +		}
> > +		cnt++;
> > +	}
> > +	rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> > +				  &mds->pending_extents, cnt);
> > +	xa_for_each(&mds->pending_extents, index, extent) {
> > +		xa_erase(&mds->pending_extents, index);
> > +		kfree(extent);
> > +	}
> > +	return rc;
> > +}
> > +
> > +static int handle_add_event(struct cxl_memdev_state *mds,
> > +			    struct cxl_event_dcd *event)
> > +{
> > +	struct cxl_extent *tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
> > +	struct device *dev = mds->cxlds.dev;
> > +
> > +	if (!tmp)
> > +		return -ENOMEM;
> > +
> > +	memcpy(tmp, &event->extent, sizeof(*tmp));
> 
> kmemdup?

yep.

> 
> > +	if (xa_insert(&mds->pending_extents, (unsigned long)tmp, tmp,
> > +		      GFP_KERNEL)) {
> > +		kfree(tmp);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	if (event->flags & CXL_DCD_EVENT_MORE) {
> > +		dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
> > +		return 0;
> > +	}
> > +
> > +	/* extents are removed and free'ed in cxl_add_pending() */
> > +	return cxl_add_pending(mds);
> > +}
> 
> >  static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> >  				    enum cxl_event_log_type type)
> >  {
> > @@ -1044,9 +1287,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> >  		if (!nr_rec)
> >  			break;
> >  
> > -		for (i = 0; i < nr_rec; i++)
> > +		for (i = 0; i < nr_rec; i++) {
> >  			__cxl_event_trace_record(cxlmd, type,
> >  						 &payload->records[i]);
> > +			if (type == CXL_EVENT_TYPE_DCD) {
> Bit of a deep indent so maybe flip logic?
> 
> Logic wise it's a bit dubious as we might want to match other
> types in future though so up to you.

I was thinking more along these lines.  But the rc is unneeded.  That print
can be in the handle function.


Something like this:

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 88b823afe482..e86a483d80eb 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1231,16 +1231,17 @@ static char *cxl_dcd_evt_type_str(u8 type)
        return "<unknown>";
 }

-static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
+static void cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
                                        struct cxl_event_record_raw *raw_rec)
 {
        struct cxl_event_dcd *event = &raw_rec->event.dcd;
        struct cxl_extent *extent = &event->extent;
        struct device *dev = mds->cxlds.dev;
        uuid_t *id = &raw_rec->id;
+       int rc;

        if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID))
-               return -EINVAL;
+               return;

        dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n",
                cxl_dcd_evt_type_str(event->event_type),
@@ -1248,15 +1249,22 @@ static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,

        switch (event->event_type) {
        case DCD_ADD_CAPACITY:
-               return handle_add_event(mds, event);
+               rc = handle_add_event(mds, event);
+               break;
        case DCD_RELEASE_CAPACITY:
-               return cxl_rm_extent(mds, &event->extent);
+               rc = cxl_rm_extent(mds, &event->extent);
+               break;
        case DCD_FORCED_CAPACITY_RELEASE:
                dev_err_ratelimited(dev, "Forced release event ignored.\n");
-               return 0;
+               rc = 0;
+               break;
        default:
-               return -EINVAL;
+               rc = -EINVAL;
+               break;
        }
+
+       if (rc)
+               dev_err_ratelimited(dev, "dcd event failed: %d\n", rc);
 }

 static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
@@ -1297,13 +1305,9 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
                for (i = 0; i < nr_rec; i++) {
                        __cxl_event_trace_record(cxlmd, type,
                                                 &payload->records[i]);
-                       if (type == CXL_EVENT_TYPE_DCD) {
-                               rc = cxl_handle_dcd_event_records(mds,
-                                                                 &payload->records[i]);
-                               if (rc)
-                                       dev_err_ratelimited(dev, "dcd event failed: %d\n",
-                                                           rc);
-                       }
+                       if (type == CXL_EVENT_TYPE_DCD)
+                               cxl_handle_dcd_event_records(mds,
+                                                       &payload->records[i]);
                }

                if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
<end diff>

> 
> 			if (type != CXL_EVENT_TYPE_DCD)
> 				continue;
> 
> 			rc = 
> 
> > +				rc = cxl_handle_dcd_event_records(mds,
> > +								  &payload->records[i]);
> > +				if (rc)
> > +					dev_err_ratelimited(dev, "dcd event failed: %d\n",
> > +							    rc);
> > +			}
> > +		}
> >  
> 
> >  struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
> >  {
> >  	struct cxl_memdev_state *mds;
> > @@ -1628,6 +1892,8 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
> >  	mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
> >  	mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID;
> >  	mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID;
> > +	xa_init(&mds->pending_extents);
> > +	devm_add_action_or_reset(dev, clear_pending_extents, mds);
> 
> Why don't you need to check if this failed? Definitely seems unlikely
> to leave things in a good state. Unlikely to fail of course, but you never know.

yea good catch.

> 
> >  
> >  	return mds;
> >  }
> 
> > @@ -3090,6 +3091,8 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
> >  
> >  	dev = &cxlr_dax->dev;
> >  	cxlr_dax->cxlr = cxlr;
> > +	cxlr->cxlr_dax = cxlr_dax;
> > +	ida_init(&cxlr_dax->extent_ida);
> >  	device_initialize(dev);
> >  	lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
> >  	device_set_pm_not_required(dev);
> > @@ -3190,7 +3193,10 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
> >  static void cxlr_dax_unregister(void *_cxlr_dax)
> >  {
> >  	struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> > +	struct cxl_region *cxlr = cxlr_dax->cxlr;
> >  
> > +	cxlr->cxlr_dax = NULL;
> > +	cxlr_dax->cxlr = NULL;
> 
> cxlr_dax->cxlr was assigned before this patch. 
> 
> I'm not seeing any new checks on these being non null so why
> are the needed?  If there is a good reason for this then
> a comment would be useful.

I'm not sure anymore either.  Perhaps this was left over from an earlier
version.  Or was something I thought I would need that ended up getting
removed.  I'll test without this hunk and remove it if I can.

Thanks for the review,
Ira

[snip]

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 20/25] dax/bus: Factor out dev dax resize logic
  2024-08-27 13:26   ` Jonathan Cameron
@ 2024-08-29 21:36     ` Ira Weiny
  0 siblings, 0 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-29 21:36 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

Jonathan Cameron wrote:
> On Fri, 16 Aug 2024 09:44:28 -0500
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > Dynamic Capacity regions must limit dev dax resources to those areas
> > which have extents backing real memory.  Such DAX regions are dubbed
> > 'sparse' regions.  In order to manage where memory is available four
> > alternatives were considered:
> > 
> > 1) Create a single region resource child on region creation which
> >    reserves the entire region.  Then as extents are added punch holes in
> >    this reservation.  This requires new resource manipulation to punch
> >    the holes and still requires an additional iteration over the extent
> >    areas which may already have existing dev dax resources used.
> > 
> > 2) Maintain an ordered xarray of extents which can be queried while
> >    processing the resize logic.  The issue is that existing region->res
> >    children may artificially limit the allocation size sent to
> >    alloc_dev_dax_range().  IE the resource children can't be directly
> >    used in the resize logic to find where space in the region is.  This
> >    also poses a problem of managing the available size in 2 places.
> > 
> > 3) Maintain a separate resource tree with extents.  This option is the
> >    same as 2) but with the different data structure.  Most ideally there
> >    should be a unified representation of the resource tree not two places
> >    to look for space.
> > 
> > 4) Create region resource children for each extent.  Manage the dax dev
> >    resize logic in the same way as before but use a region child
> >    (extent) resource as the parents to find space within each extent.
> > 
> > Option 4 can leverage the existing resize algorithm to find space within
> > the extents.  It manages the available space in a singular resource tree
> > which is less complicated for finding space.
> > 
> > In preparation for this change, factor out the dev_dax_resize logic.
> > For static regions use dax_region->res as the parent to find space for
> > the dax ranges.  Future patches will use the same algorithm with
> > individual extent resources as the parent.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> I'm not 100% confident on this one, so will probably take another look
> before giving a tag.

Thanks.

> One trivial comment below.
> 
> 
> 
> > +static ssize_t dev_dax_resize(struct dax_region *dax_region,
> > +		struct dev_dax *dev_dax, resource_size_t size)
> > +{
> > +	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
> > +	resource_size_t dev_size = dev_dax_size(dev_dax);
> > +	struct device *dev = &dev_dax->dev;
> > +	resource_size_t alloc = 0;
> 
> No path in which this is not set before use.

fixed thanks
Ira

[snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 21/25] dax/region: Create resources on sparse DAX regions
  2024-08-27 14:12   ` Jonathan Cameron
@ 2024-08-29 21:54     ` Ira Weiny
  0 siblings, 0 replies; 120+ messages in thread
From: Ira Weiny @ 2024-08-29 21:54 UTC (permalink / raw)
  To: Jonathan Cameron, ira.weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

Jonathan Cameron wrote:
> On Fri, 16 Aug 2024 09:44:29 -0500
> ira.weiny@intel.com wrote:
> 
> > From: Navneet Singh <navneet.singh@intel.com>
> > 

[snip]

> > diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> > index d7d526a51e2b..103b0bec3a4a 100644
> > --- a/drivers/cxl/core/extent.c
> > +++ b/drivers/cxl/core/extent.c
> > @@ -271,20 +271,67 @@ static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
> >  	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
> >  }
> >  
> > +static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> > +			      struct region_extent *region_extent)
> > +{
> > +	struct cxl_dax_region *cxlr_dax;
> > +	struct device *dev;
> > +	int rc = 0;
> > +
> > +	cxlr_dax = cxlr->cxlr_dax;
> > +	dev = &cxlr_dax->dev;
> > +	dev_dbg(dev, "Trying notify: type %d HPA %par\n",
> > +		event, &region_extent->hpa_range);
> > +
> > +	/*
> > +	 * NOTE the lack of a driver indicates a notification has failed.  No
> > +	 * user space coordiantion was possible.
> > +	 */
> > +	device_lock(dev);
> 
> I'd use guard() for this as then can just return the notify result
> and drop local variable rc

Yep Dave already mentioned this and it is done.

> 
> 
> > +	if (dev->driver) {
> > +		struct cxl_driver *driver = to_cxl_drv(dev->driver);
> > +		struct cxl_notify_data notify_data = (struct cxl_notify_data) {
> > +			.event = event,
> > +			.region_extent = region_extent,
> > +		};
> > +
> > +		if (driver->notify) {
> > +			dev_dbg(dev, "Notify: type %d HPA %par\n",
> > +				event, &region_extent->hpa_range);
> > +			rc = driver->notify(dev, &notify_data);
> > +		}
> > +	}
> > +	device_unlock(dev);
> > +	return rc;
> > +}
> >
> > @@ -338,8 +390,20 @@ static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
> >  		return rc;
> >  	}
> >  
> > -	/* device model handles freeing region_extent */
> > -	return online_region_extent(region_extent);
> > +	rc = online_region_extent(region_extent);
> > +	/* device model handled freeing region_extent */
> > +	if (rc)
> > +		return rc;
> > +
> > +	rc = cxlr_notify_extent(cxlr_dax->cxlr, DCD_ADD_CAPACITY, region_extent);
> > +	/*
> > +	 * The region device was breifly live but DAX layer ensures it was not
> 
> briefly

Fixed Thanks

> 
> > +	 * used
> > +	 */
> > +	if (rc)
> > +		region_rm_extent(region_extent);	
> > +
> > +	return rc;
> >  }
> 
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 975860371d9f..f14b0cfa7edd 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> 
> > +EXPORT_SYMBOL_GPL(dax_region_add_resource);
> > +
> > +int dax_region_rm_resource(struct dax_region *dax_region,
> > +			   struct device *dev)
> > +{
> > +	struct dax_resource *dax_resource;
> > +
> > +	guard(rwsem_write)(&dax_region_rwsem);
> > +
> > +	dax_resource = dev_get_drvdata(dev);
> > +	if (!dax_resource)
> > +		return 0;
> > +
> > +	if (dax_resource->use_cnt)
> > +		return -EBUSY;
> > +
> > +	/* avoid races with users trying to use the extent */
> 
> Not obvious to me from local code, why does releasing the resource
> here avoid a race?  Perhaps the comment needs expanding.

We are under the dax_region_rwsem here.  So that avoids the users from seeing
the extent while the lower level code is going to remove it.

How about?

        /*
         * release the resource under dax_region_rwsem to avoid races with
         * users trying to use the extent
         */

> 
> > +	__dax_release_resource(dax_resource);
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_region_rm_resource);
> > +
> 
> 
> > +static ssize_t dev_dax_resize_sparse(struct dax_region *dax_region,
> > +				     struct dev_dax *dev_dax,
> > +				     resource_size_t to_alloc)
> > +{
> > +	struct dax_resource *dax_resource;
> > +	resource_size_t available_size;
> > +	struct device *extent_dev;
> > +	ssize_t alloc;
> > +
> > +	extent_dev = device_find_child(dax_region->dev, dax_region,
> > +				       find_free_extent);
> 
> There is a __free for put device and it will tidy this up a tiny bit.

And fix the bug...  :-/  Thanks.

> 
> > +	if (!extent_dev)
> > +		return 0;
> > +
> > +	dax_resource = dev_get_drvdata(extent_dev);
> > +	if (!dax_resource)
> > +		return 0;
> > +
> > +	available_size = dax_avail_size(dax_resource->res);
> > +	to_alloc = min(available_size, to_alloc);
> I'd put those two inline and skip the local variables unless
> they have more use in later patches.

Nope just made the line shorter and I tend to not embed calls like that but it
is fine your way too.  I'll change it.

> 
> 	alloc = __dev_dax_resize(dax_resources->res, dev_dax,
> 				 min(dax_avail_size(dax_resources->res), to_alloc),
> 				 dax_resource);
> 	
> 				
> > +	alloc = __dev_dax_resize(dax_resource->res, dev_dax, to_alloc, dax_resource);
> > +	if (alloc > 0)
> > +		dax_resource->use_cnt++;
> > +	put_device(extent_dev);
> > +	return alloc;
> > +}
> > +
> 
> > @@ -1494,8 +1679,14 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
> >  	device_initialize(dev);
> >  	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
> >  
> > +	if (is_sparse(dax_region) && data->size) {
> > +		dev_err(parent, "Sparse DAX region devices are created initially with 0 size");
> must be created initially with 0 size.
> 
> Otherwise this error message says that they are, so why is it an error?

Indeed!

> 
> > +		rc = -EINVAL;
> > +		goto err_id;
> > +	}
> > +
> >  	rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
> > -				 data->size);
> > +				 data->size, NULL);
> >  	if (rc)
> >  		goto err_range;
> >  
> 
> > diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> > index 367e86b1c22a..bf3b82b0120d 100644
> > --- a/drivers/dax/cxl.c
> > +++ b/drivers/dax/cxl.c
> > @@ -5,6 +5,60 @@
> 
> ...
> 
> > +static int cxl_dax_region_notify(struct device *dev,
> > +				 struct cxl_notify_data *notify_data)
> > +{
> > +	struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
> > +	struct dax_region *dax_region = dev_get_drvdata(dev);
> > +	struct region_extent *region_extent = notify_data->region_extent;
> > +
> > +	switch (notify_data->event) {
> > +	case DCD_ADD_CAPACITY:
> > +		return __cxl_dax_add_resource(dax_region, region_extent);
> > +	case DCD_RELEASE_CAPACITY:
> > +		return dax_region_rm_resource(dax_region, &region_extent->dev);
> > +	case DCD_FORCED_CAPACITY_RELEASE:
> > +	default:
> > +		dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
> > +			notify_data->event);
> > +		break;
> Might as well return here and not below.
> Makes it really really obvious this is the error path and currently the only
> one that hits the return statement.
> > +	}

ok

> > +
> > +	return -ENXIO;
> > +}
> 
> >  static int cxl_dax_region_probe(struct device *dev)
> >  {
> > @@ -24,14 +78,16 @@ static int cxl_dax_region_probe(struct device *dev)
> >  		flags |= IORESOURCE_DAX_SPARSE_CAP;
> >  
> >  	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> > -				      PMD_SIZE, flags);
> > +				      PMD_SIZE, flags, &sparse_ops);
> >  	if (!dax_region)
> >  		return -ENOMEM;
> >  
> > -	if (cxlr->mode == CXL_REGION_DC)
> > +	if (cxlr->mode == CXL_REGION_DC) {
> > +		device_for_each_child(&cxlr_dax->dev, dax_region,
> > +				      cxl_dax_add_resource);
> >  		/* Add empty seed dax device */
> >  		dev_size = 0;
> > -	else
> > +	} else
> 
> Coding style says that you need brackets for all branches if
> one needs them (as multiline).  Just above:
> https://www.kernel.org/doc/html/v4.10/process/coding-style.html#spaces
> 

Yep missed it.  thanks for the review!
Ira

[snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents
  2024-08-29 21:16     ` Ira Weiny
@ 2024-08-30  9:21       ` Jonathan Cameron
  0 siblings, 0 replies; 120+ messages in thread
From: Jonathan Cameron @ 2024-08-30  9:21 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

> > > +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> > > +{
> > > +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> > > +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> > > +	struct cxl_endpoint_decoder *cxled;
> > > +	struct range hpa_range, dpa_range;
> > > +	struct cxl_region *cxlr;
> > > +
> > > +	dpa_range = (struct range) {
> > > +		.start = start_dpa,
> > > +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> > > +	};
> > > +
> > > +	guard(rwsem_read)(&cxl_region_rwsem);
> > > +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> > > +	if (!cxlr) {
> > > +		memdev_release_extent(mds, &dpa_range);  
> > 
> > How does this condition happen?  Perhaps a comment needed.  
> 
> Fair enough.  Proposed comment.
> 
> 	/*
> 	 * No region can happen here for a few reasons:
> 	 *
> 	 * 1) Extents were accepted and the host crashed/rebooted
> 	 *    leaving them in an accepted state.  On reboot the host
> 	 *    has not yet created a region to own them.
> 	 *
> 	 * 2) Region destruction won the race with the device releasing
> 	 *    all the extents.  Here the release will be a duplicate of
> 	 *    the one sent via region destruction.
> 	 *
> 	 * 3) The device is confused and releasing extents for which no
> 	 *    region ever existed.
> 	 *
> 	 * In all these cases make sure the device knows we are not
> 	 * using this extent.
> 	 */
> 
> Item 2 is AFAICS ok with the spec.

I'm not sure I follow 2.  Why would device be releasing extents
if we haven't given them back? We aren't supporting the mess that
is force removal.

> 
> >   
> > > +		return -ENXIO;
> > > +	}
> > > +
> > > +	calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
> > > +
> > > +	/* Remove region extents which overlap */
> > > +	return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
> > > +				     cxlr_rm_extent);
> > > +}
> > > +
> > > +/* Callers are expected to ensure cxled has been attached to a region */
> > > +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> > > +{
> > > +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> > > +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> > > +	struct cxl_endpoint_decoder *cxled;
> > > +	struct range ed_range, ext_range;
> > > +	struct cxl_dax_region *cxlr_dax;
> > > +	struct cxled_extent *ed_extent;
> > > +	struct cxl_region *cxlr;
> > > +	struct device *dev;
> > > +
> > > +	ext_range = (struct range) {
> > > +		.start = start_dpa,
> > > +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> > > +	};
> > > +
> > > +	guard(rwsem_read)(&cxl_region_rwsem);
> > > +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> > > +	if (!cxlr)
> > > +		return -ENXIO;
> > > +
> > > +	cxlr_dax = cxled->cxld.region->cxlr_dax;
> > > +	dev = &cxled->cxld.dev;
> > > +	ed_range = (struct range) {
> > > +		.start = cxled->dpa_res->start,
> > > +		.end = cxled->dpa_res->end,
> > > +	};
> > > +
> > > +	dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n",
> > > +		cxled->dpa_res, &ext_range);
> > > +
> > > +	if (!range_contains(&ed_range, &ext_range)) {
> > > +		dev_err_ratelimited(dev,
> > > +				    "DC extent DPA %par (%*phC) is not fully in ED %par\n",
> > > +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> > > +				    extent->tag, &ed_range);
> > > +		return -ENXIO;
> > > +	}
> > > +
> > > +	if (extents_contain(cxlr_dax, cxled, &ext_range))  
> > 
> > This case confuses me. If the extents are already there I think we should
> > error out or at least print something as that's very wrong.  
> 
> I thought we discussed this in one of the community meetings that it would be
> ok to accept these.  We could certainly print a warning here.

A warning probably does the job of indicating that 'something' odd is going on.
A device should never resend an extent overlapping one it sent before, (assuming
no removal happened inbetween) so this should never happen, but who knows :(

> 
> In all honestly I'm wondering if these restrictions are really needed anymore.
> But at the same time I really, really, really don't think anyone has a good use
> case to have to support these cases.  So I'm keeping the code simple for now.

Fair enough.
> 
> >   
> > > +		return 0;
> > > +
> > > +	if (extents_overlap(cxlr_dax, cxled, &ext_range))
> > > +		return -ENXIO;
> > > +
> > > +	ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL);
> > > +	if (!ed_extent)
> > > +		return -ENOMEM;
> > > +
> > > +	ed_extent->cxled = cxled;
> > > +	ed_extent->dpa_range = ext_range;
> > > +	memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN);
> > > +
> > > +	dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range,
> > > +		CXL_EXTENT_TAG_LEN, ed_extent->tag);
> > > +
> > > +	return cxlr_add_extent(cxlr_dax, cxled, ed_extent);
> > > +}
> > > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > > index 01a447aaa1b1..f629ad7488ac 100644
> > > --- a/drivers/cxl/core/mbox.c
> > > +++ b/drivers/cxl/core/mbox.c
> > > @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> > >  }
> > >  EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
> > >  
> > > +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> > > +			       struct cxl_extent *extent)
> > > +{
> > > +	u64 start = le64_to_cpu(extent->start_dpa);
> > > +	u64 length = le64_to_cpu(extent->length);
> > > +	struct device *dev = mds->cxlds.dev;
> > > +
> > > +	struct range ext_range = (struct range){
> > > +		.start = start,
> > > +		.end = start + length - 1,
> > > +	};
> > > +
> > > +	if (le16_to_cpu(extent->shared_extn_seq) != 0) {  
> > 
> > That's not the 'main' way to tell if an extent is shared because
> > we could have a single extent (so seq == 0).
> > Should verify it's not in a DCD region that
> > is shareable to make this decision.  
> 
> Ah...  :-/
> 
> > 
> > I've lost track on the region handling so maybe you already do
> > this by not including those regions at all?  
> 
> I don't think so.
> 
> I'll add the region check.  I see now why I glossed over this though.  The
> shared nature of a DCD partition is defined in the DSMAS.
> 
> Is that correct?  Or am I missing something in the spec?

Yes. That's matches my understanding (I might also be missing something
of course :)


> > > +static int cxl_add_pending(struct cxl_memdev_state *mds)
> > > +{
> > > +	struct device *dev = mds->cxlds.dev;
> > > +	struct cxl_extent *extent;
> > > +	unsigned long index;
> > > +	unsigned long cnt = 0;
> > > +	int rc;
> > > +
> > > +	xa_for_each(&mds->pending_extents, index, extent) {
> > > +		if (validate_add_extent(mds, extent)) {  
> > 
> > 
> > Add a comment here that not accepting an extent but
> > accepting some or none means this one was rejected (I'd forgotten how
> > that bit worked)  
> 
> Ok yeah that may not be clear without reading the spec closely.
> 
> 	/*
> 	 * Any extents which are to be rejected are omitted from
> 	 * the response.  An empty response means all are
> 	 * rejected.
> 	 */

Perfect.

> 
> >   
> > > +			dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n",
> > > +				le64_to_cpu(extent->start_dpa),
> > > +				le64_to_cpu(extent->length));
> > > +			xa_erase(&mds->pending_extents, index);
> > > +			kfree(extent);
> > > +			continue;
> > > +		}
> > > +		cnt++;
> > > +	}
> > > +	rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> > > +				  &mds->pending_extents, cnt);
> > > +	xa_for_each(&mds->pending_extents, index, extent) {
> > > +		xa_erase(&mds->pending_extents, index);
> > > +		kfree(extent);
> > > +	}
> > > +	return rc;
> > > +}
> > > +

> > >  static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> > >  				    enum cxl_event_log_type type)
> > >  {
> > > @@ -1044,9 +1287,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> > >  		if (!nr_rec)
> > >  			break;
> > >  
> > > -		for (i = 0; i < nr_rec; i++)
> > > +		for (i = 0; i < nr_rec; i++) {
> > >  			__cxl_event_trace_record(cxlmd, type,
> > >  						 &payload->records[i]);
> > > +			if (type == CXL_EVENT_TYPE_DCD) {  
> > Bit of a deep indent so maybe flip logic?
> > 
> > Logic wise it's a bit dubious as we might want to match other
> > types in future though so up to you.  
> 
> I was thinking more along these lines.  But the rc is unneeded.  That print
> can be in the handle function.
> 
> 
> Something like this:
Looks good to me. (cut to save on scrolling!)

Jonathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 13/25] cxl/region: Add sparse DAX region support
  2024-08-16 14:44 ` [PATCH v3 13/25] cxl/region: Add sparse DAX region support ira.weiny
                     ` (2 preceding siblings ...)
  2024-08-23 16:59   ` Jonathan Cameron
@ 2024-09-03  2:15   ` Li, Ming4
  3 siblings, 0 replies; 120+ messages in thread
From: Li, Ming4 @ 2024-09-03  2:15 UTC (permalink / raw)
  To: ira.weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

On 8/16/2024 10:44 PM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
>
> Dynamic Capacity CXL regions must allow memory to be added or removed
> dynamically.  In addition to the quantity of memory available the
> location of the memory within a DC partition is dynamic based on the
> extents offered by a device.  CXL DAX regions must accommodate the
> sparseness of this memory in the management of DAX regions and devices.
>
> Introduce the concept of a sparse DAX region.  Add a create_dc_region()
> sysfs entry to create such regions.  Special case DC capable regions to
> create a 0 sized seed DAX device to maintain compatibility which
> requires a default DAX device to hold a region reference.
>
> Indicate 0 byte available capacity until such time that capacity is
> added.
>
> Sparse regions complicate the range mapping of dax devices.  There is no
> known use case for range mapping on sparse regions.  Avoid the
> complication by preventing range mapping of dax devices on sparse
> regions.
>
> Interleaving is deferred for now.  Add checks.
>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes:
> [Fan: use single function for dc region store]
> [djiang: avoid setting dev_size twice]
> [djbw: Check DCD support and interleave restriction on region creation]
> [iweiny: squash patch : dax/region: Prevent range mapping allocation on sparse regions]
> [iwieny: remove reviews]
> [iweiny: rebase to master]
> [iweiny: push sysfs version to 6.12]
> [iweiny: make cxled_to_mds inline]
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl | 22 ++++++++--------
>  drivers/cxl/core/core.h                 | 12 +++++++++
>  drivers/cxl/core/port.c                 |  1 +
>  drivers/cxl/core/region.c               | 46 +++++++++++++++++++++++++++++++--
>  drivers/dax/bus.c                       | 10 +++++++
>  drivers/dax/bus.h                       |  1 +
>  drivers/dax/cxl.c                       | 16 ++++++++++--
>  7 files changed, 93 insertions(+), 15 deletions(-)
>
[...]
> @@ -2185,8 +2191,13 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
>  			goto out;
>  		}
>  
> -		rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
> -				   TASK_INTERRUPTIBLE);
> +		cxled = to_cxl_endpoint_decoder(dev);
> +		if (cxlr->mode == CXL_REGION_DC &&
> +		    !cxl_dcd_supported(cxled_to_mds(cxled))) {
> +			dev_dbg(dev, "DCD unsupported\n");
> +			return -EINVAL;

need a 'goto out' here to dereference the device?


> +		}
> +		rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
>  out:
>  		put_device(dev);
>  	}
> @@ -2534,6 +2545,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
>  	switch (mode) {
>  	case CXL_REGION_RAM:
>  	case CXL_REGION_PMEM:
> +	case CXL_REGION_DC:
>  		break;
>  	default:
>  		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
> @@ -2587,6 +2599,20 @@ static ssize_t create_ram_region_store(struct device *dev,
>  }
>  DEVICE_ATTR_RW(create_ram_region);
>  
> +static ssize_t create_dc_region_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dc_region_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t len)
> +{
> +	return create_region_store(dev, buf, len, CXL_REGION_DC);
> +}
> +DEVICE_ATTR_RW(create_dc_region);
> +
>  static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>  			   char *buf)
>  {
> @@ -3168,6 +3194,11 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>  	struct device *dev;
>  	int rc;
>  
> +	if (cxlr->mode == CXL_REGION_DC && cxlr->params.interleave_ways != 1) {
> +		dev_err(&cxlr->dev, "Interleaving DC not supported\n");
> +		return -EINVAL;
> +	}
> +
>  	cxlr_dax = cxl_dax_region_alloc(cxlr);
>  	if (IS_ERR(cxlr_dax))
>  		return PTR_ERR(cxlr_dax);
> @@ -3260,6 +3291,16 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
>  		return ERR_PTR(-EINVAL);
>  
>  	mode = cxl_decoder_to_region_mode(cxled->mode);
> +	if (mode == CXL_REGION_DC) {
> +		if (!cxl_dcd_supported(cxled_to_mds(cxled))) {
> +			dev_err(&cxled->cxld.dev, "DCD unsupported\n");
> +			return ERR_PTR(-EINVAL);
> +		}
> +		if (cxled->cxld.interleave_ways != 1) {
> +			dev_err(&cxled->cxld.dev, "Interleaving and DCD not supported\n");
> +			return ERR_PTR(-EINVAL);
> +		}
> +	}
>  	do {
>  		cxlr = __create_region(cxlrd, mode,
>  				       atomic_read(&cxlrd->region_id));
> @@ -3467,6 +3508,7 @@ static int cxl_region_probe(struct device *dev)
>  	case CXL_REGION_PMEM:
>  		return devm_cxl_add_pmem_region(cxlr);
>  	case CXL_REGION_RAM:
> +	case CXL_REGION_DC:
>  		/*
>  		 * The region can not be manged by CXL if any portion of
>  		 * it is already online as 'System RAM'
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index fde29e0ad68b..d8cb5195a227 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -178,6 +178,11 @@ static bool is_static(struct dax_region *dax_region)
>  	return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
>  }
>  
> +static bool is_sparse(struct dax_region *dax_region)
> +{
> +	return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
> +}
> +
>  bool static_dev_dax(struct dev_dax *dev_dax)
>  {
>  	return is_static(dev_dax->region);
> @@ -301,6 +306,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
>  
>  	lockdep_assert_held(&dax_region_rwsem);
>  
> +	if (is_sparse(dax_region))
> +		return 0;
> +
>  	for_each_dax_region_resource(dax_region, res)
>  		size -= resource_size(res);
>  	return size;
> @@ -1373,6 +1381,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
>  		return 0;
>  	if (a == &dev_attr_mapping.attr && is_static(dax_region))
>  		return 0;
> +	if (a == &dev_attr_mapping.attr && is_sparse(dax_region))
> +		return 0;
>  	if ((a == &dev_attr_align.attr ||
>  	     a == &dev_attr_size.attr) && is_static(dax_region))
>  		return 0444;
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index cbbf64443098..783bfeef42cc 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -13,6 +13,7 @@ struct dax_region;
>  /* dax bus specific ioresource flags */
>  #define IORESOURCE_DAX_STATIC BIT(0)
>  #define IORESOURCE_DAX_KMEM BIT(1)
> +#define IORESOURCE_DAX_SPARSE_CAP BIT(2)
>  
>  struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>  		struct range *range, int target_node, unsigned int align,
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 9b29e732b39a..367e86b1c22a 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -13,19 +13,31 @@ static int cxl_dax_region_probe(struct device *dev)
>  	struct cxl_region *cxlr = cxlr_dax->cxlr;
>  	struct dax_region *dax_region;
>  	struct dev_dax_data data;
> +	resource_size_t dev_size;
> +	unsigned long flags;
>  
>  	if (nid == NUMA_NO_NODE)
>  		nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
>  
> +	flags = IORESOURCE_DAX_KMEM;
> +	if (cxlr->mode == CXL_REGION_DC)
> +		flags |= IORESOURCE_DAX_SPARSE_CAP;
> +
>  	dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> -				      PMD_SIZE, IORESOURCE_DAX_KMEM);
> +				      PMD_SIZE, flags);
>  	if (!dax_region)
>  		return -ENOMEM;
>  
> +	if (cxlr->mode == CXL_REGION_DC)
> +		/* Add empty seed dax device */
> +		dev_size = 0;
> +	else
> +		dev_size = range_len(&cxlr_dax->hpa_range);
> +
>  	data = (struct dev_dax_data) {
>  		.dax_region = dax_region,
>  		.id = -1,
> -		.size = range_len(&cxlr_dax->hpa_range),
> +		.size = dev_size,
>  		.memmap_on_memory = true,
>  	};
>  
>


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents
  2024-08-16 14:44 ` [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents ira.weiny
                     ` (2 preceding siblings ...)
  2024-08-27 13:18   ` Jonathan Cameron
@ 2024-09-03  6:37   ` Li, Ming4
  2024-09-05 19:30   ` Fan Ni
  4 siblings, 0 replies; 120+ messages in thread
From: Li, Ming4 @ 2024-09-03  6:37 UTC (permalink / raw)
  To: ira.weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

On 8/16/2024 10:44 PM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
>
> A dynamic capacity device (DCD) sends events to signal the host for
> changes in the availability of Dynamic Capacity (DC) memory.  These
> events contain extents describing a DPA range and meta data for memory
> to be added or removed.  Events may be sent from the device at any time.
>
> Three types of events can be signaled, Add, Release, and Force Release.
>
> On add, the host may accept or reject the memory being offered.  If no
> region exists, or the extent is invalid, the extent should be rejected.
> Add extent events may be grouped by a 'more' bit which indicates those
> extents should be processed as a group.
>
> On remove, the host can delay the response until the host is safely not
> using the memory.  If no region exists the release can be sent
> immediately.  The host may also release extents (or partial extents) at
> any time.  Thus the 'more' bit grouping of release events is of less
> value and can be ignored in favor of sending multiple release capacity
> responses for groups of release events.
>
> Force removal is intended as a mechanism between the FM and the device
> and intended only when the host is unresponsive, out of sync, or
> otherwise broken.  Purposely ignore force removal events.
>
> Regions are made up of one or more devices which may be surfacing memory
> to the host.  Once all devices in a region have surfaced an extent the
> region can expose a corresponding extent for the user to consume.
> Without interleaving a device extent forms a 1:1 relationship with the
> region extent.  Immediately surface a region extent upon getting a
> device extent.
>
> Per the specification the device is allowed to offer or remove extents
> at any time.  However, anticipated use cases can expect extents to be
> offered, accepted, and removed in well defined chunks.
>
> Simplify extent tracking with the following restrictions.
>
> 	1) Flag for removal any extent which overlaps a requested
> 	   release range.
> 	2) Refuse the offer of extents which overlap already accepted
> 	   memory ranges.
> 	3) Accept again a range which has already been accepted by the
> 	   host.  (It is likely the device has an error because it
> 	   should already know that this range was accepted.  But from
> 	   the host point of view it is safe to acknowledge that
> 	   acceptance again.)
>
> Management of the region extent devices must be synchronized with
> potential uses of the memory within the DAX layer.  Create region extent
> devices as children of the cxl_dax_region device such that the DAX
> region driver can co-drive them and synchronize with the DAX layer.
> Synchronization and management is handled in a subsequent patch.
>
> Process DCD events and create region devices.
>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes:
> [iweiny: combine this with the extent surface patches to better show the
>          lifetime extent objects in review]
> [iweiny: clean up commit message.]
> [iweiny: move extent verification of the 'read extents on region
>          creation' to this patch]
> [iweiny: Provide for a common path for extent realization between an add
> 	 event and adding existing extents.]
> [iweiny: Persist a check that an extent is within an endpoint decoder]
> [iweiny: reduce exported and non-static calls]
> [iweiny: use %par]
>
> 	<Combined comments from the old patches which were addressed>
>
> [Jonathan: implement the more bit with a simple algorithm which accepts
> 	   all extents it can.
> 	   Also include the response more bit to prevent payload
> 	   overflow]
> [Fan: Do not error if a contained extent is added.]
> [Jonathan: allocate ida after kzalloc]
> [iweiny: fix ida resource leak]
> [fan/djiang: remove unneeded memset]
> [djiang: fix indentation]
> [Jonathan: Fix indentation]
> [Jonathan/djbw: make tag a uuid]
> [djbw: create helper calc_hpa_range() straight away]
> [djbw: Allow for multiple cxled_extents per region_extent]
> [djbw: s/cxl_ed/cxled]
> [djbw: s/cxl_release_ed_extent/cxled_release_extent/]
> [djbw: s/reg_ext/region_extent/]
> [djbw: s/dc_extent/extent/]
> [Gregory/djbw: reject shared extents]
> [iweiny: predicate extent.c compile on CONFIG_CXL_REGION]
> ---
>  drivers/cxl/core/Makefile |   2 +-
>  drivers/cxl/core/core.h   |  13 ++
>  drivers/cxl/core/extent.c | 345 ++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/mbox.c   | 268 ++++++++++++++++++++++++++++++++++-
>  drivers/cxl/core/region.c |   6 +
>  drivers/cxl/cxl.h         |  52 ++++++-
>  drivers/cxl/cxlmem.h      |  26 ++++
>  include/linux/cxl-event.h |  32 +++++
>  tools/testing/cxl/Kbuild  |   3 +-
>  9 files changed, 743 insertions(+), 4 deletions(-)
[...]
> +
> +static bool extents_contain(struct cxl_dax_region *cxlr_dax,
> +			    struct cxl_endpoint_decoder *cxled,
> +			    struct range *new_range)
> +{
> +	struct device *extent_device;
> +	struct match_data md = {
> +		.cxled = cxled,
> +		.new_range = new_range,
> +	};
> +
> +	extent_device = device_find_child(&cxlr_dax->dev, &md, match_contains);

Is it better to use __free(put_device) here to drop below 'put_device(extent_device)'?


> +	if (!extent_device)
> +		return false;
> +
> +	put_device(extent_device);
> +	return true;
> +}
> +
> +static int match_overlaps(struct device *dev, void *data)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	struct match_data *md = data;
> +	struct cxled_extent *entry;
> +	unsigned long index;
> +
> +	if (!region_extent)
> +		return 0;
> +
> +	xa_for_each(&region_extent->decoder_extents, index, entry) {
> +		if (md->cxled == entry->cxled &&
> +		    range_overlaps(&entry->dpa_range, md->new_range))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static bool extents_overlap(struct cxl_dax_region *cxlr_dax,
> +			    struct cxl_endpoint_decoder *cxled,
> +			    struct range *new_range)
> +{
> +	struct device *extent_device;
> +	struct match_data md = {
> +		.cxled = cxled,
> +		.new_range = new_range,
> +	};
> +
> +	extent_device = device_find_child(&cxlr_dax->dev, &md, match_overlaps);

Same as above.


> +	if (!extent_device)
> +		return false;
> +
> +	put_device(extent_device);
> +	return true;
> +}
> +
> +static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
> +			   struct cxl_dax_region *cxlr_dax,
> +			   struct range *dpa_range,
> +			   struct range *hpa_range)
> +{
> +	resource_size_t dpa_offset, hpa;
> +
> +	dpa_offset = dpa_range->start - cxled->dpa_res->start;
> +	hpa = cxled->cxld.hpa_range.start + dpa_offset;
> +
> +	hpa_range->start = hpa - cxlr_dax->hpa_range.start;
> +	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
> +}
> +
> +static int cxlr_rm_extent(struct device *dev, void *data)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	struct range *region_hpa_range = data;
> +
> +	if (!region_extent)
> +		return 0;
> +
> +	/*
> +	 * Any extent which 'touches' the released range is removed.
> +	 */
> +	if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) {
> +		dev_dbg(dev, "Remove region extent HPA %par\n",
> +			&region_extent->hpa_range);
> +		region_rm_extent(region_extent);
> +	}
> +	return 0;
> +}
> +
> +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
> +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range hpa_range, dpa_range;
> +	struct cxl_region *cxlr;
> +
> +	dpa_range = (struct range) {
> +		.start = start_dpa,
> +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> +	};
> +
> +	guard(rwsem_read)(&cxl_region_rwsem);
> +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> +	if (!cxlr) {
> +		memdev_release_extent(mds, &dpa_range);
> +		return -ENXIO;
> +	}
> +
> +	calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
> +
> +	/* Remove region extents which overlap */
> +	return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
> +				     cxlr_rm_extent);
> +}
> +
> +static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
> +			   struct cxl_endpoint_decoder *cxled,
> +			   struct cxled_extent *ed_extent)
> +{
> +	struct region_extent *region_extent;
> +	struct range hpa_range;
> +	int rc;
> +
> +	calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range);
> +
> +	region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag);
> +	if (IS_ERR(region_extent))
> +		return PTR_ERR(region_extent);
> +
> +	rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent, ed_extent,
> +		       GFP_KERNEL);
> +	if (rc) {
> +		free_region_extent(region_extent);
> +		return rc;
> +	}
> +
> +	/* device model handles freeing region_extent */
> +	return online_region_extent(region_extent);
> +}
> +
> +/* Callers are expected to ensure cxled has been attached to a region */
> +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
> +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range ed_range, ext_range;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct cxled_extent *ed_extent;
> +	struct cxl_region *cxlr;
> +	struct device *dev;
> +
> +	ext_range = (struct range) {
> +		.start = start_dpa,
> +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> +	};
> +
> +	guard(rwsem_read)(&cxl_region_rwsem);
> +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> +	if (!cxlr)
> +		return -ENXIO;
> +
> +	cxlr_dax = cxled->cxld.region->cxlr_dax;
> +	dev = &cxled->cxld.dev;
> +	ed_range = (struct range) {
> +		.start = cxled->dpa_res->start,
> +		.end = cxled->dpa_res->end,
> +	};
> +
> +	dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n",
> +		cxled->dpa_res, &ext_range);
> +
> +	if (!range_contains(&ed_range, &ext_range)) {
> +		dev_err_ratelimited(dev,
> +				    "DC extent DPA %par (%*phC) is not fully in ED %par\n",
> +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> +				    extent->tag, &ed_range);
> +		return -ENXIO;
> +	}
> +
> +	if (extents_contain(cxlr_dax, cxled, &ext_range))
> +		return 0;
> +
> +	if (extents_overlap(cxlr_dax, cxled, &ext_range))
> +		return -ENXIO;
> +
> +	ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL);
> +	if (!ed_extent)
> +		return -ENOMEM;
> +
> +	ed_extent->cxled = cxled;
> +	ed_extent->dpa_range = ext_range;
> +	memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN);
> +
> +	dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range,
> +		CXL_EXTENT_TAG_LEN, ed_extent->tag);
> +
> +	return cxlr_add_extent(cxlr_dax, cxled, ed_extent);
> +}
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 01a447aaa1b1..f629ad7488ac 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>  
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> +			       struct cxl_extent *extent)
> +{
> +	u64 start = le64_to_cpu(extent->start_dpa);
> +	u64 length = le64_to_cpu(extent->length);
> +	struct device *dev = mds->cxlds.dev;
> +
> +	struct range ext_range = (struct range){
> +		.start = start,
> +		.end = start + length - 1,
> +	};
> +
> +	if (le16_to_cpu(extent->shared_extn_seq) != 0) {
> +		dev_err_ratelimited(dev,
> +				    "DC extent DPA %par (%*phC) can not be shared\n",
> +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> +				    extent->tag);
> +		return -ENXIO;
> +	}
> +
> +	/* Extents must not cross DC region boundary's */
> +	for (int i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +		struct range region_range = (struct range) {
> +			.start = dcr->base,
> +			.end = dcr->base + dcr->decode_len - 1,
> +		};
> +
> +		if (range_contains(&region_range, &ext_range)) {
> +			dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n",
> +				&ext_range, i, start - dcr->base,
> +				CXL_EXTENT_TAG_LEN, extent->tag);
> +			return 0;
> +		}
> +	}
> +
> +	dev_err_ratelimited(dev,
> +			    "DC extent DPA %par (%*phC) is not in any DC region\n",
> +			    &ext_range, CXL_EXTENT_TAG_LEN, extent->tag);
> +	return -ENXIO;
> +}
> +
>  void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  			    enum cxl_event_log_type type,
>  			    enum cxl_event_type event_type,
> @@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
>  	return rc;
>  }
>  
> +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> +				struct xarray *extent_array, int cnt)
> +{
> +	struct cxl_mbox_dc_response *p;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	struct cxl_extent *extent;
> +	unsigned long index;
> +	u32 pl_index;
> +	int rc = 0;
> +
> +	size_t pl_size = struct_size(p, extent_list, cnt);
> +	u32 max_extents = cnt;
> +
> +	/* May have to use more bit on response. */
> +	if (pl_size > mds->payload_size) {
> +		max_extents = (mds->payload_size - sizeof(*p)) /
> +			      sizeof(struct updated_extent_list);
> +		pl_size = struct_size(p, extent_list, max_extents);
> +	}
> +
> +	struct cxl_mbox_dc_response *response __free(kfree) =
> +						kzalloc(pl_size, GFP_KERNEL);
> +	if (!response)
> +		return -ENOMEM;
> +
> +	pl_index = 0;
> +	xa_for_each(extent_array, index, extent) {
> +
> +		response->extent_list[pl_index].dpa_start = extent->start_dpa;
> +		response->extent_list[pl_index].length = extent->length;
> +		pl_index++;
> +		response->extent_list_size = cpu_to_le32(pl_index);
> +
> +		if (pl_index == max_extents) {
> +			mbox_cmd = (struct cxl_mbox_cmd) {
> +				.opcode = opcode,
> +				.size_in = struct_size(response, extent_list,
> +						       pl_index),
> +				.payload_in = response,
> +			};
> +
> +			response->flags = 0;
> +			if (pl_index < cnt)
> +				response->flags &= CXL_DCD_EVENT_MORE;

Should "response->flags |= CXL_DCD_EVENT_MORE"?

And seems like there is a bug if the value of 'cnt' is double the value of 'max_extents'. the response command will be sent in this xa_for_each() scope twice, and CXL_DCD_EVENT_MORE will be set for both times. because 'pl_index < cnt' is always true.


> +
> +			rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +			if (rc)
> +				return rc;
> +			pl_index = 0;
> +		}
> +	}
> +
> +	if (pl_index) {
> +		mbox_cmd = (struct cxl_mbox_cmd) {
> +			.opcode = opcode,
> +			.size_in = struct_size(response, extent_list,
> +					       pl_index),
> +			.payload_in = response,
> +		};
> +
> +		response->flags = 0;
> +		rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	}
> +
> +	return rc;
> +}
> +

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 04/25] cxl/pci: Delay event buffer allocation
  2024-08-16 14:44 ` [PATCH v3 04/25] cxl/pci: Delay event buffer allocation Ira Weiny
@ 2024-09-03  6:49   ` Li, Ming4
  2024-09-05 19:44   ` Fan Ni
  1 sibling, 0 replies; 120+ messages in thread
From: Li, Ming4 @ 2024-09-03  6:49 UTC (permalink / raw)
  To: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

On 8/16/2024 10:44 PM, Ira Weiny wrote:
> The event buffer does not need to be allocated if something has failed in
> setting up event irq's.
>
> In prep for adjusting event configuration for DCD events move the buffer
> allocation to the end of the event configuration.
>
> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Li Ming <ming4.li@intel.com>



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 05/25] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
  2024-08-16 14:44 ` [PATCH v3 05/25] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD) ira.weiny
@ 2024-09-03  6:50   ` Li, Ming4
  0 siblings, 0 replies; 120+ messages in thread
From: Li, Ming4 @ 2024-09-03  6:50 UTC (permalink / raw)
  To: ira.weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

On 8/16/2024 10:44 PM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
>
> Per the CXL 3.1 specification software must check the Command Effects
> Log (CEL) for dynamic capacity command support.
>
> Detect support for the DCD commands while reading the CEL, including:
>
> 	Get DC Config
> 	Get DC Extent List
> 	Add DC Response
> 	Release DC
>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
Reviewed-by: Li Ming <ming4.li@intel.com>


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 07/25] cxl/core: Separate region mode from decoder mode
  2024-08-16 14:44 ` [PATCH v3 07/25] cxl/core: Separate region mode from decoder mode ira.weiny
  2024-08-16 22:11   ` Dave Jiang
  2024-08-23 15:47   ` Jonathan Cameron
@ 2024-09-03  6:56   ` Li, Ming4
  2 siblings, 0 replies; 120+ messages in thread
From: Li, Ming4 @ 2024-09-03  6:56 UTC (permalink / raw)
  To: ira.weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

On 8/16/2024 10:44 PM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
>
> Until now region modes and decoder modes were equivalent in that both
> modes were either PMEM or RAM.  The addition of Dynamic
> Capacity partitions defines up to 8 DC partitions per device.
>
> The region mode is thus no longer equivalent to the endpoint decoder
> mode.  IOW the endpoint decoders may have modes of DC0-DC7 while the
> region mode is simply DC.
>
> Define a new region mode enumeration which applies to regions separate
> from the decoder mode.  Adjust the code to process these modes
> independently.
>
> There is no equal to decoder mode dead in region modes.  Avoid
> constructing regions with decoders which have been flagged as dead.
>
> Suggested-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Li Ming <ming4.li@intel.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 08/25] cxl/region: Add dynamic capacity decoder and region modes
  2024-08-16 14:44 ` [PATCH v3 08/25] cxl/region: Add dynamic capacity decoder and region modes ira.weiny
  2024-08-16 22:14   ` Dave Jiang
@ 2024-09-03  6:57   ` Li, Ming4
  1 sibling, 0 replies; 120+ messages in thread
From: Li, Ming4 @ 2024-09-03  6:57 UTC (permalink / raw)
  To: ira.weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

On 8/16/2024 10:44 PM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
>
> One or more decoders each pointing to a Dynamic Capacity (DC) partition
> form a CXL software region.  The region mode reflects composition of
> that entire software region.  Decoder mode reflects a specific DC
> partition.  DC partitions are also known as DC regions per CXL
> specification r3.1.
>
> Define the new modes and helper functions required to make the
> association between these new modes.
>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
Reviewed-by: Li Ming <ming4.li@intel.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 12/25] cxl/region: Refactor common create region code
  2024-08-16 14:44 ` [PATCH v3 12/25] cxl/region: Refactor common create region code Ira Weiny
                     ` (2 preceding siblings ...)
  2024-08-23 16:17   ` Jonathan Cameron
@ 2024-09-03  7:04   ` Li, Ming4
  3 siblings, 0 replies; 120+ messages in thread
From: Li, Ming4 @ 2024-09-03  7:04 UTC (permalink / raw)
  To: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

On 8/16/2024 10:44 PM, Ira Weiny wrote:
> create_pmem_region_store() and create_ram_region_store() are identical
> with the exception of the region mode.  With the addition of DC region
> mode this would end up being 3 copies of the same code.
>
> Refactor create_pmem_region_store() and create_ram_region_store() to use
> a single common function to be used in subsequent DC code.
>
> Suggested-by: Fan Ni <fan.ni@samsung.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Li Ming <ming4.li@intel.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 14/25] cxl/events: Split event msgnum configuration from irq setup
  2024-08-16 14:44 ` [PATCH v3 14/25] cxl/events: Split event msgnum configuration from irq setup Ira Weiny
                     ` (2 preceding siblings ...)
  2024-08-23 17:01   ` Jonathan Cameron
@ 2024-09-03  7:06   ` Li, Ming4
  3 siblings, 0 replies; 120+ messages in thread
From: Li, Ming4 @ 2024-09-03  7:06 UTC (permalink / raw)
  To: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

On 8/16/2024 10:44 PM, Ira Weiny wrote:
> Dynamic Capacity Devices (DCD) require event interrupts to process
> memory addition or removal.  BIOS may have control over non-DCD event
> processing.  DCD interrupt configuration needs to be separate from
> memory event interrupt configuration.
>
> Split cxl_event_config_msgnums() from irq setup in preparation for
> separate DCD interrupts configuration.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Li Ming <ming4.li@intel.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 15/25] cxl/pci: Factor out interrupt policy check
  2024-08-16 14:44 ` [PATCH v3 15/25] cxl/pci: Factor out interrupt policy check Ira Weiny
  2024-08-22 21:41   ` Fan Ni
@ 2024-09-03  7:07   ` Li, Ming4
  1 sibling, 0 replies; 120+ messages in thread
From: Li, Ming4 @ 2024-09-03  7:07 UTC (permalink / raw)
  To: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

On 8/16/2024 10:44 PM, Ira Weiny wrote:
> Dynamic Capacity Devices (DCD) require event interrupts to process
> memory addition or removal.  BIOS may have control over non-DCD event
> processing.  DCD interrupt configuration needs to be separate from
> memory event interrupt configuration.
>
> Factor out event interrupt setting validation.
>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
Reviewed-by: Li Ming <ming4.li@intel.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 16/25] cxl/mem: Configure dynamic capacity interrupts
  2024-08-16 14:44 ` [PATCH v3 16/25] cxl/mem: Configure dynamic capacity interrupts ira.weiny
  2024-08-17  0:02   ` Dave Jiang
  2024-08-23 17:08   ` Jonathan Cameron
@ 2024-09-03  7:09   ` Li, Ming4
  2 siblings, 0 replies; 120+ messages in thread
From: Li, Ming4 @ 2024-09-03  7:09 UTC (permalink / raw)
  To: ira.weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

On 8/16/2024 10:44 PM, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
>
> Dynamic Capacity Devices (DCD) support extent change notifications
> through the event log mechanism.  The interrupt mailbox commands were
> extended in CXL 3.1 to support these notifications.  Firmware can't
> configure DCD events to be FW controlled but can retain control of
> memory events.
>
> Configure DCD event log interrupts on devices supporting dynamic
> capacity.  Disable DCD if interrupts are not supported.
>
> Care is taken to preserve the interrupt policy set by the FW if FW first
> has been selected by the BIOS.
>
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Li Ming <ming4.li@intel.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 17/25] cxl/core: Return endpoint decoder information from region search
  2024-08-16 14:44 ` [PATCH v3 17/25] cxl/core: Return endpoint decoder information from region search Ira Weiny
  2024-08-19 16:35   ` Dave Jiang
  2024-08-23 17:12   ` Jonathan Cameron
@ 2024-09-03  7:10   ` Li, Ming4
  2 siblings, 0 replies; 120+ messages in thread
From: Li, Ming4 @ 2024-09-03  7:10 UTC (permalink / raw)
  To: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Navneet Singh,
	Chris Mason, Josef Bacik, David Sterba, Petr Mladek,
	Steven Rostedt, Andy Shevchenko, Rasmus Villemoes,
	Sergey Senozhatsky, Jonathan Corbet, Andrew Morton
  Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
	linux-btrfs, linux-cxl, linux-kernel, linux-doc, nvdimm

On 8/16/2024 10:44 PM, Ira Weiny wrote:
> cxl_dpa_to_region() finds the region from a <DPA, device> tuple.
> The search involves finding the device endpoint decoder as well.
>
> Dynamic capacity extent processing uses the endpoint decoder HPA
> information to calculate the HPA offset.  In addition, well behaved
> extents should be contained within an endpoint decoder.
>
> Return the endpoint decoder found to be used in subsequent DCD code.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Li Ming <ming4.li@intel.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents
  2024-08-16 14:44 ` [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents ira.weiny
                     ` (3 preceding siblings ...)
  2024-09-03  6:37   ` Li, Ming4
@ 2024-09-05 19:30   ` Fan Ni
  4 siblings, 0 replies; 120+ messages in thread
From: Fan Ni @ 2024-09-05 19:30 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm

On Fri, Aug 16, 2024 at 09:44:26AM -0500, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> A dynamic capacity device (DCD) sends events to signal the host for
> changes in the availability of Dynamic Capacity (DC) memory.  These
> events contain extents describing a DPA range and meta data for memory
> to be added or removed.  Events may be sent from the device at any time.
> 
> Three types of events can be signaled, Add, Release, and Force Release.
> 
> On add, the host may accept or reject the memory being offered.  If no
> region exists, or the extent is invalid, the extent should be rejected.
> Add extent events may be grouped by a 'more' bit which indicates those
> extents should be processed as a group.
> 
> On remove, the host can delay the response until the host is safely not
> using the memory.  If no region exists the release can be sent
> immediately.  The host may also release extents (or partial extents) at
> any time.  Thus the 'more' bit grouping of release events is of less
> value and can be ignored in favor of sending multiple release capacity
> responses for groups of release events.
> 
> Force removal is intended as a mechanism between the FM and the device
> and intended only when the host is unresponsive, out of sync, or
> otherwise broken.  Purposely ignore force removal events.
> 
> Regions are made up of one or more devices which may be surfacing memory
> to the host.  Once all devices in a region have surfaced an extent the
> region can expose a corresponding extent for the user to consume.
> Without interleaving a device extent forms a 1:1 relationship with the
> region extent.  Immediately surface a region extent upon getting a
> device extent.
> 
> Per the specification the device is allowed to offer or remove extents
> at any time.  However, anticipated use cases can expect extents to be
> offered, accepted, and removed in well defined chunks.
> 
> Simplify extent tracking with the following restrictions.
> 
> 	1) Flag for removal any extent which overlaps a requested
> 	   release range.
> 	2) Refuse the offer of extents which overlap already accepted
> 	   memory ranges.
> 	3) Accept again a range which has already been accepted by the
> 	   host.  (It is likely the device has an error because it
> 	   should already know that this range was accepted.  But from
> 	   the host point of view it is safe to acknowledge that
> 	   acceptance again.)
> 
> Management of the region extent devices must be synchronized with
> potential uses of the memory within the DAX layer.  Create region extent
> devices as children of the cxl_dax_region device such that the DAX
> region driver can co-drive them and synchronize with the DAX layer.
> Synchronization and management is handled in a subsequent patch.
> 
> Process DCD events and create region devices.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
One more minor inline.
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> +			       struct cxl_extent *extent)
> +{
> +	u64 start = le64_to_cpu(extent->start_dpa);
> +	u64 length = le64_to_cpu(extent->length);
> +	struct device *dev = mds->cxlds.dev;
> +
> +	struct range ext_range = (struct range){
> +		.start = start,
> +		.end = start + length - 1,
> +	};
> +
> +	if (le16_to_cpu(extent->shared_extn_seq) != 0) {
> +		dev_err_ratelimited(dev,
> +				    "DC extent DPA %par (%*phC) can not be shared\n",
> +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> +				    extent->tag);
> +		return -ENXIO;
> +	}
> +
> +	/* Extents must not cross DC region boundary's */
> +	for (int i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +		struct range region_range = (struct range) {
> +			.start = dcr->base,
> +			.end = dcr->base + dcr->decode_len - 1,
> +		};
> +
> +		if (range_contains(&region_range, &ext_range)) {
> +			dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n",
> +				&ext_range, i, start - dcr->base,
> +				CXL_EXTENT_TAG_LEN, extent->tag);
> +			return 0;
> +		}
> +	}

For extent validation, we may need to ensure its size is not 0 based on the spec.
Noted that during testing, do not see issue for the case as 0-sized
extents will be rejected when trying to add even though it passes the
validation.

Fan
 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 22/25] cxl/region: Read existing extents on region creation
  2024-08-16 14:44 ` [PATCH v3 22/25] cxl/region: Read existing extents on region creation ira.weiny
  2024-08-20  0:06   ` Dave Jiang
  2024-08-27 14:19   ` Jonathan Cameron
@ 2024-09-05 19:35   ` Fan Ni
  2 siblings, 0 replies; 120+ messages in thread
From: Fan Ni @ 2024-09-05 19:35 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm

On Fri, Aug 16, 2024 at 09:44:30AM -0500, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> Dynamic capacity device extents may be left in an accepted state on a
> device due to an unexpected host crash.  In this case it is expected
> that the creation of a new region on top of a DC partition can read
> those extents and surface them for continued use.
> 
> Once all endpoint decoders are part of a region and the region is being
> realized a read of the devices extent list can reveal these previously
> accepted extents.
> 
> CXL r3.1 specifies the mailbox call Get Dynamic Capacity Extent List for
> this purpose.  The call returns all the extents for all dynamic capacity
> partitions.  If the fabric manager is adding extents to any DCD
> partition, the extent list for the recovered region may change.  In this
> case the query must retry.  Upon retry the query could encounter extents
> which were accepted on a previous list query.  Adding such extents is
> ignored without error because they are entirely within a previous
> accepted extent.
> 
> The scan for existing extents races with the dax_cxl driver.  This is
> synchronized through the region device lock.  Extents which are found
> after the driver has loaded will surface through the normal notification
> path while extents seen prior to the driver are read during driver load.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

With the minor things fixed mentioned by Jonathan and Dave,

Reviewed-by: Fan Ni <fan.ni@samsung.com>

> 
> ---
> Changes:
> [iweiny: Leverage the new add path from the event processing code such
> 	 that the adding and surfacing of extents flows through the same
> 	 code path for both event processing and existing extents.
> 	 While this does validate existing extents again on start up
> 	 this is an error recovery case / new boot scenario and should
> 	 not cause any major issues while making the code more
> 	 straight forward and maintainable.]
> 
> [iweiny: use %par]
> [iweiny: rebase]
> [iweiny: Move this patch later in the series such that the realization
>          of extents can go through the same path as an add event]
> [Fan: Issue a retry if the gen number changes]
> [djiang: s/uint64_t/u64/]
> [djiang: update function names]
> [Jørgen/djbw: read the generation and total count on first iteration of
>               the Get Extent List call]
> [djbw: s/cxl_mbox_get_dc_extent_in/cxl_mbox_get_extent_in/]
> [djbw: s/cxl_mbox_get_dc_extent_out/cxl_mbox_get_extent_out/]
> [djbw/iweiny: s/cxl_read_dc_extents/cxl_read_extent_list]
> ---
>  drivers/cxl/core/core.h   |   2 +
>  drivers/cxl/core/mbox.c   | 100 ++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/region.c |  12 ++++++
>  drivers/cxl/cxlmem.h      |  21 ++++++++++
>  4 files changed, 135 insertions(+)
> 
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 8dfc97b2e0a4..9e54064a6f48 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -21,6 +21,8 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
>  	return container_of(cxlds, struct cxl_memdev_state, cxlds);
>  }
>  
> +void cxl_read_extent_list(struct cxl_endpoint_decoder *cxled);
> +
>  #ifdef CONFIG_CXL_REGION
>  extern struct device_attribute dev_attr_create_pmem_region;
>  extern struct device_attribute dev_attr_create_ram_region;
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index f629ad7488ac..d43ac8eabf56 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1670,6 +1670,106 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>  
> +/* Return -EAGAIN if the extent list changes while reading */
> +static int __cxl_read_extent_list(struct cxl_endpoint_decoder *cxled)
> +{
> +	u32 current_index, total_read, total_expected, initial_gen_num;
> +	struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> +	struct device *dev = mds->cxlds.dev;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	u32 max_extent_count;
> +	bool first = true;
> +
> +	struct cxl_mbox_get_extent_out *extents __free(kfree) =
> +				kvmalloc(mds->payload_size, GFP_KERNEL);
> +	if (!extents)
> +		return -ENOMEM;
> +
> +	total_read = 0;
> +	current_index = 0;
> +	total_expected = 0;
> +	max_extent_count = (mds->payload_size - sizeof(*extents)) /
> +				sizeof(struct cxl_extent);
> +	do {
> +		struct cxl_mbox_get_extent_in get_extent;
> +		u32 nr_returned, current_total, current_gen_num;
> +		int rc;
> +
> +		get_extent = (struct cxl_mbox_get_extent_in) {
> +			.extent_cnt = max(max_extent_count,
> +					  total_expected - current_index),
> +			.start_extent_index = cpu_to_le32(current_index),
> +		};
> +
> +		mbox_cmd = (struct cxl_mbox_cmd) {
> +			.opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> +			.payload_in = &get_extent,
> +			.size_in = sizeof(get_extent),
> +			.size_out = mds->payload_size,
> +			.payload_out = extents,
> +			.min_out = 1,
> +		};
> +
> +		rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +		if (rc < 0)
> +			return rc;
> +
> +		/* Save initial data */
> +		if (first) {
> +			total_expected = le32_to_cpu(extents->total_extent_count);
> +			initial_gen_num = le32_to_cpu(extents->generation_num);
> +			first = false;
> +		}
> +
> +		nr_returned = le32_to_cpu(extents->returned_extent_count);
> +		total_read += nr_returned;
> +		current_total = le32_to_cpu(extents->total_extent_count);
> +		current_gen_num = le32_to_cpu(extents->generation_num);
> +
> +		dev_dbg(dev, "Got extent list %d-%d of %d generation Num:%d\n",
> +			current_index, total_read - 1, current_total, current_gen_num);
> +
> +		if (current_gen_num != initial_gen_num || total_expected != current_total) {
> +			dev_dbg(dev, "Extent list change detected; gen %u != %u : cnt %u != %u\n",
> +				current_gen_num, initial_gen_num,
> +				total_expected, current_total);
> +			return -EAGAIN;
> +		}
> +
> +		for (int i = 0; i < nr_returned ; i++) {
> +			struct cxl_extent *extent = &extents->extent[i];
> +
> +			dev_dbg(dev, "Processing extent %d/%d\n",
> +				current_index + i, total_expected);
> +
> +			rc = validate_add_extent(mds, extent);
> +			if (rc)
> +				continue;
> +		}
> +
> +		current_index += nr_returned;
> +	} while (total_expected > total_read);
> +
> +	return 0;
> +}
> +
> +/**
> + * cxl_read_extent_list() - Read existing extents
> + * @cxled: Endpoint decoder which is part of a region
> + *
> + * Issue the Get Dynamic Capacity Extent List command to the device
> + * and add existing extents if found.
> + */
> +void cxl_read_extent_list(struct cxl_endpoint_decoder *cxled)
> +{
> +	int retry = 10;
> +	int rc;
> +
> +	do {
> +		rc = __cxl_read_extent_list(cxled);
> +	} while (rc == -EAGAIN && retry--);
> +}
> +
>  static int add_dpa_res(struct device *dev, struct resource *parent,
>  		       struct resource *res, resource_size_t start,
>  		       resource_size_t size, const char *type)
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 8c9171f914fb..885fb3004784 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3190,6 +3190,15 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
>  	return rc;
>  }
>  
> +static void cxlr_add_existing_extents(struct cxl_region *cxlr)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +	int i;
> +
> +	for (i = 0; i < p->nr_targets; i++)
> +		cxl_read_extent_list(p->targets[i]);
> +}
> +
>  static void cxlr_dax_unregister(void *_cxlr_dax)
>  {
>  	struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> @@ -3227,6 +3236,9 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
>  	dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
>  		dev_name(dev));
>  
> +	if (cxlr->mode == CXL_REGION_DC)
> +		cxlr_add_existing_extents(cxlr);
> +
>  	return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
>  					cxlr_dax);
>  err:
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 3a40fe1f0be7..11c03637488d 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -624,6 +624,27 @@ struct cxl_mbox_dc_response {
>  	} __packed extent_list[];
>  } __packed;
>  
> +/*
> + * Get Dynamic Capacity Extent List; Input Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
> + */
> +struct cxl_mbox_get_extent_in {
> +	__le32 extent_cnt;
> +	__le32 start_extent_index;
> +} __packed;
> +
> +/*
> + * Get Dynamic Capacity Extent List; Output Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
> + */
> +struct cxl_mbox_get_extent_out {
> +	__le32 returned_extent_count;
> +	__le32 total_extent_count;
> +	__le32 generation_num;
> +	u8 rsvd[4];
> +	struct cxl_extent extent[];
> +} __packed;
> +
>  struct cxl_mbox_get_supported_logs {
>  	__le16 entries;
>  	u8 rsvd[6];
> 
> -- 
> 2.45.2
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 23/25] cxl/mem: Trace Dynamic capacity Event Record
  2024-08-16 14:44 ` [PATCH v3 23/25] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
  2024-08-20 22:54   ` Dave Jiang
  2024-08-27 14:20   ` Jonathan Cameron
@ 2024-09-05 19:38   ` Fan Ni
  2 siblings, 0 replies; 120+ messages in thread
From: Fan Ni @ 2024-09-05 19:38 UTC (permalink / raw)
  To: ira.weiny
  Cc: Dave Jiang, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm

On Fri, Aug 16, 2024 at 09:44:31AM -0500, ira.weiny@intel.com wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Records.
> User space can use trace events for debugging of DC capacity changes.
> 
> Add DC trace points to the trace log.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
With the following cxl spec version fixed, 

Reviewed-by: Fan Ni <fan.ni@samsung.com>

> ---
> Changes:
> [Alison: Update commit message]
> ---
>  drivers/cxl/core/mbox.c  |  4 +++
>  drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 69 insertions(+)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index d43ac8eabf56..8202fc6c111d 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -977,6 +977,10 @@ static void __cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  		ev_type = CXL_CPER_EVENT_DRAM;
>  	else if (uuid_equal(uuid, &CXL_EVENT_MEM_MODULE_UUID))
>  		ev_type = CXL_CPER_EVENT_MEM_MODULE;
> +	else if (uuid_equal(uuid, &CXL_EVENT_DC_EVENT_UUID)) {
> +		trace_cxl_dynamic_capacity(cxlmd, type, &record->event.dcd);
> +		return;
> +	}
>  
>  	cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
>  }
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index 9167cfba7f59..a3a5269311ee 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -731,6 +731,71 @@ TRACE_EVENT(cxl_poison,
>  	)
>  );
>  
> +/*
> + * DYNAMIC CAPACITY Event Record - DER
> + *
> + * CXL rev 3.0 section 8.2.9.2.1.5 Table 8-47
Update to reflect r3.1
Fan
> + */
> +
> +#define CXL_DC_ADD_CAPACITY			0x00
> +#define CXL_DC_REL_CAPACITY			0x01
> +#define CXL_DC_FORCED_REL_CAPACITY		0x02
> +#define CXL_DC_REG_CONF_UPDATED			0x03
> +#define show_dc_evt_type(type)	__print_symbolic(type,		\
> +	{ CXL_DC_ADD_CAPACITY,	"Add capacity"},		\
> +	{ CXL_DC_REL_CAPACITY,	"Release capacity"},		\
> +	{ CXL_DC_FORCED_REL_CAPACITY,	"Forced capacity release"},	\
> +	{ CXL_DC_REG_CONF_UPDATED,	"Region Configuration Updated"	} \
> +)
> +
> +TRACE_EVENT(cxl_dynamic_capacity,
> +
> +	TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
> +		 struct cxl_event_dcd *rec),
> +
> +	TP_ARGS(cxlmd, log, rec),
> +
> +	TP_STRUCT__entry(
> +		CXL_EVT_TP_entry
> +
> +		/* Dynamic capacity Event */
> +		__field(u8, event_type)
> +		__field(u16, hostid)
> +		__field(u8, region_id)
> +		__field(u64, dpa_start)
> +		__field(u64, length)
> +		__array(u8, tag, CXL_EXTENT_TAG_LEN)
> +		__field(u16, sh_extent_seq)
> +	),
> +
> +	TP_fast_assign(
> +		CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
> +
> +		/* Dynamic_capacity Event */
> +		__entry->event_type = rec->event_type;
> +
> +		/* DCD event record data */
> +		__entry->hostid = le16_to_cpu(rec->host_id);
> +		__entry->region_id = rec->region_index;
> +		__entry->dpa_start = le64_to_cpu(rec->extent.start_dpa);
> +		__entry->length = le64_to_cpu(rec->extent.length);
> +		memcpy(__entry->tag, &rec->extent.tag, CXL_EXTENT_TAG_LEN);
> +		__entry->sh_extent_seq = le16_to_cpu(rec->extent.shared_extn_seq);
> +	),
> +
> +	CXL_EVT_TP_printk("event_type='%s' host_id='%d' region_id='%d' " \
> +		"starting_dpa=%llx length=%llx tag=%s " \
> +		"shared_extent_sequence=%d",
> +		show_dc_evt_type(__entry->event_type),
> +		__entry->hostid,
> +		__entry->region_id,
> +		__entry->dpa_start,
> +		__entry->length,
> +		__print_hex(__entry->tag, CXL_EXTENT_TAG_LEN),
> +		__entry->sh_extent_seq
> +	)
> +);
> +
>  #endif /* _CXL_EVENTS_H */
>  
>  #define TRACE_INCLUDE_FILE trace
> 
> -- 
> 2.45.2
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 04/25] cxl/pci: Delay event buffer allocation
  2024-08-16 14:44 ` [PATCH v3 04/25] cxl/pci: Delay event buffer allocation Ira Weiny
  2024-09-03  6:49   ` Li, Ming4
@ 2024-09-05 19:44   ` Fan Ni
  1 sibling, 0 replies; 120+ messages in thread
From: Fan Ni @ 2024-09-05 19:44 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Jiang, Jonathan Cameron, Navneet Singh, Chris Mason,
	Josef Bacik, David Sterba, Petr Mladek, Steven Rostedt,
	Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky,
	Jonathan Corbet, Andrew Morton, Dan Williams, Davidlohr Bueso,
	Alison Schofield, Vishal Verma, linux-btrfs, linux-cxl,
	linux-kernel, linux-doc, nvdimm

On Fri, Aug 16, 2024 at 09:44:12AM -0500, Ira Weiny wrote:
> The event buffer does not need to be allocated if something has failed in
> setting up event irq's.
> 
> In prep for adjusting event configuration for DCD events move the buffer
> allocation to the end of the event configuration.
> 
> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---

Reviewed-by: Fan Ni <fan.ni@samsung.com>

> Changes:
> [iweiny: keep tags for early simple patch]
> [Davidlohr, Jonathan, djiang: move to beginning of series]
> 	[Dave feel free to pick this up if you like]
> ---
>  drivers/cxl/pci.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 4be35dc22202..3a60cd66263e 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -760,10 +760,6 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  		return 0;
>  	}
>  
> -	rc = cxl_mem_alloc_event_buf(mds);
> -	if (rc)
> -		return rc;
> -
>  	rc = cxl_event_get_int_policy(mds, &policy);
>  	if (rc)
>  		return rc;
> @@ -777,6 +773,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  		return -EBUSY;
>  	}
>  
> +	rc = cxl_mem_alloc_event_buf(mds);
> +	if (rc)
> +		return rc;
> +
>  	rc = cxl_event_irqsetup(mds);
>  	if (rc)
>  		return rc;
> 
> -- 
> 2.45.2
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 24/25] tools/testing/cxl: Make event logs dynamic
  2024-08-27 14:32   ` Jonathan Cameron
@ 2024-09-09 13:57     ` Ira Weiny
  0 siblings, 0 replies; 120+ messages in thread
From: Ira Weiny @ 2024-09-09 13:57 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

Jonathan Cameron wrote:
> On Fri, 16 Aug 2024 09:44:32 -0500
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > The test event logs were created as static arrays as an easy way to mock
> > events.  Dynamic Capacity Device (DCD) test support requires events be
> > generated dynamically when extents are created or destroyed.
> > 
> > Modify the event log storage to be dynamically allocated.  Reuse the
> > static event data to create the dynamic events in the new logs without
> > inventing complex event injection for the previous tests.  Simplify the
> > processing of the logs by using the event log array index as the handle.
> > Add a lock to manage concurrency required when user space is allowed to
> > control DCD extents
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Probably make sense to spinkle some guard() magic in here
> to avoid all the places where you goto end of function to release the lock

Yes.  Sorry this patch did not get as much self-review as it should have.

> > 
> > ---
> > Changes:
> > [iweiny: rebase]
> > ---
> >  tools/testing/cxl/test/mem.c | 278 ++++++++++++++++++++++++++-----------------
> >  1 file changed, 171 insertions(+), 107 deletions(-)
> > 
> > diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
> > index 129f179b0ac5..674fc7f086cd 100644
> > --- a/tools/testing/cxl/test/mem.c
> > +++ b/tools/testing/cxl/test/mem.c
> > @@ -125,18 +125,27 @@ static struct {
> >  
> >  #define PASS_TRY_LIMIT 3
> >  
> > -#define CXL_TEST_EVENT_CNT_MAX 15
> > +#define CXL_TEST_EVENT_CNT_MAX 17
> 
> Seems you added a couple more. Don't do that in a patch
> just changing allocation approach.
> 
> I could find 1 but not sure where other one came from!

I wasn't sure either.  see below...


[snip]

> > @@ -233,8 +254,8 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
> >  {
> >  	struct cxl_get_event_payload *pl;
> >  	struct mock_event_log *log;
> > -	u16 nr_overflow;
> >  	u8 log_type;
> > +	u16 handle;
> >  	int i;
> >  
> >  	if (cmd->size_in != sizeof(log_type))
> > @@ -254,29 +275,39 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
> >  	memset(cmd->payload_out, 0, struct_size(pl, records, 0));
> >  
> >  	log = event_find_log(dev, log_type);
> > -	if (!log || event_log_empty(log))
> > +	if (!log)
> >  		return 0;
> >  
> >  	pl = cmd->payload_out;
> >  
> > -	for (i = 0; i < ret_limit && !event_log_empty(log); i++) {
> > -		memcpy(&pl->records[i], event_get_current(log),
> > -		       sizeof(pl->records[i]));
> > -		pl->records[i].event.generic.hdr.handle =
> > -				event_get_cur_event_handle(log);
> > -		log->cur_idx++;
> > +	read_lock(&log->lock);
> > +
> > +	handle = log->cur_handle;
> > +	dev_dbg(dev, "Get log %d handle %u next %u\n",
> > +		log_type, handle, log->next_handle);
> > +	for (i = 0;
> > +	     i < ret_limit && handle != log->next_handle;
> As below, maybe combine 2 lines above into 1.

Ok. done.

> 
> 
> > +	     i++, event_inc_handle(&handle)) {
> > +		struct cxl_event_record_raw *cur;
> > +
> > +		cur = log->events[handle];
> > +		dev_dbg(dev, "Sending event log %d handle %d idx %u\n",
> > +			log_type, le16_to_cpu(cur->event.generic.hdr.handle),
> > +			handle);
> > +		memcpy(&pl->records[i], cur, sizeof(pl->records[i]));
> > +		pl->records[i].event.generic.hdr.handle = cpu_to_le16(handle);
> >  	}
> >  
> >  	cmd->size_out = struct_size(pl, records, i);
> >  	pl->record_count = cpu_to_le16(i);
> > -	if (!event_log_empty(log))
> > +	if (log->nr_events > i)
> >  		pl->flags |= CXL_GET_EVENT_FLAG_MORE_RECORDS;
> >  
> >  	if (log->nr_overflow) {
> >  		u64 ns;
> >  
> >  		pl->flags |= CXL_GET_EVENT_FLAG_OVERFLOW;
> > -		pl->overflow_err_count = cpu_to_le16(nr_overflow);
> > +		pl->overflow_err_count = cpu_to_le16(log->nr_overflow);
> >  		ns = ktime_get_real_ns();
> >  		ns -= 5000000000; /* 5s ago */
> >  		pl->first_overflow_timestamp = cpu_to_le64(ns);
> > @@ -285,16 +316,17 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
> >  		pl->last_overflow_timestamp = cpu_to_le64(ns);
> >  	}
> >  
> > +	read_unlock(&log->lock);
> Another one maybe for guard()

done.

> 
> >  	return 0;
> >  }
> >  
> >  static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
> >  {
> >  	struct cxl_mbox_clear_event_payload *pl = cmd->payload_in;
> > -	struct mock_event_log *log;
> >  	u8 log_type = pl->event_log;
> > +	struct mock_event_log *log;
> > +	int nr, rc = 0;
> >  	u16 handle;
> > -	int nr;
> >  
> >  	if (log_type >= CXL_EVENT_TYPE_MAX)
> >  		return -EINVAL;
> > @@ -303,24 +335,23 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
> >  	if (!log)
> >  		return 0; /* No mock data in this log */
> >  
> > -	/*
> > -	 * This check is technically not invalid per the specification AFAICS.
> > -	 * (The host could 'guess' handles and clear them in order).
> > -	 * However, this is not good behavior for the host so test it.
> > -	 */
> > -	if (log->clear_idx + pl->nr_recs > log->cur_idx) {
> > -		dev_err(dev,
> > -			"Attempting to clear more events than returned!\n");
> > -		return -EINVAL;
> > -	}
> > +	write_lock(&log->lock);
> Use a guard()?

done.

> >  
> >  	/* Check handle order prior to clearing events */
> > -	for (nr = 0, handle = event_get_clear_handle(log);
> > -	     nr < pl->nr_recs;
> > -	     nr++, handle++) {
> > +	handle = log->cur_handle;
> > +	for (nr = 0;
> > +	     nr < pl->nr_recs && handle != log->next_handle;
> 
> I'd combine the two lines above.

Ok. done.

> 
> > +	     nr++, event_inc_handle(&handle)) {
> > +
> > +		dev_dbg(dev, "Checking clear of %d handle %u plhandle %u\n",
> > +			log_type, handle,
> > +			le16_to_cpu(pl->handles[nr]));
> > +
> >  		if (handle != le16_to_cpu(pl->handles[nr])) {
> > -			dev_err(dev, "Clearing events out of order\n");
> > -			return -EINVAL;
> > +			dev_err(dev, "Clearing events out of order %u %u\n",
> > +				handle, le16_to_cpu(pl->handles[nr]));
> > +			rc = -EINVAL;
> > +			goto unlock;
> >  		}
> >  	}
> >  
> > @@ -328,25 +359,12 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
> >  		log->nr_overflow = 0;
> >  
> >  	/* Clear events */
> > -	log->clear_idx += pl->nr_recs;
> > -	return 0;
> > -}
> 
> >  
> >  struct cxl_event_record_raw maint_needed = {
> > @@ -475,8 +493,27 @@ static int mock_set_timestamp(struct cxl_dev_state *cxlds,
> >  	return 0;
> >  }
> >  
> 
> > +static void cxl_mock_add_event_logs(struct cxl_mockmem_data *mdata)
> >  {
> > +	struct mock_event_store *mes = &mdata->mes;
> > +	struct device *dev = mdata->mds->cxlds.dev;
> > +
> >  	put_unaligned_le16(CXL_GMER_VALID_CHANNEL | CXL_GMER_VALID_RANK,
> >  			   &gen_media.rec.media_hdr.validity_flags);
> >  
> > @@ -484,43 +521,60 @@ static void cxl_mock_add_event_logs(struct mock_event_store *mes)
> >  			   CXL_DER_VALID_BANK | CXL_DER_VALID_COLUMN,
> >  			   &dram.rec.media_hdr.validity_flags);
> >  
> > -	mes_add_event(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
> > +	dev_dbg(dev, "Generating fake event logs %d\n",
> > +		CXL_EVENT_TYPE_INFO);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_INFO, &maint_needed);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
> >  		      (struct cxl_event_record_raw *)&gen_media);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_INFO,
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
> >  		      (struct cxl_event_record_raw *)&mem_module);
> >  	mes->ev_status |= CXLDEV_EVENT_STATUS_INFO;
> >  
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> > +	dev_dbg(dev, "Generating fake event logs %d\n",
> > +		CXL_EVENT_TYPE_FAIL);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &maint_needed);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
> > +		      (struct cxl_event_record_raw *)&mem_module);
> 
> So this one is new?  I can't spot the other one...

Its coming back to me now.  The cxl-events.sh test relied on an expected number
of each type of event (Including an overflow count) which were completely
fabricated previous to this patch.

	num_overflow_expected=1
	num_fatal_expected=2
	num_failure_expected=16
	num_info_expected=3

To maintain backwards compatibility this new code needed to preserve those
counts.  The buffers and number of entries were adjusted to make the output
match.  However now the logs need to actually over flow to create the overflow
error.  Furthermore, the handles are the array entries.  cxl-events.sh passes
before and after this patch.

That said, my math was wrong.  A max of 16 with 16+ entries added to the
failure log should result in the counts above.  I added a couple extra to the
overflow though.

Good catch on this.  I basically hacked it to match and moved on.  I've cleaned
it up for the next version.

> 
> 
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
> >  		      (struct cxl_event_record_raw *)&dram);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
> >  		      (struct cxl_event_record_raw *)&gen_media);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
> >  		      (struct cxl_event_record_raw *)&mem_module);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
> >  		      (struct cxl_event_record_raw *)&dram);
> >  	/* Overflow this log */
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
> >  	mes->ev_status |= CXLDEV_EVENT_STATUS_FAIL;
> >  
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
> > -	mes_add_event(mes, CXL_EVENT_TYPE_FATAL,
> > +	dev_dbg(dev, "Generating fake event logs %d\n",
> > +		CXL_EVENT_TYPE_FATAL);
> The dev_dbg() fine but not really part of making it dynamic, so adds
> a bit of noise. Maybe not worth splitting out though.

It's just debugging that we are indeed adding these to the now dynamic list.  I
added a print for each type.  I've added even more debugging with the clean up.

I'm going to leave it in for now because it is part of ensuring the dynamic
events work.

Thanks,
Ira

> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL, &hardware_replace);
> > +	add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL,
> >  		      (struct cxl_event_record_raw *)&dram);
> >  	mes->ev_status |= CXLDEV_EVENT_STATUS_FATAL;
> >  }
> 
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 25/25] tools/testing/cxl: Add DC Regions to mock mem data
  2024-08-27 14:39   ` Jonathan Cameron
@ 2024-09-09 14:08     ` Ira Weiny
  0 siblings, 0 replies; 120+ messages in thread
From: Ira Weiny @ 2024-09-09 14:08 UTC (permalink / raw)
  To: Jonathan Cameron, Ira Weiny
  Cc: Dave Jiang, Fan Ni, Navneet Singh, Chris Mason, Josef Bacik,
	David Sterba, Petr Mladek, Steven Rostedt, Andy Shevchenko,
	Rasmus Villemoes, Sergey Senozhatsky, Jonathan Corbet,
	Andrew Morton, Dan Williams, Davidlohr Bueso, Alison Schofield,
	Vishal Verma, linux-btrfs, linux-cxl, linux-kernel, linux-doc,
	nvdimm

Jonathan Cameron wrote:
> On Fri, 16 Aug 2024 09:44:33 -0500
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > cxl_test provides a good way to ensure quick smoke and regression
> > testing.  The complexity of Dynamic Capacity (DC) extent processing as
> > well as the complexity of the new sparse DAX regions can mostly be
> > tested through cxl_test.  This includes management of sparse regions and
> > DAX devices on those regions; the management of extent device lifetimes;
> > and the processing of DCD events.
> > 
> > The only missing functionality from this test is actual interrupt
> > processing.
> > 
> > Mock memory devices can easily mock DC information and manage fake
> > extent data.
> > 
> > Define mock_dc_region information within the mock memory data.  Add
> > sysfs entries on the mock device to inject and delete extents.
> > 
> > The inject format is <start>:<length>:<tag>:<more_flag>
> > The delete format is <start>:<length>
> > 
> > Directly call the event irq callback to simulate irqs to process the
> > test extents.
> > 
> > Add DC mailbox commands to the CEL and implement those commands.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Minor stuff inline.
> 
> Thanks,
> 
> Jonathan
> 
> > +static int mock_get_dc_config(struct device *dev,
> > +			      struct cxl_mbox_cmd *cmd)
> > +{
> > +	struct cxl_mbox_get_dc_config_in *dc_config = cmd->payload_in;
> > +	struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
> > +	u8 region_requested, region_start_idx, region_ret_cnt;
> > +	struct cxl_mbox_get_dc_config_out *resp;
> > +	int i;
> > +
> > +	region_requested = dc_config->region_count;
> > +	if (region_requested > NUM_MOCK_DC_REGIONS)
> > +		region_requested = NUM_MOCK_DC_REGIONS;
> 
> 	region_requested = min(...)

Sure.

> 
> > +
> > +	if (cmd->size_out < struct_size(resp, region, region_requested))
> > +		return -EINVAL;
> > +
> > +	memset(cmd->payload_out, 0, cmd->size_out);
> > +	resp = cmd->payload_out;
> > +
> > +	region_start_idx = dc_config->start_region_index;
> > +	region_ret_cnt = 0;
> > +	for (i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
> > +		if (i >= region_start_idx) {
> > +			memcpy(&resp->region[region_ret_cnt],
> > +				&mdata->dc_regions[i],
> > +				sizeof(resp->region[region_ret_cnt]));
> > +			region_ret_cnt++;
> > +		}
> > +	}
> > +	resp->avail_region_count = NUM_MOCK_DC_REGIONS;
> > +	resp->regions_returned = i;
> > +
> > +	dev_dbg(dev, "Returning %d dc regions\n", region_ret_cnt);
> > +	return 0;
> > +}
> 
> 
> 
> > +static void cxl_mock_mem_remove(struct platform_device *pdev)
> > +{
> > +	struct cxl_mockmem_data *mdata = dev_get_drvdata(&pdev->dev);
> > +	struct cxl_memdev_state *mds = mdata->mds;
> > +
> > +	dev_dbg(mds->cxlds.dev, "Removing extents\n");
> 
> Clean this up as it doesn't do anything!

Opps...  Must have been some previous xarray thing.  Or perhaps I'm leaking
some memory here?  I'll double check.

> 
> > +}
> > +
> 
> > @@ -1689,14 +2142,261 @@ static ssize_t sanitize_timeout_store(struct device *dev,
> >  
> >  	return count;
> >  }
> > -
> Grump ;)  No whitespace changes in a patch doing anything 'useful'.
> >  static DEVICE_ATTR_RW(sanitize_timeout);
> >  
> 
> > +static int log_dc_event(struct cxl_mockmem_data *mdata, enum dc_event type,
> > +			u64 start, u64 length, const char *tag_str, bool more)
> > +{
> > +	struct device *dev = mdata->mds->cxlds.dev;
> > +	struct cxl_test_dcd *dcd_event;
> > +
> > +	dev_dbg(dev, "mock device log event %d\n", type);
> > +
> > +	dcd_event = devm_kmemdup(dev, &dcd_event_rec_template,
> > +				     sizeof(*dcd_event), GFP_KERNEL);
> > +	if (!dcd_event)
> > +		return -ENOMEM;
> > +
> > +	dcd_event->rec.flags = 0;
> > +	if (more)
> > +		dcd_event->rec.flags |= CXL_DCD_EVENT_MORE;
> > +	dcd_event->rec.event_type = type;
> > +	dcd_event->rec.extent.start_dpa = cpu_to_le64(start);
> > +	dcd_event->rec.extent.length = cpu_to_le64(length);
> > +	memcpy(dcd_event->rec.extent.tag, tag_str,
> > +	       min(sizeof(dcd_event->rec.extent.tag),
> > +		   strlen(tag_str)));
> > +
> > +	mes_add_event(mdata, CXL_EVENT_TYPE_DCD,
> > +		      (struct cxl_event_record_raw *)dcd_event);
> I guess this is where the missing event in previous patch come from.
> 
> Increment the number here, not back in that patch.

No that patch needed to pass the counts correctly for the event test.  DCD
event records have nothing to do with that.

I've cleaned up the patch.

Ira

^ permalink raw reply	[flat|nested] 120+ messages in thread

end of thread, other threads:[~2024-09-09 14:08 UTC | newest]

Thread overview: 120+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-16 14:44 [PATCH v3 00/25] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
2024-08-16 14:44 ` [PATCH v3 01/25] range: Add range_overlaps() Ira Weiny
2024-08-16 14:44 ` [PATCH v3 02/25] printk: Add print format (%par) for struct range Ira Weiny
2024-08-20 14:08   ` Petr Mladek
2024-08-22 17:53     ` Ira Weiny
2024-08-22 18:10       ` Andy Shevchenko
2024-08-26 13:23         ` Petr Mladek
2024-08-26 17:23           ` Andy Shevchenko
2024-08-26 21:17             ` Ira Weiny
2024-08-27  7:43               ` Petr Mladek
2024-08-27 13:21                 ` Andy Shevchenko
2024-08-27 21:44                 ` Ira Weiny
2024-08-27 13:17               ` Andy Shevchenko
2024-08-28  4:12                 ` Ira Weiny
2024-08-28 13:50                   ` Andy Shevchenko
2024-08-26 13:17       ` Petr Mladek
2024-08-26 13:24         ` Andy Shevchenko
2024-08-16 14:44 ` [PATCH v3 03/25] dax: Document dax dev range tuple Ira Weiny
2024-08-16 20:58   ` Dave Jiang
2024-08-23 15:29   ` Jonathan Cameron
2024-08-16 14:44 ` [PATCH v3 04/25] cxl/pci: Delay event buffer allocation Ira Weiny
2024-09-03  6:49   ` Li, Ming4
2024-09-05 19:44   ` Fan Ni
2024-08-16 14:44 ` [PATCH v3 05/25] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD) ira.weiny
2024-09-03  6:50   ` Li, Ming4
2024-08-16 14:44 ` [PATCH v3 06/25] cxl/mem: Read dynamic capacity configuration from the device ira.weiny
2024-08-16 21:45   ` Dave Jiang
2024-08-20 17:01     ` Fan Ni
2024-08-23  2:01       ` Ira Weiny
2024-08-23  2:02       ` Ira Weiny
2024-08-23 15:45   ` Jonathan Cameron
2024-08-16 14:44 ` [PATCH v3 07/25] cxl/core: Separate region mode from decoder mode ira.weiny
2024-08-16 22:11   ` Dave Jiang
2024-08-23 15:47   ` Jonathan Cameron
2024-09-03  6:56   ` Li, Ming4
2024-08-16 14:44 ` [PATCH v3 08/25] cxl/region: Add dynamic capacity decoder and region modes ira.weiny
2024-08-16 22:14   ` Dave Jiang
2024-09-03  6:57   ` Li, Ming4
2024-08-16 14:44 ` [PATCH v3 09/25] cxl/hdm: Add dynamic capacity size support to endpoint decoders ira.weiny
2024-08-16 23:08   ` Dave Jiang
2024-08-23  2:26     ` Ira Weiny
2024-08-23 16:09   ` Jonathan Cameron
2024-08-16 14:44 ` [PATCH v3 10/25] cxl/port: Add endpoint decoder DC mode support to sysfs ira.weiny
2024-08-16 23:17   ` Dave Jiang
2024-08-23 16:12   ` Jonathan Cameron
2024-08-16 14:44 ` [PATCH v3 11/25] cxl/mem: Expose DCD partition capabilities in sysfs ira.weiny
2024-08-16 23:42   ` Dave Jiang
2024-08-23  2:28     ` Ira Weiny
2024-08-23 14:58       ` Dave Jiang
2024-08-23 16:14       ` Jonathan Cameron
2024-08-16 14:44 ` [PATCH v3 12/25] cxl/region: Refactor common create region code Ira Weiny
2024-08-16 23:43   ` Dave Jiang
2024-08-22 18:51   ` Fan Ni
2024-08-23 16:17   ` Jonathan Cameron
2024-09-03  7:04   ` Li, Ming4
2024-08-16 14:44 ` [PATCH v3 13/25] cxl/region: Add sparse DAX region support ira.weiny
2024-08-16 23:51   ` Dave Jiang
2024-08-22 18:50   ` Fan Ni
2024-08-23 16:59   ` Jonathan Cameron
2024-09-03  2:15   ` Li, Ming4
2024-08-16 14:44 ` [PATCH v3 14/25] cxl/events: Split event msgnum configuration from irq setup Ira Weiny
2024-08-16 23:57   ` Dave Jiang
2024-08-22 21:39   ` Fan Ni
2024-08-23 17:01   ` Jonathan Cameron
2024-09-03  7:06   ` Li, Ming4
2024-08-16 14:44 ` [PATCH v3 15/25] cxl/pci: Factor out interrupt policy check Ira Weiny
2024-08-22 21:41   ` Fan Ni
2024-09-03  7:07   ` Li, Ming4
2024-08-16 14:44 ` [PATCH v3 16/25] cxl/mem: Configure dynamic capacity interrupts ira.weiny
2024-08-17  0:02   ` Dave Jiang
2024-08-23 17:08   ` Jonathan Cameron
2024-09-03  7:09   ` Li, Ming4
2024-08-16 14:44 ` [PATCH v3 17/25] cxl/core: Return endpoint decoder information from region search Ira Weiny
2024-08-19 16:35   ` Dave Jiang
2024-08-23 17:12   ` Jonathan Cameron
2024-09-03  7:10   ` Li, Ming4
2024-08-16 14:44 ` [PATCH v3 18/25] cxl/extent: Process DCD events and realize region extents ira.weiny
2024-08-19 18:51   ` Dave Jiang
2024-08-23  2:53     ` Ira Weiny
2024-08-23 21:32   ` Fan Ni
2024-08-27 12:08     ` Jonathan Cameron
2024-08-27 16:02       ` Fan Ni
2024-08-27 13:18   ` Jonathan Cameron
2024-08-29 21:16     ` Ira Weiny
2024-08-30  9:21       ` Jonathan Cameron
2024-09-03  6:37   ` Li, Ming4
2024-09-05 19:30   ` Fan Ni
2024-08-16 14:44 ` [PATCH v3 19/25] cxl/region/extent: Expose region extent information in sysfs ira.weiny
2024-08-19 19:05   ` Dave Jiang
2024-08-23  2:58     ` Ira Weiny
2024-08-23 17:17       ` Jonathan Cameron
2024-08-23 17:19   ` Jonathan Cameron
2024-08-28 17:44   ` Fan Ni
2024-08-16 14:44 ` [PATCH v3 20/25] dax/bus: Factor out dev dax resize logic Ira Weiny
2024-08-19 22:35   ` Dave Jiang
2024-08-27 13:26   ` Jonathan Cameron
2024-08-29 21:36     ` Ira Weiny
2024-08-16 14:44 ` [PATCH v3 21/25] dax/region: Create resources on sparse DAX regions ira.weiny
2024-08-18 11:38   ` Markus Elfring
2024-08-19 23:30   ` Dave Jiang
2024-08-23 14:28     ` Ira Weiny
2024-08-27 14:12   ` Jonathan Cameron
2024-08-29 21:54     ` Ira Weiny
2024-08-16 14:44 ` [PATCH v3 22/25] cxl/region: Read existing extents on region creation ira.weiny
2024-08-20  0:06   ` Dave Jiang
2024-08-23 21:31     ` Ira Weiny
2024-08-27 14:19   ` Jonathan Cameron
2024-09-05 19:35   ` Fan Ni
2024-08-16 14:44 ` [PATCH v3 23/25] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
2024-08-20 22:54   ` Dave Jiang
2024-08-26 18:02     ` Ira Weiny
2024-08-27 14:20   ` Jonathan Cameron
2024-09-05 19:38   ` Fan Ni
2024-08-16 14:44 ` [PATCH v3 24/25] tools/testing/cxl: Make event logs dynamic Ira Weiny
2024-08-20 23:30   ` Dave Jiang
2024-08-27 14:32   ` Jonathan Cameron
2024-09-09 13:57     ` Ira Weiny
2024-08-16 14:44 ` [PATCH v3 25/25] tools/testing/cxl: Add DC Regions to mock mem data Ira Weiny
2024-08-27 14:39   ` Jonathan Cameron
2024-09-09 14:08     ` Ira Weiny

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).