* [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD)
From: Ira Weiny @ 2024-12-11 3:42 UTC
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming, Jonathan Cameron
A git tree of this series can be found here:
https://github.com/weiny2/linux-kernel/tree/dcd-v4-2024-12-10
This version is rebased on top of the 6.13 cleanups.
Series info
===========
This series has 2 parts:
        Patches 1-19: Core DCD support
        Patches 20-21: cxl_test support
Background
==========
A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
device that allows memory capacity within a region to change
dynamically without the need for resetting the device, reconfiguring
HDM decoders, or reconfiguring software DAX regions.
One of the biggest use cases for Dynamic Capacity is to allow hosts to
share memory dynamically within a data center without increasing the
per-host attached memory.
The general flow for the addition or removal of memory is to have an
orchestrator coordinate the use of the memory.  Generally there are 5
actors in such a system: the Orchestrator, the Fabric Manager, the
Logical Device, the Host Kernel, and a Host User.
Typical workflows are shown below.
 Orchestrator     FM         Device    Host Kernel     Host User
      |            |           |            |              |
      |-------------- Create region ---------------------->|
      |            |           |            |              |
      |            |           |            |<-- Create ---|
      |            |           |            |    Region    |
      |<------------- Signal done -------------------------|
      |            |           |            |              |
      |-- Add ---->|-- Add --->|--- Add --->|              |
      |  Capacity  |  Extent   |   Extent   |              |
      |            |           |            |              |
      |            |<- Accept -|<- Accept --|              |
      |            |  Extent   |   Extent   |              |
      |            |           |            |<- Create --->|
      |            |           |            |   DAX dev    |-- Use memory
      |            |           |            |              |   |
      |            |           |            |              |   |
      |            |           |            |<- Release ---| <-+
      |            |           |            |   DAX dev    |
      |            |           |            |              |
      |<------------- Signal done -------------------------|
      |            |           |            |              |
      |-- Remove ->|- Release->|- Release ->|              |
      |  Capacity  |  Extent   |   Extent   |              |
      |            |           |            |              |
      |            |<- Release-|<- Release -|              |
      |            |  Extent   |   Extent   |              |
      |            |           |            |              |
      |-- Add ---->|-- Add --->|--- Add --->|              |
      |  Capacity  |  Extent   |   Extent   |              |
      |            |           |            |              |
      |            |<- Accept -|<- Accept --|              |
      |            |  Extent   |   Extent   |              |
      |            |           |            |<- Create ----|
      |            |           |            |   DAX dev    |-- Use memory
      |            |           |            |              |   |
      |            |           |            |<- Release ---| <-+
      |            |           |            |   DAX dev    |
      |<------------- Signal done -------------------------|
      |            |           |            |              |
      |-- Remove ->|- Release->|- Release ->|              |
      |  Capacity  |  Extent   |   Extent   |              |
      |            |           |            |              |
      |            |<- Release-|<- Release -|              |
      |            |  Extent   |   Extent   |              |
      |            |           |            |              |
      |-- Add ---->|-- Add --->|--- Add --->|              |
      |  Capacity  |  Extent   |   Extent   |              |
      |            |           |            |<- Create ----|
      |            |           |            |   DAX dev    |-- Use memory
      |            |           |            |              |   |
      |-- Remove ->|- Release->|- Release ->|              |   |
      |  Capacity  |  Extent   |   Extent   |              |   |
      |            |           |            |              |   |
      |            |           |   (Release Ignored)       |   |
      |            |           |            |              |   |
      |            |           |            |<- Release ---| <-+
      |            |           |            |   DAX dev    |
      |<------------- Signal done -------------------------|
      |            |           |            |              |
      |            |- Release->|- Release ->|              |
      |            |  Extent   |   Extent   |              |
      |            |           |            |              |
      |            |<- Release-|<- Release -|              |
      |            |  Extent   |   Extent   |              |
      |            |           |            |<- Destroy ---|
      |            |           |            |    Region    |
      |            |           |            |              |
Implementation
==============
The series still requires the creation of regions and DAX devices to be
closely synchronized with the Orchestrator and Fabric Manager. The host
kernel will reject extents if a region is not yet created. It also
ignores extent release if memory is in use (DAX device created). These
synchronizations are not anticipated to be an issue with real
applications.
In order to allow capacity to be added and removed, a new concept of a
sparse DAX region is introduced.  A sparse DAX region may have 0 or
more bytes of available space.  The total space available depends on
the number and size of the extents which have been added.
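As a rough illustration (the type and names here are hypothetical and
not the series' code; kernel types assumed), the capacity of a sparse
region is simply extent driven:

  /* Hypothetical sketch only: sparse region capacity is extent driven */
  struct sketch_extent {
          u64 offset;     /* DPA offset of the extent */
          u64 len;        /* bytes surfaced by the extent */
  };

  static u64 sketch_sparse_region_size(struct sketch_extent *exts, int nr)
  {
          u64 total = 0;

          for (int i = 0; i < nr; i++)
                  total += exts[i].len;
          return total;   /* 0 when no extents have been surfaced */
  }
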
Initially it is anticipated that users of the memory will carefully
coordinate the surfacing of additional capacity with the creation of DAX
devices which use that capacity. Therefore, the allocation of the
memory to DAX devices does not allow for specific associations between
DAX device and extent. This keeps allocations very similar to existing
DAX region behavior.
To keep DAX memory allocation aligned with the existing DAX devices,
which do not have tags, extents are not allowed to have tags.  Future
support for tags is planned.
Great care was taken to keep the extent tracking simple.  Some xarrays
needed to be added, but extra software objects were kept to a minimum.
Region extents continue to be tracked as sub-devices of the DAX region.
This ensures that region destruction cleans up all extent allocations
properly.
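For illustration only, the tracking pattern amounts to something like
the following (a minimal sketch with hypothetical names; the real
objects are the region extent sub-devices added later in the series):

  #include <linux/xarray.h>

  /* Hypothetical sketch: region extents indexed by their start offset */
  static int sketch_track_extent(struct xarray *extents, unsigned long start,
                                 void *region_extent)
  {
          /* xa_insert() returns -EBUSY if an extent already occupies @start */
          return xa_insert(extents, start, region_extent, GFP_KERNEL);
  }
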
The major functionality of this series includes:
- Getting the dynamic capacity (DC) configuration information from CXL
  devices
- Configuring the DC partitions reported by hardware
- Enhancing the CXL and DAX regions for dynamic capacity support
      a. Maintain a logical separation between hardware extents and
         software managed region extents.  This provides an
         abstraction between the layers and should allow for
         interleaving in the future
- Get hardware extent lists for endpoint decoders upon
  region creation.
- Adjust extent/region memory available on the following events.
      a. Add capacity events
      b. Release capacity events
- Host response for add capacity (see the sketch after this list)
      a. Do not accept the extent if the region does not exist
         or an error occurs realizing the extent
      b. If the region does exist, realize a DAX region extent
         with a 1:1 mapping (no interleave yet)
      c. Support the event 'more' bit by processing a list of
         extents marked with the more bit together before setting
         up a response.
- Host response for remove capacity
      a. If no DAX device references the extent, release the extent
      b. If a reference does exist, ignore the request.
         (Require the FM to issue the release again.)
- Modify DAX device creation/resize to account for extents within a
  sparse DAX region
- Trace Dynamic Capacity events for debugging
- Add cxl-test infrastructure to allow for faster unit testing
  (See the new ndctl branch for the cxl-dcd.sh test[1])
- Only support 0 value extent tags
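A condensed sketch of the add/remove response policy referenced above
(all helper names here are hypothetical; the real implementation is in
the extent patches of this series):

  static int sketch_realize_extent(struct cxl_region *cxlr, void *extent);
  static bool sketch_extent_busy(void *region_extent);

  /* Hypothetical sketch of the host response for add capacity */
  static bool sketch_accept_extent(struct cxl_region *cxlr, void *extent)
  {
          /* a. reject when no region exists or realization fails */
          if (!cxlr || sketch_realize_extent(cxlr, extent))
                  return false;
          /* b. realized with a 1:1 mapping (no interleave yet); accept */
          return true;
  }

  /* Hypothetical sketch of the host response for remove capacity */
  static bool sketch_release_extent(void *region_extent)
  {
          /* Ignore the release while a DAX device still uses the extent;
           * the FM must issue the release again later. */
          if (sketch_extent_busy(region_extent))
                  return false;
          return true;
  }
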
Fan Ni's upstream QEMU DCD emulation was used for testing.
Remaining work:
        1) Allow mapping to specific extents (perhaps based on
           label/tag)
           1a) Devise region size reporting based on tags
        2) Interleave support
Possible additional work depending on requirements:
        1) Accept a new extent which extends (but overlaps) an existing
           extent(s)
        2) Release extents when DAX devices are released if a release
           was previously seen from the device
        3) Rework DAX device interfaces; memfd has been explored a bit
[1] https://github.com/weiny2/ndctl/tree/dcd-region2-2024-12-11
---
Changes in v8:
- iweiny: rebase off of 6.13
- iweiny: Use %pra which landed in 6.13
- Link to v7: https://patch.msgid.link/20241107-dcd-type2-upstream-v7-0-56a84e66bc36@intel.com
---
Ira Weiny (21):
cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
cxl/mem: Read dynamic capacity configuration from the device
cxl/core: Separate region mode from decoder mode
cxl/region: Add dynamic capacity decoder and region modes
cxl/hdm: Add dynamic capacity size support to endpoint decoders
cxl/cdat: Gather DSMAS data for DCD regions
cxl/mem: Expose DCD partition capabilities in sysfs
cxl/port: Add endpoint decoder DC mode support to sysfs
cxl/region: Add sparse DAX region support
cxl/events: Split event msgnum configuration from irq setup
cxl/pci: Factor out interrupt policy check
cxl/mem: Configure dynamic capacity interrupts
cxl/core: Return endpoint decoder information from region search
cxl/extent: Process DCD events and realize region extents
cxl/region/extent: Expose region extent information in sysfs
dax/bus: Factor out dev dax resize logic
dax/region: Create resources on sparse DAX regions
cxl/region: Read existing extents on region creation
cxl/mem: Trace Dynamic capacity Event Record
tools/testing/cxl: Make event logs dynamic
tools/testing/cxl: Add DC Regions to mock mem data
 Documentation/ABI/testing/sysfs-bus-cxl |  125 +++-
 drivers/cxl/core/Makefile               |    2 +-
 drivers/cxl/core/cdat.c                 |   42 +-
 drivers/cxl/core/core.h                 |   34 +-
 drivers/cxl/core/extent.c               |  494 +++++++++++++++
 drivers/cxl/core/hdm.c                  |  210 ++++++-
 drivers/cxl/core/mbox.c                 |  603 +++++++++++++++++-
 drivers/cxl/core/memdev.c               |  128 +++-
 drivers/cxl/core/port.c                 |   19 +-
 drivers/cxl/core/region.c               |  165 ++++-
 drivers/cxl/core/trace.h                |   65 ++
 drivers/cxl/cxl.h                       |  122 +++-
 drivers/cxl/cxlmem.h                    |  132 +++-
 drivers/cxl/pci.c                       |  116 +++-
 drivers/dax/bus.c                       |  356 +++++++++--
 drivers/dax/bus.h                       |    4 +-
 drivers/dax/cxl.c                       |   71 ++-
 drivers/dax/dax-private.h               |   40 ++
 drivers/dax/hmem/hmem.c                 |    2 +-
 drivers/dax/pmem.c                      |    2 +-
 include/cxl/event.h                     |   32 +
 include/linux/ioport.h                  |    3 +
 tools/testing/cxl/Kbuild                |    3 +-
 tools/testing/cxl/test/mem.c            | 1019 +++++++++++++++++++++++++++----
 24 files changed, 3499 insertions(+), 290 deletions(-)
---
base-commit: 7cb1b466315004af98f6ba6c2546bb713ca3c237
change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
Best regards,
--
Ira Weiny <ira.weiny@intel.com>
* [PATCH v8 01/21] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
From: Ira Weiny @ 2024-12-11 3:42 UTC
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
Per the CXL 3.1 specification, software must check the Command Effects
Log (CEL) for dynamic capacity command support.
Detect support for the DCD commands while reading the CEL, including:
        Get DC Config
        Get DC Extent List
        Add DC Response
        Release DC
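For reference, a minimal sketch of how a caller could consume the
resulting bitmap (a sketch only; the cxl_dcd_supported() helper added
in the next patch keys support off just the Get DC Config bit):

  /* Sketch: gating DCD paths on the CEL-derived command bitmap */
  static bool sketch_dcd_cmds_present(struct cxl_memdev_state *mds)
  {
          return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds) &&
                 test_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds) &&
                 test_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds) &&
                 test_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
  }
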
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/cxl/core/mbox.c | 33 +++++++++++++++++++++++++++++++++
 drivers/cxl/cxlmem.h    | 15 +++++++++++++++
 2 files changed, 48 insertions(+)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 548564c770c02c0a4571a00ae3f6de8f63183183..599934d066518341eb6ea9fc3319cd7098cbc2f3 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -164,6 +164,34 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
}
}
+static bool cxl_is_dcd_command(u16 opcode)
+{
+#define CXL_MBOX_OP_DCD_CMDS 0x48
+
+ return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
+}
+
+static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
+ u16 opcode)
+{
+ switch (opcode) {
+ case CXL_MBOX_OP_GET_DC_CONFIG:
+ set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
+ break;
+ case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+ set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
+ break;
+ case CXL_MBOX_OP_ADD_DC_RESPONSE:
+ set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
+ break;
+ case CXL_MBOX_OP_RELEASE_DC:
+ set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
+ break;
+ default:
+ break;
+ }
+}
+
static bool cxl_is_poison_command(u16 opcode)
{
#define CXL_MBOX_OP_POISON_CMDS 0x43
@@ -751,6 +779,11 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
enabled++;
}
+ if (cxl_is_dcd_command(opcode)) {
+ cxl_set_dcd_cmd_enabled(mds, opcode);
+ enabled++;
+ }
+
dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
enabled ? "enabled" : "unsupported by driver");
}
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 2a25d1957ddb9772b8d4dca92534ba76a909f8b3..e8907c403edbd83c8a36b8d013c6bc3391207ee6 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -239,6 +239,15 @@ struct cxl_event_state {
struct mutex log_lock;
};
+/* Device enabled DCD commands */
+enum dcd_cmd_enabled_bits {
+ CXL_DCD_ENABLED_GET_CONFIG,
+ CXL_DCD_ENABLED_GET_EXTENT_LIST,
+ CXL_DCD_ENABLED_ADD_RESPONSE,
+ CXL_DCD_ENABLED_RELEASE,
+ CXL_DCD_ENABLED_MAX
+};
+
/* Device enabled poison commands */
enum poison_cmd_enabled_bits {
CXL_POISON_ENABLED_LIST,
@@ -461,6 +470,7 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
* @lsa_size: Size of Label Storage Area
* (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
* @firmware_version: Firmware version for the memory device.
+ * @dcd_cmds: List of DCD commands implemented by memory device
* @enabled_cmds: Hardware commands found enabled in CEL.
* @exclusive_cmds: Commands that are kernel-internal only
* @total_bytes: sum of all possible capacities
@@ -485,6 +495,7 @@ struct cxl_memdev_state {
struct cxl_dev_state cxlds;
size_t lsa_size;
char firmware_version[0x10];
+ DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
u64 total_bytes;
@@ -554,6 +565,10 @@ enum cxl_opcode {
CXL_MBOX_OP_UNLOCK = 0x4503,
CXL_MBOX_OP_FREEZE_SECURITY = 0x4504,
CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE = 0x4505,
+ CXL_MBOX_OP_GET_DC_CONFIG = 0x4800,
+ CXL_MBOX_OP_GET_DC_EXTENT_LIST = 0x4801,
+ CXL_MBOX_OP_ADD_DC_RESPONSE = 0x4802,
+ CXL_MBOX_OP_RELEASE_DC = 0x4803,
CXL_MBOX_OP_MAX = 0x10000
};
--
2.47.1
* [PATCH v8 02/21] cxl/mem: Read dynamic capacity configuration from the device
From: Ira Weiny @ 2024-12-11 3:42 UTC
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
Devices which optionally support Dynamic Capacity (DC) are configured
via mailbox commands.  CXL 3.1 requires the host to issue the Get DC
Configuration command in order to properly configure DCDs.  Without the
Get DC Configuration command, DCD can't be supported.
Implement the DC mailbox commands as specified in CXL 3.1 section
8.2.9.9.9 (opcodes 48XXh) to read and store the DCD configuration
information. Disable DCD if DCD is not supported. Leverage the Get DC
Configuration command supported bit to indicate if DCD is supported.
Linux has no use for the trailing fields of the Get Dynamic Capacity
Configuration Output Payload (total number of supported extents, number
of available extents, total number of supported tags, and number of
available tags).  Avoid defining those fields so that the returned
region configurations can be a counted flexible C array.
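For illustration, consumers walk the returned regions through the
counted flexible array, which is why nothing may follow it in the
struct (a sketch based on the definitions in this patch):

  /* Sketch: walking the counted flexible array returned by the device */
  static void sketch_walk_dc_regions(struct cxl_mbox_get_dc_config_out *dc_resp)
  {
          for (int i = 0; i < dc_resp->regions_returned; i++) {
                  struct cxl_dc_region_config *cfg = &dc_resp->region[i];

                  /* __counted_by(regions_returned) bounds this access */
                  pr_debug("DC region %d base %#llx\n", i,
                           le64_to_cpu(cfg->region_base));
          }
  }
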
Based on an original patch by Navneet Singh.
Cc: Li Ming <ming.li@zohomail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: linux-hardening@vger.kernel.org
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: fix EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify)]
[iweiny: limit variable scope in cxl_dev_dynamic_capacity_identify]
---
 drivers/cxl/core/mbox.c | 166 +++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/cxl/cxlmem.h    |  64 ++++++++++++++++++-
 drivers/cxl/pci.c       |   4 ++
 3 files changed, 232 insertions(+), 2 deletions(-)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 599934d066518341eb6ea9fc3319cd7098cbc2f3..a4cf9fbb1edfa275e8566bfacea03a49d68f9319 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1168,7 +1168,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
if (rc < 0)
return rc;
- mds->total_bytes =
+ mds->static_bytes =
le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
mds->volatile_only_bytes =
le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
@@ -1274,6 +1274,154 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
return rc;
}
+static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, u8 index,
+ struct cxl_dc_region_config *region_config)
+{
+ struct cxl_dc_region_info *dcr = &mds->dc_region[index];
+ struct device *dev = mds->cxlds.dev;
+
+ dcr->base = le64_to_cpu(region_config->region_base);
+ dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
+ dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
+ dcr->len = le64_to_cpu(region_config->region_length);
+ dcr->blk_size = le64_to_cpu(region_config->region_block_size);
+ dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
+ dcr->flags = region_config->flags;
+ snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
+
+ /* Check regions are in increasing DPA order */
+ if (index > 0) {
+ struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
+
+ if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
+ dev_err(dev,
+ "DPA ordering violation for DC region %d and %d\n",
+ index - 1, index);
+ return -EINVAL;
+ }
+ }
+
+ if (!IS_ALIGNED(dcr->base, SZ_256M) ||
+ !IS_ALIGNED(dcr->base, dcr->blk_size)) {
+ dev_err(dev, "DC region %d invalid base %#llx blk size %#llx\n",
+ index, dcr->base, dcr->blk_size);
+ return -EINVAL;
+ }
+
+ if (dcr->decode_len == 0 || dcr->len == 0 || dcr->decode_len < dcr->len ||
+ !IS_ALIGNED(dcr->len, dcr->blk_size)) {
+ dev_err(dev, "DC region %d invalid length; decode %#llx len %#llx blk size %#llx\n",
+ index, dcr->decode_len, dcr->len, dcr->blk_size);
+ return -EINVAL;
+ }
+
+ if (dcr->blk_size == 0 || dcr->blk_size % CXL_DCD_BLOCK_LINE_SIZE ||
+ !is_power_of_2(dcr->blk_size)) {
+ dev_err(dev, "DC region %d invalid block size; %#llx\n",
+ index, dcr->blk_size);
+ return -EINVAL;
+ }
+
+ dev_dbg(dev,
+ "DC region %s base %#llx length %#llx block size %#llx\n",
+ dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
+
+ return 0;
+}
+
+/* Returns the number of regions in dc_resp or -ERRNO */
+static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
+ struct cxl_mbox_get_dc_config_out *dc_resp,
+ size_t dc_resp_size)
+{
+ struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
+ .region_count = CXL_MAX_DC_REGION,
+ .start_region_index = start_region,
+ };
+ struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
+ .payload_in = &get_dc,
+ .size_in = sizeof(get_dc),
+ .size_out = dc_resp_size,
+ .payload_out = dc_resp,
+ .min_out = 1,
+ };
+ struct device *dev = mds->cxlds.dev;
+ int rc;
+
+ rc = cxl_internal_send_cmd(&mds->cxlds.cxl_mbox, &mbox_cmd);
+ if (rc < 0)
+ return rc;
+
+ dev_dbg(dev, "Read %d/%d DC regions\n",
+ dc_resp->regions_returned, dc_resp->avail_region_count);
+ return dc_resp->regions_returned;
+}
+
+/**
+ * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
+ * information from the device.
+ * @mds: The memory device state
+ *
+ * Read Dynamic Capacity information from the device and populate the state
+ * structures for later use.
+ *
+ * Return: 0 if identify was executed successfully, -ERRNO on error.
+ */
+int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
+{
+ size_t dc_resp_size = mds->cxlds.cxl_mbox.payload_size;
+ struct device *dev = mds->cxlds.dev;
+ u8 start_region;
+
+ if (!cxl_dcd_supported(mds)) {
+ dev_dbg(dev, "DCD not supported\n");
+ return 0;
+ }
+
+ struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
+ kvmalloc(dc_resp_size, GFP_KERNEL);
+ if (!dc_resp)
+ return -ENOMEM;
+
+ start_region = 0;
+ do {
+ int rc, i, j;
+
+ rc = cxl_get_dc_config(mds, start_region, dc_resp, dc_resp_size);
+ if (rc < 0) {
+ dev_err(dev, "Failed to get DC config: %d\n", rc);
+ return rc;
+ }
+
+ mds->nr_dc_region += rc;
+
+ if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
+ dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
+ mds->nr_dc_region);
+ return -EINVAL;
+ }
+
+ for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
+ rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
+ if (rc)
+ return rc;
+ }
+
+ start_region = mds->nr_dc_region;
+
+ } while (mds->nr_dc_region < dc_resp->avail_region_count);
+
+ mds->dynamic_bytes =
+ mds->dc_region[mds->nr_dc_region - 1].base +
+ mds->dc_region[mds->nr_dc_region - 1].decode_len -
+ mds->dc_region[0].base;
+ dev_dbg(dev, "Total dynamic range: %#llx\n", mds->dynamic_bytes);
+
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, "CXL");
+
static int add_dpa_res(struct device *dev, struct resource *parent,
struct resource *res, resource_size_t start,
resource_size_t size, const char *type)
@@ -1304,8 +1452,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
{
struct cxl_dev_state *cxlds = &mds->cxlds;
struct device *dev = cxlds->dev;
+ size_t untenanted_mem;
int rc;
+ mds->total_bytes = mds->static_bytes;
+ if (mds->nr_dc_region) {
+ untenanted_mem = mds->dc_region[0].base - mds->static_bytes;
+ mds->total_bytes += untenanted_mem + mds->dynamic_bytes;
+ }
+
if (!cxlds->media_ready) {
cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
cxlds->ram_res = DEFINE_RES_MEM(0, 0);
@@ -1315,6 +1470,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
+ for (int i = 0; i < mds->nr_dc_region; i++) {
+ struct cxl_dc_region_info *dcr = &mds->dc_region[i];
+
+ rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
+ dcr->base, dcr->decode_len, dcr->name);
+ if (rc)
+ return rc;
+ }
+
if (mds->partition_align_bytes == 0) {
rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
mds->volatile_only_bytes, "ram");
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index e8907c403edbd83c8a36b8d013c6bc3391207ee6..05a0718aea73b3b2a02c608bae198eac7c462523 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -403,6 +403,7 @@ enum cxl_devtype {
CXL_DEVTYPE_CLASSMEM,
};
+#define CXL_MAX_DC_REGION 8
/**
* struct cxl_dpa_perf - DPA performance property entry
* @dpa_range: range for DPA address
@@ -434,6 +435,8 @@ struct cxl_dpa_perf {
* @dpa_res: Overall DPA resource tree for the device
* @pmem_res: Active Persistent memory capacity configuration
* @ram_res: Active Volatile memory capacity configuration
+ * @dc_res: Active Dynamic Capacity memory configuration for each possible
+ * region
* @serial: PCIe Device Serial Number
* @type: Generic Memory Class device or Vendor Specific Memory device
* @cxl_mbox: CXL mailbox context
@@ -449,11 +452,23 @@ struct cxl_dev_state {
struct resource dpa_res;
struct resource pmem_res;
struct resource ram_res;
+ struct resource dc_res[CXL_MAX_DC_REGION];
u64 serial;
enum cxl_devtype type;
struct cxl_mailbox cxl_mbox;
};
+#define CXL_DC_REGION_STRLEN 8
+struct cxl_dc_region_info {
+ u64 base;
+ u64 decode_len;
+ u64 len;
+ u64 blk_size;
+ u32 dsmad_handle;
+ u8 flags;
+ u8 name[CXL_DC_REGION_STRLEN];
+};
+
static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
{
return dev_get_drvdata(cxl_mbox->host);
@@ -473,7 +488,9 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
* @dcd_cmds: List of DCD commands implemented by memory device
* @enabled_cmds: Hardware commands found enabled in CEL.
* @exclusive_cmds: Commands that are kernel-internal only
- * @total_bytes: sum of all possible capacities
+ * @total_bytes: length of all possible capacities
+ * @static_bytes: length of possible static RAM and PMEM partitions
+ * @dynamic_bytes: length of possible DC partitions (DC Regions)
* @volatile_only_bytes: hard volatile capacity
* @persistent_only_bytes: hard persistent capacity
* @partition_align_bytes: alignment size for partition-able capacity
@@ -483,6 +500,8 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
* @next_persistent_bytes: persistent capacity change pending device reset
* @ram_perf: performance data entry matched to RAM partition
* @pmem_perf: performance data entry matched to PMEM partition
+ * @nr_dc_region: number of DC regions implemented in the memory device
+ * @dc_region: array containing info about the DC regions
* @event: event log driver state
* @poison: poison driver state info
* @security: security driver state info
@@ -499,6 +518,8 @@ struct cxl_memdev_state {
DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
u64 total_bytes;
+ u64 static_bytes;
+ u64 dynamic_bytes;
u64 volatile_only_bytes;
u64 persistent_only_bytes;
u64 partition_align_bytes;
@@ -510,6 +531,9 @@ struct cxl_memdev_state {
struct cxl_dpa_perf ram_perf;
struct cxl_dpa_perf pmem_perf;
+ u8 nr_dc_region;
+ struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
+
struct cxl_event_state event;
struct cxl_poison_state poison;
struct cxl_security_state security;
@@ -708,6 +732,32 @@ struct cxl_mbox_set_partition_info {
#define CXL_SET_PARTITION_IMMEDIATE_FLAG BIT(0)
+/* See CXL 3.1 Table 8-163 get dynamic capacity config Input Payload */
+struct cxl_mbox_get_dc_config_in {
+ u8 region_count;
+ u8 start_region_index;
+} __packed;
+
+/* See CXL 3.1 Table 8-164 get dynamic capacity config Output Payload */
+struct cxl_mbox_get_dc_config_out {
+ u8 avail_region_count;
+ u8 regions_returned;
+ u8 rsvd[6];
+ /* See CXL 3.1 Table 8-165 */
+ struct cxl_dc_region_config {
+ __le64 region_base;
+ __le64 region_decode_length;
+ __le64 region_length;
+ __le64 region_block_size;
+ __le32 region_dsmad_handle;
+ u8 flags;
+ u8 rsvd[3];
+ } __packed region[] __counted_by(regions_returned);
+ /* Trailing fields unused */
+} __packed;
+#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
+#define CXL_DCD_BLOCK_LINE_SIZE 0x40
+
/* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
struct cxl_mbox_set_timestamp_in {
__le64 timestamp;
@@ -831,6 +881,7 @@ enum {
int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
struct cxl_mbox_cmd *cmd);
int cxl_dev_state_identify(struct cxl_memdev_state *mds);
+int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
int cxl_await_media_ready(struct cxl_dev_state *cxlds);
int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
@@ -844,6 +895,17 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
enum cxl_event_log_type type,
enum cxl_event_type event_type,
const uuid_t *uuid, union cxl_event *evt);
+
+static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
+{
+ return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
+}
+
+static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
+{
+ clear_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
+}
+
int cxl_set_timestamp(struct cxl_memdev_state *mds);
int cxl_poison_state_init(struct cxl_memdev_state *mds);
int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 0241d1d7133a4b9c3fe3fddfdc0bcc9cf807ee11..5082625a7b3f51a84f894a3265e922e51b794b68 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -989,6 +989,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
if (rc)
return rc;
+ rc = cxl_dev_dynamic_capacity_identify(mds);
+ if (rc)
+ cxl_disable_dcd(mds);
+
rc = cxl_mem_create_range_info(mds);
if (rc)
return rc;
--
2.47.1
* [PATCH v8 03/21] cxl/core: Separate region mode from decoder mode
From: Ira Weiny @ 2024-12-11 3:42 UTC
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Jonathan Cameron, Li Ming
Until now, region modes and decoder modes were equivalent in that both
were either PMEM or RAM.  The addition of Dynamic Capacity defines up
to 8 DC partitions per device.
The region mode is thus no longer equivalent to the endpoint decoder
mode.  IOW the endpoint decoders may have modes of DC0-DC7 while the
region mode is simply DC.
Define a new region mode enumeration which applies to regions
separately from the decoder mode.  Adjust the code to process these
modes independently.
There is no equivalent of the decoder mode 'dead' among region modes.
Avoid constructing regions with decoders which have been flagged as
dead.
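A sketch of the resulting check (mirroring cxl_modes_compatible() in
the diff below; the DC cases arrive in the next patch):

  /* Sketch: compatibility is now an explicit mapping, not equality */
  static bool sketch_modes_compatible(enum cxl_region_mode rmode,
                                      enum cxl_decoder_mode dmode)
  {
          return (rmode == CXL_REGION_RAM && dmode == CXL_DECODER_RAM) ||
                 (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM);
  }
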
Suggested-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/cxl/core/cdat.c   |  6 ++--
 drivers/cxl/core/region.c | 77 ++++++++++++++++++++++++++++++++++-------------
 drivers/cxl/cxl.h         | 26 ++++++++++++++--
 3 files changed, 83 insertions(+), 26 deletions(-)
diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
index 8153f8d83a164a20b948517bb3f09e278c80d681..401a19359aee77167fb6fe9e3d8fd5e9a077ab88 100644
--- a/drivers/cxl/core/cdat.c
+++ b/drivers/cxl/core/cdat.c
@@ -571,17 +571,17 @@ static bool dpa_perf_contains(struct cxl_dpa_perf *perf,
}
static struct cxl_dpa_perf *cxled_get_dpa_perf(struct cxl_endpoint_decoder *cxled,
- enum cxl_decoder_mode mode)
+ enum cxl_region_mode mode)
{
struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
struct cxl_dpa_perf *perf;
switch (mode) {
- case CXL_DECODER_RAM:
+ case CXL_REGION_RAM:
perf = &mds->ram_perf;
break;
- case CXL_DECODER_PMEM:
+ case CXL_REGION_PMEM:
perf = &mds->pmem_perf;
break;
default:
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index d778996507984a759bbe84e7acac3774e0c7af98..1e9f8f2b4e28294fda5199bd1001225eec041ec0 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -144,7 +144,7 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
rc = down_read_interruptible(&cxl_region_rwsem);
if (rc)
return rc;
- if (cxlr->mode != CXL_DECODER_PMEM)
+ if (cxlr->mode != CXL_REGION_PMEM)
rc = sysfs_emit(buf, "\n");
else
rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
@@ -441,7 +441,7 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
* Support tooling that expects to find a 'uuid' attribute for all
* regions regardless of mode.
*/
- if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
+ if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_REGION_PMEM)
return 0444;
return a->mode;
}
@@ -604,7 +604,7 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
{
struct cxl_region *cxlr = to_cxl_region(dev);
- return sysfs_emit(buf, "%s\n", cxl_decoder_mode_name(cxlr->mode));
+ return sysfs_emit(buf, "%s\n", cxl_region_mode_name(cxlr->mode));
}
static DEVICE_ATTR_RO(mode);
@@ -630,7 +630,7 @@ static int alloc_hpa(struct cxl_region *cxlr, resource_size_t size)
/* ways, granularity and uuid (if PMEM) need to be set before HPA */
if (!p->interleave_ways || !p->interleave_granularity ||
- (cxlr->mode == CXL_DECODER_PMEM && uuid_is_null(&p->uuid)))
+ (cxlr->mode == CXL_REGION_PMEM && uuid_is_null(&p->uuid)))
return -ENXIO;
div64_u64_rem(size, (u64)SZ_256M * p->interleave_ways, &remainder);
@@ -1870,6 +1870,17 @@ static int cxl_region_sort_targets(struct cxl_region *cxlr)
return rc;
}
+static bool cxl_modes_compatible(enum cxl_region_mode rmode,
+ enum cxl_decoder_mode dmode)
+{
+ if (rmode == CXL_REGION_RAM && dmode == CXL_DECODER_RAM)
+ return true;
+ if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
+ return true;
+
+ return false;
+}
+
static int cxl_region_attach(struct cxl_region *cxlr,
struct cxl_endpoint_decoder *cxled, int pos)
{
@@ -1889,9 +1900,11 @@ static int cxl_region_attach(struct cxl_region *cxlr,
return rc;
}
- if (cxled->mode != cxlr->mode) {
- dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
- dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
+ if (!cxl_modes_compatible(cxlr->mode, cxled->mode)) {
+ dev_dbg(&cxlr->dev, "%s region mode: %s mismatch decoder: %s\n",
+ dev_name(&cxled->cxld.dev),
+ cxl_region_mode_name(cxlr->mode),
+ cxl_decoder_mode_name(cxled->mode));
return -EINVAL;
}
@@ -2447,7 +2460,7 @@ static int cxl_region_calculate_adistance(struct notifier_block *nb,
* devm_cxl_add_region - Adds a region to a decoder
* @cxlrd: root decoder
* @id: memregion id to create, or memregion_free() on failure
- * @mode: mode for the endpoint decoders of this region
+ * @mode: mode of this region
* @type: select whether this is an expander or accelerator (type-2 or type-3)
*
* This is the second step of region initialization. Regions exist within an
@@ -2458,7 +2471,7 @@ static int cxl_region_calculate_adistance(struct notifier_block *nb,
*/
static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
int id,
- enum cxl_decoder_mode mode,
+ enum cxl_region_mode mode,
enum cxl_decoder_type type)
{
struct cxl_port *port = to_cxl_port(cxlrd->cxlsd.cxld.dev.parent);
@@ -2512,16 +2525,17 @@ static ssize_t create_ram_region_show(struct device *dev,
}
static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
- enum cxl_decoder_mode mode, int id)
+ enum cxl_region_mode mode, int id)
{
int rc;
switch (mode) {
- case CXL_DECODER_RAM:
- case CXL_DECODER_PMEM:
+ case CXL_REGION_RAM:
+ case CXL_REGION_PMEM:
break;
default:
- dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
+ dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
+ cxl_region_mode_name(mode));
return ERR_PTR(-EINVAL);
}
@@ -2538,7 +2552,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
}
static ssize_t create_region_store(struct device *dev, const char *buf,
- size_t len, enum cxl_decoder_mode mode)
+ size_t len, enum cxl_region_mode mode)
{
struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
struct cxl_region *cxlr;
@@ -2559,7 +2573,7 @@ static ssize_t create_pmem_region_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t len)
{
- return create_region_store(dev, buf, len, CXL_DECODER_PMEM);
+ return create_region_store(dev, buf, len, CXL_REGION_PMEM);
}
DEVICE_ATTR_RW(create_pmem_region);
@@ -2567,7 +2581,7 @@ static ssize_t create_ram_region_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t len)
{
- return create_region_store(dev, buf, len, CXL_DECODER_RAM);
+ return create_region_store(dev, buf, len, CXL_REGION_RAM);
}
DEVICE_ATTR_RW(create_ram_region);
@@ -3210,6 +3224,22 @@ static int match_region_by_range(struct device *dev, void *data)
return rc;
}
+static enum cxl_region_mode
+cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
+{
+ switch (mode) {
+ case CXL_DECODER_NONE:
+ return CXL_REGION_NONE;
+ case CXL_DECODER_RAM:
+ return CXL_REGION_RAM;
+ case CXL_DECODER_PMEM:
+ return CXL_REGION_PMEM;
+ case CXL_DECODER_MIXED:
+ default:
+ return CXL_REGION_MIXED;
+ }
+}
+
/* Establish an empty region covering the given HPA range */
static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
struct cxl_endpoint_decoder *cxled)
@@ -3218,12 +3248,17 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
struct cxl_port *port = cxlrd_to_port(cxlrd);
struct range *hpa = &cxled->cxld.hpa_range;
struct cxl_region_params *p;
+ enum cxl_region_mode mode;
struct cxl_region *cxlr;
struct resource *res;
int rc;
+ if (cxled->mode == CXL_DECODER_DEAD)
+ return ERR_PTR(-EINVAL);
+
+ mode = cxl_decoder_to_region_mode(cxled->mode);
do {
- cxlr = __create_region(cxlrd, cxled->mode,
+ cxlr = __create_region(cxlrd, mode,
atomic_read(&cxlrd->region_id));
} while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
@@ -3426,9 +3461,9 @@ static int cxl_region_probe(struct device *dev)
return rc;
switch (cxlr->mode) {
- case CXL_DECODER_PMEM:
+ case CXL_REGION_PMEM:
return devm_cxl_add_pmem_region(cxlr);
- case CXL_DECODER_RAM:
+ case CXL_REGION_RAM:
/*
* The region can not be manged by CXL if any portion of
* it is already online as 'System RAM'
@@ -3440,8 +3475,8 @@ static int cxl_region_probe(struct device *dev)
return 0;
return devm_cxl_add_dax_region(cxlr);
default:
- dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
- cxlr->mode);
+ dev_dbg(&cxlr->dev, "unsupported region mode: %s\n",
+ cxl_region_mode_name(cxlr->mode));
return -ENXIO;
}
}
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f6015f24ad3818966571e0aaea2b974f09af5f7c..2c832ef1c62c2d7879ce944b599374b5fc70c3fc 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -397,6 +397,27 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
return "mixed";
}
+enum cxl_region_mode {
+ CXL_REGION_NONE,
+ CXL_REGION_RAM,
+ CXL_REGION_PMEM,
+ CXL_REGION_MIXED,
+};
+
+static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
+{
+ static const char * const names[] = {
+ [CXL_REGION_NONE] = "none",
+ [CXL_REGION_RAM] = "ram",
+ [CXL_REGION_PMEM] = "pmem",
+ [CXL_REGION_MIXED] = "mixed",
+ };
+
+ if (mode >= CXL_REGION_NONE && mode <= CXL_REGION_MIXED)
+ return names[mode];
+ return "mixed";
+}
+
/*
* Track whether this decoder is reserved for region autodiscovery, or
* free for userspace provisioning.
@@ -524,7 +545,8 @@ struct cxl_region_params {
* struct cxl_region - CXL region
* @dev: This region's device
* @id: This region's id. Id is globally unique across all regions
- * @mode: Endpoint decoder allocation / access mode
+ * @mode: Region mode which defines which endpoint decoder modes the region is
+ * compatible with
* @type: Endpoint decoder target type
* @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
* @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
@@ -537,7 +559,7 @@ struct cxl_region_params {
struct cxl_region {
struct device dev;
int id;
- enum cxl_decoder_mode mode;
+ enum cxl_region_mode mode;
enum cxl_decoder_type type;
struct cxl_nvdimm_bridge *cxl_nvb;
struct cxl_pmem_region *cxlr_pmem;
--
2.47.1
* [PATCH v8 04/21] cxl/region: Add dynamic capacity decoder and region modes
From: Ira Weiny @ 2024-12-11 3:42 UTC
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
One or more decoders, each pointing to a Dynamic Capacity (DC)
partition, form a CXL software region.  The region mode reflects the
composition of that entire software region, while the decoder mode
reflects a specific DC partition.  DC partitions are also known as DC
regions per the CXL specification v3.1.
Define the new modes and the helper functions required to make the
association between these new modes.
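For illustration, the association is positional; a DC decoder mode maps
directly to its partition index (a sketch only; dc_mode_to_region_index()
in a later patch of this series does this for real):

  /* Sketch: DC0..DC7 decoder modes map 1:1 to partition indices 0..7 */
  static int sketch_dc_partition_index(enum cxl_decoder_mode mode)
  {
          if (!cxl_decoder_mode_is_dc(mode))
                  return -EINVAL;
          return mode - CXL_DECODER_DC0;
  }
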
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Link: https://lore.kernel.org/all/663922b475e50_d54d72945b@dwillia2-xfh.jf.intel.com.notmuch/ [1]
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/cxl/core/region.c |  4 ++++
 drivers/cxl/cxl.h         | 23 +++++++++++++++++++++++
 2 files changed, 27 insertions(+)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 1e9f8f2b4e28294fda5199bd1001225eec041ec0..6c1a63610f5ba79b1da57cc37df4e2b5b88588a6 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1877,6 +1877,8 @@ static bool cxl_modes_compatible(enum cxl_region_mode rmode,
return true;
if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
return true;
+ if (rmode == CXL_REGION_DC && cxl_decoder_mode_is_dc(dmode))
+ return true;
return false;
}
@@ -3234,6 +3236,8 @@ cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
return CXL_REGION_RAM;
case CXL_DECODER_PMEM:
return CXL_REGION_PMEM;
+ case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
+ return CXL_REGION_DC;
case CXL_DECODER_MIXED:
default:
return CXL_REGION_MIXED;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 2c832ef1c62c2d7879ce944b599374b5fc70c3fc..e61d4e3830a5428f671f5fc61f9e522d51f3fb0c 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -379,6 +379,14 @@ enum cxl_decoder_mode {
CXL_DECODER_NONE,
CXL_DECODER_RAM,
CXL_DECODER_PMEM,
+ CXL_DECODER_DC0,
+ CXL_DECODER_DC1,
+ CXL_DECODER_DC2,
+ CXL_DECODER_DC3,
+ CXL_DECODER_DC4,
+ CXL_DECODER_DC5,
+ CXL_DECODER_DC6,
+ CXL_DECODER_DC7,
CXL_DECODER_MIXED,
CXL_DECODER_DEAD,
};
@@ -389,6 +397,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
[CXL_DECODER_NONE] = "none",
[CXL_DECODER_RAM] = "ram",
[CXL_DECODER_PMEM] = "pmem",
+ [CXL_DECODER_DC0] = "dc0",
+ [CXL_DECODER_DC1] = "dc1",
+ [CXL_DECODER_DC2] = "dc2",
+ [CXL_DECODER_DC3] = "dc3",
+ [CXL_DECODER_DC4] = "dc4",
+ [CXL_DECODER_DC5] = "dc5",
+ [CXL_DECODER_DC6] = "dc6",
+ [CXL_DECODER_DC7] = "dc7",
[CXL_DECODER_MIXED] = "mixed",
};
@@ -397,10 +413,16 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
return "mixed";
}
+static inline bool cxl_decoder_mode_is_dc(enum cxl_decoder_mode mode)
+{
+ return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
+}
+
enum cxl_region_mode {
CXL_REGION_NONE,
CXL_REGION_RAM,
CXL_REGION_PMEM,
+ CXL_REGION_DC,
CXL_REGION_MIXED,
};
@@ -410,6 +432,7 @@ static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
[CXL_REGION_NONE] = "none",
[CXL_REGION_RAM] = "ram",
[CXL_REGION_PMEM] = "pmem",
+ [CXL_REGION_DC] = "dc",
[CXL_REGION_MIXED] = "mixed",
};
--
2.47.1
* [PATCH v8 05/21] cxl/hdm: Add dynamic capacity size support to endpoint decoders
From: Ira Weiny @ 2024-12-11 3:42 UTC
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening
To support Dynamic Capacity Devices (DCD), endpoint decoders will need
to map DC partitions (regions).  In addition to assigning the size of
the DC partition, the decoder must assign any skip value from the
previous decoder.  This must be done within a contiguous DPA space.
Two complications arise with Dynamic Capacity regions which did not
exist with RAM and PMEM partitions.  First, gaps in the DPA space can
exist between and around the DC partitions.  Second, the Linux resource
tree does not allow a resource to be marked across existing nodes
within a tree.
For clarity, below is an example of a 60GB device with 10GB of RAM,
10GB of PMEM, and 10GB for each of 2 DC partitions.  The desired CXL
mapping is 5GB of RAM, 5GB of PMEM, and 5GB of DC1.
                              DPA RANGE
                              (dpa_res)
0GB        10GB       20GB       30GB       40GB       50GB       60GB
|----------|----------|----------|----------|----------|----------|

    RAM        PMEM                  DC0                   DC1
(ram_res)  (pmem_res)            (dc_res[0])           (dc_res[1])
|----------|----------|  <gap>   |----------|  <gap>   |----------|

  RAM         PMEM                                        DC1
|XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|
0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB      60GB
The previous skip resource between RAM and PMEM was always a child of
the RAM resource and fit nicely [see (S) below].  Because of this
simplicity, the skip resource reference was not stored in any CXL
state.  On release, the skip range could be calculated based on the
endpoint decoder's stored values.
Now, when DC1 is being mapped, 4 skip resources must be created as
children: one for the PMEM resource (A), two for the parent DPA
resource (B, D), and one more as a child of the DC0 resource (C).
0GB        10GB       20GB       30GB       40GB       50GB       60GB
|----------|----------|----------|----------|----------|----------|
                           |                     |
|----------|----------|    |     |----------|    |     |----------|
        |          |       |           |         |
       (S)        (A)     (B)         (C)       (D)
        v          v       v           v         v
|XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|
       skip       skip     skip        skip      skip
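Concretely, mapping the 5GB of DC1 in the example above, with prior
allocations ending at 15GB, issues roughly the following reservations
(illustrative values only, error handling elided; cxl_request_skip()
is introduced in the diff below):

  /* Worked example for the diagram above (values illustrative only) */
  cxl_request_skip(cxled, 15ULL * SZ_1G,  5ULL * SZ_1G); /* (A) PMEM remainder */
  cxl_request_skip(cxled, 20ULL * SZ_1G, 10ULL * SZ_1G); /* (B) gap before DC0 */
  cxl_request_skip(cxled, 30ULL * SZ_1G, 10ULL * SZ_1G); /* (C) all of DC0     */
  cxl_request_skip(cxled, 40ULL * SZ_1G, 10ULL * SZ_1G); /* (D) gap before DC1 */
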
Expand the calculation of DPA free space and enhance the logic to
support this more complex skipping.  To track the potentially multiple
skip resources, an xarray is attached to the endpoint decoder.  The
existing algorithm between RAM and PMEM is consolidated within the new
one to streamline the code, even though the result is the storage of a
single skip resource in the xarray.
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/cxl/core/hdm.c  | 194 ++++++++++++++++++++++++++++++++++++++++++----
 drivers/cxl/core/port.c |   2 +
 drivers/cxl/cxl.h       |   2 +
 3 files changed, 182 insertions(+), 16 deletions(-)
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 28edd5822486851912393f066478252b20abc19d..e15241f94d17b774aa5befb37fb453af637a17ce 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -223,6 +223,23 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
}
EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
+static void cxl_skip_release(struct cxl_endpoint_decoder *cxled)
+{
+ struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
+ struct cxl_port *port = cxled_to_port(cxled);
+ struct device *dev = &port->dev;
+ struct resource *res;
+ unsigned long index;
+
+ xa_for_each(&cxled->skip_xa, index, res) {
+ dev_dbg(dev, "decoder%d.%d: releasing skipped space; %pr\n",
+ port->id, cxled->cxld.id, res);
+ __release_region(&cxlds->dpa_res, res->start,
+ resource_size(res));
+ xa_erase(&cxled->skip_xa, index);
+ }
+}
+
/*
* Must be called in a context that synchronizes against this decoder's
* port ->remove() callback (like an endpoint decoder sysfs attribute)
@@ -233,15 +250,11 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
struct cxl_port *port = cxled_to_port(cxled);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
struct resource *res = cxled->dpa_res;
- resource_size_t skip_start;
lockdep_assert_held_write(&cxl_dpa_rwsem);
- /* save @skip_start, before @res is released */
- skip_start = res->start - cxled->skip;
__release_region(&cxlds->dpa_res, res->start, resource_size(res));
- if (cxled->skip)
- __release_region(&cxlds->dpa_res, skip_start, cxled->skip);
+ cxl_skip_release(cxled);
cxled->skip = 0;
cxled->dpa_res = NULL;
put_device(&cxled->cxld.dev);
@@ -268,6 +281,105 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
__cxl_dpa_release(cxled);
}
+static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
+{
+ return mode - CXL_DECODER_DC0;
+}
+
+static int cxl_request_skip(struct cxl_endpoint_decoder *cxled,
+ resource_size_t skip_base, resource_size_t skip_len)
+{
+ struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
+ const char *name = dev_name(&cxled->cxld.dev);
+ struct cxl_port *port = cxled_to_port(cxled);
+ struct resource *dpa_res = &cxlds->dpa_res;
+ struct device *dev = &port->dev;
+ struct resource *res;
+ int rc;
+
+ res = __request_region(dpa_res, skip_base, skip_len, name, 0);
+ if (!res)
+ return -EBUSY;
+
+ rc = xa_insert(&cxled->skip_xa, skip_base, res, GFP_KERNEL);
+ if (rc) {
+ __release_region(dpa_res, skip_base, skip_len);
+ return rc;
+ }
+
+ dev_dbg(dev, "decoder%d.%d: skipped space; %pr\n",
+ port->id, cxled->cxld.id, res);
+ return 0;
+}
+
+static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
+ resource_size_t base, resource_size_t skipped)
+{
+ struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+ struct cxl_port *port = cxled_to_port(cxled);
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+ resource_size_t skip_base = base - skipped;
+ struct device *dev = &port->dev;
+ resource_size_t skip_len = 0;
+ int rc, index;
+
+ if (resource_size(&cxlds->ram_res) && skip_base <= cxlds->ram_res.end) {
+ skip_len = cxlds->ram_res.end - skip_base + 1;
+ rc = cxl_request_skip(cxled, skip_base, skip_len);
+ if (rc)
+ return rc;
+ skip_base += skip_len;
+ }
+
+ if (skip_base == base) {
+ dev_dbg(dev, "skip done ram!\n");
+ return 0;
+ }
+
+ if (resource_size(&cxlds->pmem_res) &&
+ skip_base <= cxlds->pmem_res.end) {
+ skip_len = cxlds->pmem_res.end - skip_base + 1;
+ rc = cxl_request_skip(cxled, skip_base, skip_len);
+ if (rc)
+ return rc;
+ skip_base += skip_len;
+ }
+
+ index = dc_mode_to_region_index(cxled->mode);
+ for (int i = 0; i <= index; i++) {
+ struct resource *dcr = &cxlds->dc_res[i];
+
+ if (skip_base < dcr->start) {
+ skip_len = dcr->start - skip_base;
+ rc = cxl_request_skip(cxled, skip_base, skip_len);
+ if (rc)
+ return rc;
+ skip_base += skip_len;
+ }
+
+ if (skip_base == base) {
+ dev_dbg(dev, "skip done DC region %d!\n", i);
+ break;
+ }
+
+ if (resource_size(dcr) && skip_base <= dcr->end) {
+ if (skip_base > base) {
+ dev_err(dev, "Skip error DC region %d; skip_base %pa; base %pa\n",
+ i, &skip_base, &base);
+ return -ENXIO;
+ }
+
+ skip_len = dcr->end - skip_base + 1;
+ rc = cxl_request_skip(cxled, skip_base, skip_len);
+ if (rc)
+ return rc;
+ skip_base += skip_len;
+ }
+ }
+
+ return 0;
+}
+
static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
resource_size_t base, resource_size_t len,
resource_size_t skipped)
@@ -305,13 +417,12 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
}
if (skipped) {
- res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
- dev_name(&cxled->cxld.dev), 0);
- if (!res) {
- dev_dbg(dev,
- "decoder%d.%d: failed to reserve skipped space\n",
- port->id, cxled->cxld.id);
- return -EBUSY;
+ int rc = cxl_reserve_dpa_skip(cxled, base, skipped);
+
+ if (rc) {
+ dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space; %pa - %pa\n",
+ port->id, cxled->cxld.id, &base, &skipped);
+ return rc;
}
}
res = __request_region(&cxlds->dpa_res, base, len,
@@ -319,14 +430,20 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
if (!res) {
dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
port->id, cxled->cxld.id);
- if (skipped)
- __release_region(&cxlds->dpa_res, base - skipped,
- skipped);
+ cxl_skip_release(cxled);
return -EBUSY;
}
cxled->dpa_res = res;
cxled->skip = skipped;
+ for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
+ int index = dc_mode_to_region_index(mode);
+
+ if (resource_contains(&cxlds->dc_res[index], res)) {
+ cxled->mode = mode;
+ goto success;
+ }
+ }
if (resource_contains(&cxlds->pmem_res, res))
cxled->mode = CXL_DECODER_PMEM;
else if (resource_contains(&cxlds->ram_res, res))
@@ -337,6 +454,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
cxled->mode = CXL_DECODER_MIXED;
}
+success:
+ dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id, cxled->cxld.id,
+ cxled->dpa_res, cxled->mode);
port->hdm_end++;
get_device(&cxled->cxld.dev);
return 0;
@@ -457,8 +577,8 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
{
- struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
resource_size_t free_ram_start, free_pmem_start;
+ struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
struct cxl_port *port = cxled_to_port(cxled);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
struct device *dev = &cxled->cxld.dev;
@@ -515,12 +635,54 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
else
skip_end = start - 1;
skip = skip_end - skip_start + 1;
+ } else if (cxl_decoder_mode_is_dc(cxled->mode)) {
+ int dc_index = dc_mode_to_region_index(cxled->mode);
+
+ for (p = cxlds->dc_res[dc_index].child, last = NULL; p; p = p->sibling)
+ last = p;
+
+ if (last) {
+ /*
+ * Some capacity in this DC partition is already allocated,
+ * that allocation already handled the skip.
+ */
+ start = last->end + 1;
+ skip = 0;
+ } else {
+ /* Calculate skip */
+ resource_size_t skip_start, skip_end;
+
+ start = cxlds->dc_res[dc_index].start;
+
+ if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
+ skip_start = free_ram_start;
+ else
+ skip_start = free_pmem_start;
+ /*
+ * If any dc region is already mapped, then that allocation
+ * already handled the RAM and PMEM skip. Check for DC region
+ * skip.
+ */
+ for (int i = dc_index - 1; i >= 0 ; i--) {
+ if (cxlds->dc_res[i].child) {
+ skip_start = cxlds->dc_res[i].child->end + 1;
+ break;
+ }
+ }
+
+ skip_end = start - 1;
+ skip = skip_end - skip_start + 1;
+ }
+ avail = cxlds->dc_res[dc_index].end - start + 1;
} else {
dev_dbg(dev, "mode not set\n");
rc = -EINVAL;
goto out;
}
+ dev_dbg(dev, "DPA Allocation start: %pa len: %#llx Skip: %pa\n",
+ &start, size, &skip);
+
if (size > avail) {
dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
cxl_decoder_mode_name(cxled->mode), &avail);
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 78a5c2c259829c3e1a7671ff61fdd95c6c43cc82..5c0b8ead315f41c4df14918ad4dcdb269990c5dd 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -419,6 +419,7 @@ static void cxl_endpoint_decoder_release(struct device *dev)
struct cxl_endpoint_decoder *cxled = to_cxl_endpoint_decoder(dev);
__cxl_decoder_release(&cxled->cxld);
+ xa_destroy(&cxled->skip_xa);
kfree(cxled);
}
@@ -1899,6 +1900,7 @@ struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
return ERR_PTR(-ENOMEM);
cxled->pos = -1;
+ xa_init(&cxled->skip_xa);
cxld = &cxled->cxld;
rc = cxl_decoder_init(port, cxld);
if (rc) {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index e61d4e3830a5428f671f5fc61f9e522d51f3fb0c..055c840b6c2856ec77162c3c5f87293f00f8d8ec 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -455,6 +455,7 @@ enum cxl_decoder_state {
* @cxld: base cxl_decoder_object
* @dpa_res: actively claimed DPA span of this decoder
* @skip: offset into @dpa_res where @cxld.hpa_range maps
+ * @skip_xa: array of skipped resources from the previous decoder end
* @mode: which memory type / access-mode-partition this decoder targets
* @state: autodiscovery state
* @pos: interleave position in @cxld.region
@@ -463,6 +464,7 @@ struct cxl_endpoint_decoder {
struct cxl_decoder cxld;
struct resource *dpa_res;
resource_size_t skip;
+ struct xarray skip_xa;
enum cxl_decoder_mode mode;
enum cxl_decoder_state state;
int pos;
--
2.47.1
* [PATCH v8 06/21] cxl/cdat: Gather DSMAS data for DCD regions
From: Ira Weiny @ 2024-12-11 3:42 UTC
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening
Additional DCD region (partition) information is contained in the DSMAS
CDAT tables, including performance, read-only, and shareable attributes.
Match DCD partitions with their DSMAS entries and store the metadata.
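A condensed sketch of the matching done in the cdat.c hunk below (note
the handle width: CXL reports a u32 handle while CDAT carries 8 bits):

  u8 dc_handle = mds->dc_region[i].dsmad_handle & 0xff;

  if (range_contains(&dent->dpa_range, &dc_range)) {
          /* warn on a handle mismatch, but keep the attributes */
          if (dent->handle != dc_handle)
                  dev_warn(dev, "DC Region/DSMAS mis-matched handle\n");
          mds->dc_region[i].shareable = dent->shareable;
          mds->dc_region[i].read_only = dent->read_only;
  }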
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: convert range prints to %pra]
---
drivers/cxl/core/cdat.c | 36 ++++++++++++++++++++++++++++++++++++
drivers/cxl/core/mbox.c | 2 ++
drivers/cxl/cxlmem.h | 3 +++
3 files changed, 41 insertions(+)
diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
index 401a19359aee77167fb6fe9e3d8fd5e9a077ab88..14cdfb82f5ea6d4764b10098a1f009c1614b6f29 100644
--- a/drivers/cxl/core/cdat.c
+++ b/drivers/cxl/core/cdat.c
@@ -17,6 +17,8 @@ struct dsmas_entry {
struct access_coordinate cdat_coord[ACCESS_COORDINATE_MAX];
int entries;
int qos_class;
+ bool shareable;
+ bool read_only;
};
static u32 cdat_normalize(u16 entry, u64 base, u8 type)
@@ -74,6 +76,8 @@ static int cdat_dsmas_handler(union acpi_subtable_headers *header, void *arg,
return -ENOMEM;
dent->handle = dsmas->dsmad_handle;
+ dent->shareable = dsmas->flags & ACPI_CDAT_DSMAS_SHAREABLE;
+ dent->read_only = dsmas->flags & ACPI_CDAT_DSMAS_READ_ONLY;
dent->dpa_range.start = le64_to_cpu((__force __le64)dsmas->dpa_base_address);
dent->dpa_range.end = le64_to_cpu((__force __le64)dsmas->dpa_base_address) +
le64_to_cpu((__force __le64)dsmas->dpa_length) - 1;
@@ -255,6 +259,36 @@ static void update_perf_entry(struct device *dev, struct dsmas_entry *dent,
dent->coord[ACCESS_COORDINATE_CPU].write_latency);
}
+static void update_dcd_perf(struct cxl_dev_state *cxlds,
+ struct dsmas_entry *dent)
+{
+ struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
+ struct device *dev = cxlds->dev;
+
+ for (int i = 0; i < mds->nr_dc_region; i++) {
+ /* CXL defines a u32 handle while CDAT defines u8, ignore upper bits */
+ u8 dc_handle = mds->dc_region[i].dsmad_handle & 0xff;
+
+ if (resource_size(&cxlds->dc_res[i])) {
+ struct range dc_range = {
+ .start = cxlds->dc_res[i].start,
+ .end = cxlds->dc_res[i].end,
+ };
+
+ if (range_contains(&dent->dpa_range, &dc_range)) {
+ if (dent->handle != dc_handle)
+ dev_warn(dev, "DC Region/DSMAS mis-matched handle/range; region %pra (%u); dsmas %pra (%u)\n"
+ " setting DC region attributes regardless\n",
+ &dent->dpa_range, dent->handle, &dc_range, dc_handle);
+
+ mds->dc_region[i].shareable = dent->shareable;
+ mds->dc_region[i].read_only = dent->read_only;
+ update_perf_entry(dev, dent, &mds->dc_perf[i]);
+ }
+ }
+ }
+}
+
static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
struct xarray *dsmas_xa)
{
@@ -278,6 +312,8 @@ static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
else if (resource_size(&cxlds->pmem_res) &&
range_contains(&pmem_range, &dent->dpa_range))
update_perf_entry(dev, dent, &mds->pmem_perf);
+ else if (cxl_dcd_supported(mds))
+ update_dcd_perf(cxlds, dent);
else
dev_dbg(dev, "no partition for dsmas dpa: %pra\n",
&dent->dpa_range);
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index a4cf9fbb1edfa275e8566bfacea03a49d68f9319..56c4389e0031e15bc66056b8a73f4159864f6c4e 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1649,6 +1649,8 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
mds->cxlds.type = CXL_DEVTYPE_CLASSMEM;
mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID;
mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID;
+ for (int i = 0; i < CXL_MAX_DC_REGION; i++)
+ mds->dc_perf[i].qos_class = CXL_QOS_CLASS_INVALID;
return mds;
}
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 05a0718aea73b3b2a02c608bae198eac7c462523..bbdf52ac1d5cb5df82812c13ff50ca7cacfd0db6 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -466,6 +466,8 @@ struct cxl_dc_region_info {
u64 blk_size;
u32 dsmad_handle;
u8 flags;
+ bool shareable;
+ bool read_only;
u8 name[CXL_DC_REGION_STRLEN];
};
@@ -533,6 +535,7 @@ struct cxl_memdev_state {
u8 nr_dc_region;
struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
+ struct cxl_dpa_perf dc_perf[CXL_MAX_DC_REGION];
struct cxl_event_state event;
struct cxl_poison_state poison;
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 07/21] cxl/mem: Expose DCD partition capabilities in sysfs
2024-12-11 3:42 [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (5 preceding siblings ...)
2024-12-11 3:42 ` [PATCH v8 06/21] cxl/cdat: Gather DSMAS data for DCD regions Ira Weiny
@ 2024-12-11 3:42 ` Ira Weiny
2024-12-11 3:42 ` [PATCH v8 08/21] cxl/port: Add endpoint decoder DC mode support to sysfs Ira Weiny
` (13 subsequent siblings)
20 siblings, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening
To properly configure CXL regions on Dynamic Capacity Devices (DCD),
user space will need to know the details of the DC partitions available.
Expose dynamic capacity capabilities through sysfs.
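The visibility rule boils down to the following sketch (the helper name
here is illustrative; the patch generates one such callback per dcY
group via a macro):

  static bool dc_group_visible(struct cxl_memdev_state *mds, int n)
  {
          /* show the dcN group only when the device exposes partition N */
          return mds && n < mds->nr_dc_region;
  }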
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Documentation/ABI/testing/sysfs-bus-cxl | 45 ++++++++++++
drivers/cxl/core/memdev.c | 124 ++++++++++++++++++++++++++++++++
2 files changed, 169 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 3f5627a1210a16aca7c18d17131a56491048a0c2..ff3ae83477f0876c0ee2d3955d27a11fa9d16d83 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -54,6 +54,51 @@ Description:
identically named field in the Identify Memory Device Output
Payload in the CXL-2.0 specification.
+What: /sys/bus/cxl/devices/memX/dcY/size
+Date: December, 2024
+KernelVersion: v6.13
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) Dynamic Capacity (DC) region information. Devices only
+ export dcY if DCD partition Y is supported.
+ dcY/size reports the size of DC partition Y.
+
+What: /sys/bus/cxl/devices/memX/dcY/read_only
+Date: December, 2024
+KernelVersion: v6.13
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) Dynamic Capacity (DC) region information. Devices only
+ export dcY if DCD partition Y is supported.
+ dcY/read_only is 'true' if DC partition Y is exported
+ read-only by the device.
+
+What: /sys/bus/cxl/devices/memX/dcY/shareable
+Date: December, 2024
+KernelVersion: v6.13
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) Dynamic Capacity (DC) region information. Devices only
+ export dcY if DCD partition Y is supported.
+ dcY/shareable is 'true' if DC partition Y is exported
+ shareable by the device.
+
+What: /sys/bus/cxl/devices/memX/dcY/qos_class
+Date: December, 2024
+KernelVersion: v6.13
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) Dynamic Capacity (DC) region information. Devices only
+ export dcY if DCD partition Y is supported. For CXL host
+ platforms that support "QoS Telemetry" this attribute conveys
+ a comma delimited list of platform specific cookies that
+ identifies a QoS performance class for DC partition Y of the
+ CXL mem device. These class-ids can be compared against
+ a similar "qos_class" published for a root decoder. While it is
+ not required that the endpoints map their local memory-class to
+ a matching platform class, mismatches are not recommended as
+ there are platform specific performance related side-effects
+ that may result. The first class-id is displayed.
What: /sys/bus/cxl/devices/memX/pmem/qos_class
Date: May, 2023
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index ae3dfcbe893897aaf315c947d3bdb0741aadf599..56cdf09d3affb81969755769a8803f6bded7a4ce 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -2,6 +2,7 @@
/* Copyright(c) 2020 Intel Corporation. */
#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/string_choices.h>
#include <linux/firmware.h>
#include <linux/device.h>
#include <linux/slab.h>
@@ -449,6 +450,121 @@ static struct attribute *cxl_memdev_security_attributes[] = {
NULL,
};
+static ssize_t show_size_dcN(struct cxl_memdev *cxlmd, char *buf, int pos)
+{
+ struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+ return sysfs_emit(buf, "%#llx\n", mds->dc_region[pos].decode_len);
+}
+
+static ssize_t show_read_only_dcN(struct cxl_memdev *cxlmd, char *buf, int pos)
+{
+ struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+ return sysfs_emit(buf, "%s\n",
+ str_true_false(mds->dc_region[pos].read_only));
+}
+
+static ssize_t show_shareable_dcN(struct cxl_memdev *cxlmd, char *buf, int pos)
+{
+ struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+ return sysfs_emit(buf, "%s\n",
+ str_true_false(mds->dc_region[pos].shareable));
+}
+
+static ssize_t show_qos_class_dcN(struct cxl_memdev *cxlmd, char *buf, int pos)
+{
+ struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+ return sysfs_emit(buf, "%d\n", mds->dc_perf[pos].qos_class);
+}
+
+#define CXL_MEMDEV_DC_ATTR_GROUP(n) \
+static ssize_t dc##n##_size_show(struct device *dev, \
+ struct device_attribute *attr, \
+ char *buf) \
+{ \
+ return show_size_dcN(to_cxl_memdev(dev), buf, (n)); \
+} \
+struct device_attribute dc##n##_size = { \
+ .attr = { .name = "size", .mode = 0444 }, \
+ .show = dc##n##_size_show, \
+}; \
+static ssize_t dc##n##_read_only_show(struct device *dev, \
+ struct device_attribute *attr, \
+ char *buf) \
+{ \
+ return show_read_only_dcN(to_cxl_memdev(dev), buf, (n)); \
+} \
+struct device_attribute dc##n##_read_only = { \
+ .attr = { .name = "read_only", .mode = 0444 }, \
+ .show = dc##n##_read_only_show, \
+}; \
+static ssize_t dc##n##_shareable_show(struct device *dev, \
+ struct device_attribute *attr, \
+ char *buf) \
+{ \
+ return show_shareable_dcN(to_cxl_memdev(dev), buf, (n)); \
+} \
+struct device_attribute dc##n##_shareable = { \
+ .attr = { .name = "shareable", .mode = 0444 }, \
+ .show = dc##n##_shareable_show, \
+}; \
+static ssize_t dc##n##_qos_class_show(struct device *dev, \
+ struct device_attribute *attr, \
+ char *buf) \
+{ \
+ return show_qos_class_dcN(to_cxl_memdev(dev), buf, (n)); \
+} \
+struct device_attribute dc##n##_qos_class = { \
+ .attr = { .name = "qos_class", .mode = 0444 }, \
+ .show = dc##n##_qos_class_show, \
+}; \
+static struct attribute *cxl_memdev_dc##n##_attributes[] = { \
+ &dc##n##_size.attr, \
+ &dc##n##_read_only.attr, \
+ &dc##n##_shareable.attr, \
+ &dc##n##_qos_class.attr, \
+ NULL \
+}; \
+static umode_t cxl_memdev_dc##n##_attr_visible(struct kobject *kobj, \
+ struct attribute *a, \
+ int pos) \
+{ \
+ struct device *dev = kobj_to_dev(kobj); \
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev); \
+ struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds); \
+ \
+ /* Not a memory device */ \
+ if (!mds) \
+ return 0; \
+ return a->mode; \
+} \
+static umode_t cxl_memdev_dc##n##_group_visible(struct kobject *kobj) \
+{ \
+ struct device *dev = kobj_to_dev(kobj); \
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev); \
+ struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds); \
+ \
+ /* Not a memory device or partition not supported */ \
+ return mds && n < mds->nr_dc_region; \
+} \
+DEFINE_SYSFS_GROUP_VISIBLE(cxl_memdev_dc##n); \
+static struct attribute_group cxl_memdev_dc##n##_group = { \
+ .name = "dc"#n, \
+ .attrs = cxl_memdev_dc##n##_attributes, \
+ .is_visible = SYSFS_GROUP_VISIBLE(cxl_memdev_dc##n), \
+}
+CXL_MEMDEV_DC_ATTR_GROUP(0);
+CXL_MEMDEV_DC_ATTR_GROUP(1);
+CXL_MEMDEV_DC_ATTR_GROUP(2);
+CXL_MEMDEV_DC_ATTR_GROUP(3);
+CXL_MEMDEV_DC_ATTR_GROUP(4);
+CXL_MEMDEV_DC_ATTR_GROUP(5);
+CXL_MEMDEV_DC_ATTR_GROUP(6);
+CXL_MEMDEV_DC_ATTR_GROUP(7);
+
static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
int n)
{
@@ -525,6 +641,14 @@ static struct attribute_group cxl_memdev_security_attribute_group = {
};
static const struct attribute_group *cxl_memdev_attribute_groups[] = {
+ &cxl_memdev_dc0_group,
+ &cxl_memdev_dc1_group,
+ &cxl_memdev_dc2_group,
+ &cxl_memdev_dc3_group,
+ &cxl_memdev_dc4_group,
+ &cxl_memdev_dc5_group,
+ &cxl_memdev_dc6_group,
+ &cxl_memdev_dc7_group,
&cxl_memdev_attribute_group,
&cxl_memdev_ram_attribute_group,
&cxl_memdev_pmem_attribute_group,
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 08/21] cxl/port: Add endpoint decoder DC mode support to sysfs
2024-12-11 3:42 [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (6 preceding siblings ...)
2024-12-11 3:42 ` [PATCH v8 07/21] cxl/mem: Expose DCD partition capabilities in sysfs Ira Weiny
@ 2024-12-11 3:42 ` Ira Weiny
2024-12-11 3:42 ` [PATCH v8 09/21] cxl/region: Add sparse DAX region support Ira Weiny
` (12 subsequent siblings)
20 siblings, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening
Endpoint decoder mode represents the partition the decoder points to,
such as ram or pmem.
Expand the mode to allow a decoder to point to a specific DC partition
(Region).
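The dc_mode_to_region_index() helper used by these checks is not shown
in this excerpt; given the contiguous CXL_DECODER_DC0..CXL_DECODER_DC7
values in the enum below, it presumably reduces to an offset, roughly:

  static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
  {
          /* DC0..DC7 are contiguous, so the partition index is an offset */
          return mode - CXL_DECODER_DC0;
  }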
Based on an original patch by Navneet Singh.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[Carpenter/smatch: fix cxl_decoder_mode_names array]
---
Documentation/ABI/testing/sysfs-bus-cxl | 25 ++++++++++++------------
drivers/cxl/core/hdm.c | 16 ++++++++++++++++
drivers/cxl/core/port.c | 16 +++++++++++-----
drivers/cxl/cxl.h | 34 +++++++++++++++++----------------
4 files changed, 58 insertions(+), 33 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index ff3ae83477f0876c0ee2d3955d27a11fa9d16d83..8d990d702f63363879150cf523c0be6229f315e0 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -361,23 +361,24 @@ Description:
What: /sys/bus/cxl/devices/decoderX.Y/mode
-Date: May, 2022
-KernelVersion: v6.0
+Date: May, 2022, October 2024
+KernelVersion: v6.0, v6.13 (dcY)
Contact: linux-cxl@vger.kernel.org
Description:
(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
- translates from a host physical address range, to a device local
- address range. Device-local address ranges are further split
- into a 'ram' (volatile memory) range and 'pmem' (persistent
- memory) range. The 'mode' attribute emits one of 'ram', 'pmem',
- 'mixed', or 'none'. The 'mixed' indication is for error cases
- when a decoder straddles the volatile/persistent partition
- boundary, and 'none' indicates the decoder is not actively
- decoding, or no DPA allocation policy has been set.
+ translates from a host physical address range, to a device
+ local address range. Device-local address ranges are further
+ split into a 'ram' (volatile memory) range, 'pmem' (persistent
+ memory) range, and Dynamic Capacity (DC) ranges. The 'mode'
+ attribute emits one of 'ram', 'pmem', 'dcY', 'mixed', or
+ 'none'. The 'mixed' indication is for error cases when a
+ decoder straddles partition boundaries, and 'none' indicates
+ the decoder is not actively decoding, or no DPA allocation
+ policy has been set.
'mode' can be written, when the decoder is in the 'disabled'
- state, with either 'ram' or 'pmem' to set the boundaries for the
- next allocation.
+ state, with 'ram', 'pmem', or 'dcY' to set the boundaries for
+ the next allocation.
What: /sys/bus/cxl/devices/decoderX.Y/dpa_resource
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index e15241f94d17b774aa5befb37fb453af637a17ce..d0c32c3c6564df869d41030144c6d2a7c063747d 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -548,6 +548,7 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
switch (mode) {
case CXL_DECODER_RAM:
case CXL_DECODER_PMEM:
+ case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
break;
default:
dev_dbg(dev, "unsupported mode: %d\n", mode);
@@ -571,6 +572,21 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
return -ENXIO;
}
+ if (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7) {
+ struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
+ int index;
+
+ index = dc_mode_to_region_index(mode);
+ if (!resource_size(&cxlds->dc_res[index])) {
+ dev_dbg(dev, "no available dynamic capacity\n");
+ return -ENXIO;
+ }
+ if (mds->dc_region[index].shareable) {
+ dev_err(dev, "DC region %d is shareable\n", index);
+ return -EINVAL;
+ }
+ }
+
cxled->mode = mode;
return 0;
}
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 5c0b8ead315f41c4df14918ad4dcdb269990c5dd..7459ca8eae002727405bf1077d0187bcfb579144 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -205,11 +205,17 @@ static ssize_t mode_store(struct device *dev, struct device_attribute *attr,
enum cxl_decoder_mode mode;
ssize_t rc;
- if (sysfs_streq(buf, "pmem"))
- mode = CXL_DECODER_PMEM;
- else if (sysfs_streq(buf, "ram"))
- mode = CXL_DECODER_RAM;
- else
+ for (mode = 0; mode < CXL_DECODER_MODE_MAX; mode++)
+ if (sysfs_streq(buf, cxl_decoder_mode_names[mode]))
+ break;
+
+ if (mode == CXL_DECODER_NONE ||
+ mode == CXL_DECODER_DEAD ||
+ mode == CXL_DECODER_MODE_MAX)
+ return -EINVAL;
+
+ /* Not yet supported */
+ if (mode >= CXL_DECODER_MIXED)
return -EINVAL;
rc = cxl_dpa_set_mode(cxled, mode);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 055c840b6c2856ec77162c3c5f87293f00f8d8ec..79660c87e6be533a1d55311896f9a3c5514648f8 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -389,27 +389,29 @@ enum cxl_decoder_mode {
CXL_DECODER_DC7,
CXL_DECODER_MIXED,
CXL_DECODER_DEAD,
+ CXL_DECODER_MODE_MAX,
+};
+
+static const char * const cxl_decoder_mode_names[] = {
+ [CXL_DECODER_NONE] = "none",
+ [CXL_DECODER_RAM] = "ram",
+ [CXL_DECODER_PMEM] = "pmem",
+ [CXL_DECODER_DC0] = "dc0",
+ [CXL_DECODER_DC1] = "dc1",
+ [CXL_DECODER_DC2] = "dc2",
+ [CXL_DECODER_DC3] = "dc3",
+ [CXL_DECODER_DC4] = "dc4",
+ [CXL_DECODER_DC5] = "dc5",
+ [CXL_DECODER_DC6] = "dc6",
+ [CXL_DECODER_DC7] = "dc7",
+ [CXL_DECODER_MIXED] = "mixed",
+ [CXL_DECODER_DEAD] = "dead",
};
static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
{
- static const char * const names[] = {
- [CXL_DECODER_NONE] = "none",
- [CXL_DECODER_RAM] = "ram",
- [CXL_DECODER_PMEM] = "pmem",
- [CXL_DECODER_DC0] = "dc0",
- [CXL_DECODER_DC1] = "dc1",
- [CXL_DECODER_DC2] = "dc2",
- [CXL_DECODER_DC3] = "dc3",
- [CXL_DECODER_DC4] = "dc4",
- [CXL_DECODER_DC5] = "dc5",
- [CXL_DECODER_DC6] = "dc6",
- [CXL_DECODER_DC7] = "dc7",
- [CXL_DECODER_MIXED] = "mixed",
- };
-
if (mode >= CXL_DECODER_NONE && mode <= CXL_DECODER_MIXED)
- return names[mode];
+ return cxl_decoder_mode_names[mode];
return "mixed";
}
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 09/21] cxl/region: Add sparse DAX region support
2024-12-11 3:42 [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (7 preceding siblings ...)
2024-12-11 3:42 ` [PATCH v8 08/21] cxl/port: Add endpoint decoder DC mode support to sysfs Ira Weiny
@ 2024-12-11 3:42 ` Ira Weiny
2024-12-11 3:42 ` [PATCH v8 10/21] cxl/events: Split event msgnum configuration from irq setup Ira Weiny
` (11 subsequent siblings)
20 siblings, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening
Dynamic Capacity CXL regions must allow memory to be added or removed
dynamically. In addition to the quantity of memory available, the
location of that memory within a DC partition is dynamic, based on the
extents offered by the device. CXL DAX regions must accommodate the
sparseness of this memory in the management of DAX regions and devices.
Introduce the concept of a sparse DAX region. Add a create_dc_region()
sysfs entry to create such regions. Special-case DC-capable regions to
create a zero-sized seed DAX device to maintain compatibility, since
the existing model requires a default DAX device to hold a region
reference. Report zero bytes of available capacity until capacity is
actually added.
Sparse regions complicate the range mapping of dax devices. There is no
known use case for range mapping on sparse regions. Avoid the
complication by preventing range mapping of dax devices on sparse
regions.
Interleaving is deferred for now; add checks rejecting interleaved DC
configurations.
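As a condensed view of the dax plumbing below (the helper wrapper is
illustrative, not code from this patch):

  static unsigned long cxl_dax_region_flags(struct cxl_region *cxlr)
  {
          unsigned long flags = IORESOURCE_DAX_KMEM;

          /* DC capacity arrives later as extents; mark the region sparse */
          if (cxlr->mode == CXL_REGION_DC)
                  flags |= IORESOURCE_DAX_SPARSE_CAP;
          return flags;
  }

Sparse regions then report zero available capacity and are seeded with
a zero-sized DAX device, per the bus.c and cxl.c hunks.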
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Documentation/ABI/testing/sysfs-bus-cxl | 22 ++++++++--------
drivers/cxl/core/core.h | 12 +++++++++
drivers/cxl/core/port.c | 1 +
drivers/cxl/core/region.c | 46 +++++++++++++++++++++++++++++++--
drivers/dax/bus.c | 10 +++++++
drivers/dax/bus.h | 1 +
drivers/dax/cxl.c | 16 ++++++++++--
7 files changed, 93 insertions(+), 15 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 8d990d702f63363879150cf523c0be6229f315e0..aeff248ea368cf49c9977fcaf43ab4def978e896 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -439,20 +439,20 @@ Description:
interleave_granularity).
-What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
-Date: May, 2022, January, 2023
-KernelVersion: v6.0 (pmem), v6.3 (ram)
+What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dc}_region
+Date: May, 2022, January, 2023, August 2024
+KernelVersion: v6.0 (pmem), v6.3 (ram), v6.13 (dc)
Contact: linux-cxl@vger.kernel.org
Description:
(RW) Write a string in the form 'regionZ' to start the process
- of defining a new persistent, or volatile memory region
- (interleave-set) within the decode range bounded by root decoder
- 'decoderX.Y'. The value written must match the current value
- returned from reading this attribute. An atomic compare exchange
- operation is done on write to assign the requested id to a
- region and allocate the region-id for the next creation attempt.
- EBUSY is returned if the region name written does not match the
- current cached value.
+ of defining a new persistent, volatile, or Dynamic Capacity
+ (DC) memory region (interleave-set) within the decode range
+ bounded by root decoder 'decoderX.Y'. The value written must
+ match the current value returned from reading this attribute.
+ An atomic compare exchange operation is done on write to assign
+ the requested id to a region and allocate the region-id for the
+ next creation attempt. EBUSY is returned if the region name
+ written does not match the current cached value.
What: /sys/bus/cxl/devices/decoderX.Y/delete_region
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 800466f96a68517f0c6930faa555b347cf0e156b..03ab7e66102e1e1fa378b9afb1c6b3e8235e8ed4 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -4,15 +4,27 @@
#ifndef __CXL_CORE_H__
#define __CXL_CORE_H__
+#include <cxlmem.h>
+
extern const struct device_type cxl_nvdimm_bridge_type;
extern const struct device_type cxl_nvdimm_type;
extern const struct device_type cxl_pmu_type;
extern struct attribute_group cxl_base_attribute_group;
+static inline struct cxl_memdev_state *
+cxled_to_mds(struct cxl_endpoint_decoder *cxled)
+{
+ struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+ return container_of(cxlds, struct cxl_memdev_state, cxlds);
+}
+
#ifdef CONFIG_CXL_REGION
extern struct device_attribute dev_attr_create_pmem_region;
extern struct device_attribute dev_attr_create_ram_region;
+extern struct device_attribute dev_attr_create_dc_region;
extern struct device_attribute dev_attr_delete_region;
extern struct device_attribute dev_attr_region;
extern const struct device_type cxl_pmem_region_type;
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 7459ca8eae002727405bf1077d0187bcfb579144..5fa4cad4e55adef1808d052ef16d2162baafbf2c 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -326,6 +326,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
&dev_attr_qos_class.attr,
SET_CXL_REGION_ATTR(create_pmem_region)
SET_CXL_REGION_ATTR(create_ram_region)
+ SET_CXL_REGION_ATTR(create_dc_region)
SET_CXL_REGION_ATTR(delete_region)
NULL,
};
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 6c1a63610f5ba79b1da57cc37df4e2b5b88588a6..a393c46871235e33b3f077951f191178be48f449 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -480,6 +480,11 @@ static ssize_t interleave_ways_store(struct device *dev,
if (rc)
return rc;
+ if (cxlr->mode == CXL_REGION_DC && val != 1) {
+ dev_err(dev, "Interleaving and DCD not supported\n");
+ return -EINVAL;
+ }
+
rc = ways_to_eiw(val, &iw);
if (rc)
return rc;
@@ -2177,6 +2182,7 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
if (sysfs_streq(buf, "\n"))
rc = detach_target(cxlr, pos);
else {
+ struct cxl_endpoint_decoder *cxled;
struct device *dev;
dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);
@@ -2188,8 +2194,13 @@ static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
goto out;
}
- rc = attach_target(cxlr, to_cxl_endpoint_decoder(dev), pos,
- TASK_INTERRUPTIBLE);
+ cxled = to_cxl_endpoint_decoder(dev);
+ if (cxlr->mode == CXL_REGION_DC &&
+ !cxl_dcd_supported(cxled_to_mds(cxled))) {
+ dev_dbg(dev, "DCD unsupported\n");
+ return -EINVAL;
+ }
+ rc = attach_target(cxlr, cxled, pos, TASK_INTERRUPTIBLE);
out:
put_device(dev);
}
@@ -2534,6 +2545,7 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
switch (mode) {
case CXL_REGION_RAM:
case CXL_REGION_PMEM:
+ case CXL_REGION_DC:
break;
default:
dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
@@ -2587,6 +2599,20 @@ static ssize_t create_ram_region_store(struct device *dev,
}
DEVICE_ATTR_RW(create_ram_region);
+static ssize_t create_dc_region_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return __create_region_show(to_cxl_root_decoder(dev), buf);
+}
+
+static ssize_t create_dc_region_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ return create_region_store(dev, buf, len, CXL_REGION_DC);
+}
+DEVICE_ATTR_RW(create_dc_region);
+
static ssize_t region_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
@@ -3169,6 +3195,11 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
struct device *dev;
int rc;
+ if (cxlr->mode == CXL_REGION_DC && cxlr->params.interleave_ways != 1) {
+ dev_err(&cxlr->dev, "Interleaving DC not supported\n");
+ return -EINVAL;
+ }
+
cxlr_dax = cxl_dax_region_alloc(cxlr);
if (IS_ERR(cxlr_dax))
return PTR_ERR(cxlr_dax);
@@ -3261,6 +3292,16 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
return ERR_PTR(-EINVAL);
mode = cxl_decoder_to_region_mode(cxled->mode);
+ if (mode == CXL_REGION_DC) {
+ if (!cxl_dcd_supported(cxled_to_mds(cxled))) {
+ dev_err(&cxled->cxld.dev, "DCD unsupported\n");
+ return ERR_PTR(-EINVAL);
+ }
+ if (cxled->cxld.interleave_ways != 1) {
+ dev_err(&cxled->cxld.dev, "Interleaving and DCD not supported\n");
+ return ERR_PTR(-EINVAL);
+ }
+ }
do {
cxlr = __create_region(cxlrd, mode,
atomic_read(&cxlrd->region_id));
@@ -3468,6 +3509,7 @@ static int cxl_region_probe(struct device *dev)
case CXL_REGION_PMEM:
return devm_cxl_add_pmem_region(cxlr);
case CXL_REGION_RAM:
+ case CXL_REGION_DC:
/*
* The region can not be manged by CXL if any portion of
* it is already online as 'System RAM'
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index fde29e0ad68b158c5c88262d434ee7b55a5ce407..d8cb5195a227c0f6194cb210510e006327e1b35b 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -178,6 +178,11 @@ static bool is_static(struct dax_region *dax_region)
return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
}
+static bool is_sparse(struct dax_region *dax_region)
+{
+ return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
+}
+
bool static_dev_dax(struct dev_dax *dev_dax)
{
return is_static(dev_dax->region);
@@ -301,6 +306,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
lockdep_assert_held(&dax_region_rwsem);
+ if (is_sparse(dax_region))
+ return 0;
+
for_each_dax_region_resource(dax_region, res)
size -= resource_size(res);
return size;
@@ -1373,6 +1381,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
return 0;
if (a == &dev_attr_mapping.attr && is_static(dax_region))
return 0;
+ if (a == &dev_attr_mapping.attr && is_sparse(dax_region))
+ return 0;
if ((a == &dev_attr_align.attr ||
a == &dev_attr_size.attr) && is_static(dax_region))
return 0444;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index cbbf64443098c08d944878a190a0da69eccbfbf4..783bfeef42cc6c4d74f24e0a69dac5598eaf1664 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -13,6 +13,7 @@ struct dax_region;
/* dax bus specific ioresource flags */
#define IORESOURCE_DAX_STATIC BIT(0)
#define IORESOURCE_DAX_KMEM BIT(1)
+#define IORESOURCE_DAX_SPARSE_CAP BIT(2)
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct range *range, int target_node, unsigned int align,
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 13cd94d32ff7a1d70af7821c1aecd7490302149d..b4d1ca9b4e9b5105404c6d342522ad73d9fbf8a9 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -13,19 +13,31 @@ static int cxl_dax_region_probe(struct device *dev)
struct cxl_region *cxlr = cxlr_dax->cxlr;
struct dax_region *dax_region;
struct dev_dax_data data;
+ resource_size_t dev_size;
+ unsigned long flags;
if (nid == NUMA_NO_NODE)
nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
+ flags = IORESOURCE_DAX_KMEM;
+ if (cxlr->mode == CXL_REGION_DC)
+ flags |= IORESOURCE_DAX_SPARSE_CAP;
+
dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
- PMD_SIZE, IORESOURCE_DAX_KMEM);
+ PMD_SIZE, flags);
if (!dax_region)
return -ENOMEM;
+ if (cxlr->mode == CXL_REGION_DC)
+ /* Add empty seed dax device */
+ dev_size = 0;
+ else
+ dev_size = range_len(&cxlr_dax->hpa_range);
+
data = (struct dev_dax_data) {
.dax_region = dax_region,
.id = -1,
- .size = range_len(&cxlr_dax->hpa_range),
+ .size = dev_size,
.memmap_on_memory = true,
};
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 10/21] cxl/events: Split event msgnum configuration from irq setup
2024-12-11 3:42 [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (8 preceding siblings ...)
2024-12-11 3:42 ` [PATCH v8 09/21] cxl/region: Add sparse DAX region support Ira Weiny
@ 2024-12-11 3:42 ` Ira Weiny
2024-12-11 3:42 ` [PATCH v8 11/21] cxl/pci: Factor out interrupt policy check Ira Weiny
` (10 subsequent siblings)
20 siblings, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
Dynamic Capacity Devices (DCD) require event interrupts to process
memory addition or removal. BIOS may have control over non-DCD event
processing. DCD interrupt configuration needs to be separate from
memory event interrupt configuration.
Split cxl_event_config_msgnums() from irq setup in preparation for
separate DCD interrupts configuration.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/cxl/pci.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 5082625a7b3f51a84f894a3265e922e51b794b68..650724e6896eb4e39468cfded11e6909f8e207a6 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -715,35 +715,31 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
return cxl_event_get_int_policy(mds, policy);
}
-static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
+static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
+ struct cxl_event_interrupt_policy *policy)
{
struct cxl_dev_state *cxlds = &mds->cxlds;
- struct cxl_event_interrupt_policy policy;
int rc;
- rc = cxl_event_config_msgnums(mds, &policy);
- if (rc)
- return rc;
-
- rc = cxl_event_req_irq(cxlds, policy.info_settings);
+ rc = cxl_event_req_irq(cxlds, policy->info_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Info log\n");
return rc;
}
- rc = cxl_event_req_irq(cxlds, policy.warn_settings);
+ rc = cxl_event_req_irq(cxlds, policy->warn_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Warn log\n");
return rc;
}
- rc = cxl_event_req_irq(cxlds, policy.failure_settings);
+ rc = cxl_event_req_irq(cxlds, policy->failure_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Failure log\n");
return rc;
}
- rc = cxl_event_req_irq(cxlds, policy.fatal_settings);
+ rc = cxl_event_req_irq(cxlds, policy->fatal_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Fatal log\n");
return rc;
@@ -790,11 +786,15 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
return -EBUSY;
}
+ rc = cxl_event_config_msgnums(mds, &policy);
+ if (rc)
+ return rc;
+
rc = cxl_mem_alloc_event_buf(mds);
if (rc)
return rc;
- rc = cxl_event_irqsetup(mds);
+ rc = cxl_event_irqsetup(mds, &policy);
if (rc)
return rc;
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 11/21] cxl/pci: Factor out interrupt policy check
2024-12-11 3:42 [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (9 preceding siblings ...)
2024-12-11 3:42 ` [PATCH v8 10/21] cxl/events: Split event msgnum configuration from irq setup Ira Weiny
@ 2024-12-11 3:42 ` Ira Weiny
2024-12-11 3:42 ` [PATCH v8 12/21] cxl/mem: Configure dynamic capacity interrupts Ira Weiny
` (9 subsequent siblings)
20 siblings, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
Dynamic Capacity Devices (DCD) require event interrupts to process
memory addition or removal. BIOS may have control over non-DCD event
processing. DCD interrupt configuration needs to be separate from
memory event interrupt configuration.
Factor out event interrupt setting validation.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Link: https://lore.kernel.org/all/663922b475e50_d54d72945b@dwillia2-xfh.jf.intel.com.notmuch/ [1]
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/cxl/pci.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 650724e6896eb4e39468cfded11e6909f8e207a6..22e6047e3c3db7a16670b7a5aa4797ad20befb22 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -755,6 +755,21 @@ static bool cxl_event_int_is_fw(u8 setting)
return mode == CXL_INT_FW;
}
+static bool cxl_event_validate_mem_policy(struct cxl_memdev_state *mds,
+ struct cxl_event_interrupt_policy *policy)
+{
+ if (cxl_event_int_is_fw(policy->info_settings) ||
+ cxl_event_int_is_fw(policy->warn_settings) ||
+ cxl_event_int_is_fw(policy->failure_settings) ||
+ cxl_event_int_is_fw(policy->fatal_settings)) {
+ dev_err(mds->cxlds.dev,
+ "FW still in control of Event Logs despite _OSC settings\n");
+ return false;
+ }
+
+ return true;
+}
+
static int cxl_event_config(struct pci_host_bridge *host_bridge,
struct cxl_memdev_state *mds, bool irq_avail)
{
@@ -777,14 +792,8 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
if (rc)
return rc;
- if (cxl_event_int_is_fw(policy.info_settings) ||
- cxl_event_int_is_fw(policy.warn_settings) ||
- cxl_event_int_is_fw(policy.failure_settings) ||
- cxl_event_int_is_fw(policy.fatal_settings)) {
- dev_err(mds->cxlds.dev,
- "FW still in control of Event Logs despite _OSC settings\n");
+ if (!cxl_event_validate_mem_policy(mds, &policy))
return -EBUSY;
- }
rc = cxl_event_config_msgnums(mds, &policy);
if (rc)
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 12/21] cxl/mem: Configure dynamic capacity interrupts
2024-12-11 3:42 [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (10 preceding siblings ...)
2024-12-11 3:42 ` [PATCH v8 11/21] cxl/pci: Factor out interrupt policy check Ira Weiny
@ 2024-12-11 3:42 ` Ira Weiny
2024-12-11 3:42 ` [PATCH v8 13/21] cxl/core: Return endpoint decoder information from region search Ira Weiny
` (8 subsequent siblings)
20 siblings, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
Dynamic Capacity Devices (DCD) support extent change notifications
through the event log mechanism. The interrupt mailbox commands were
extended in CXL 3.1 to support these notifications. Firmware can't
configure DCD events to be FW controlled but can retain control of
memory events.
Configure DCD event log interrupts on devices supporting dynamic
capacity. Disable DCD if interrupts are not supported.
Take care to preserve the interrupt policy set by the FW when
firmware-first error handling has been selected by the BIOS.
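A sketch of the variable-length Set Event Interrupt Policy payload
handling introduced below (names taken from the patch):

  size_t size_in = CXL_EVENT_INT_POLICY_BASE_SIZE; /* info/warn/failure/fatal */

  if (cxl_dcd_supported(mds)) {
          policy->dcd_settings = CXL_INT_MSI_MSIX;
          size_in += sizeof(policy->dcd_settings); /* append the DCD byte */
  }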
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/cxl/cxlmem.h | 2 ++
drivers/cxl/pci.c | 73 ++++++++++++++++++++++++++++++++++++++++++----------
2 files changed, 62 insertions(+), 13 deletions(-)
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index bbdf52ac1d5cb5df82812c13ff50ca7cacfd0db6..863899b295b719b57638ee060e494e5cf2d639fd 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -226,7 +226,9 @@ struct cxl_event_interrupt_policy {
u8 warn_settings;
u8 failure_settings;
u8 fatal_settings;
+ u8 dcd_settings;
} __packed;
+#define CXL_EVENT_INT_POLICY_BASE_SIZE 4 /* info, warn, failure, fatal */
/**
* struct cxl_event_state - Event log driver state
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 22e6047e3c3db7a16670b7a5aa4797ad20befb22..15e85ba66ff7112a8413c4c1acc4b4a71f47a298 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -685,23 +685,34 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
}
static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
- struct cxl_event_interrupt_policy *policy)
+ struct cxl_event_interrupt_policy *policy,
+ bool native_cxl)
{
struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+ size_t size_in = CXL_EVENT_INT_POLICY_BASE_SIZE;
struct cxl_mbox_cmd mbox_cmd;
int rc;
- *policy = (struct cxl_event_interrupt_policy) {
- .info_settings = CXL_INT_MSI_MSIX,
- .warn_settings = CXL_INT_MSI_MSIX,
- .failure_settings = CXL_INT_MSI_MSIX,
- .fatal_settings = CXL_INT_MSI_MSIX,
- };
+ /* memory event policy is left if FW has control */
+ if (native_cxl) {
+ *policy = (struct cxl_event_interrupt_policy) {
+ .info_settings = CXL_INT_MSI_MSIX,
+ .warn_settings = CXL_INT_MSI_MSIX,
+ .failure_settings = CXL_INT_MSI_MSIX,
+ .fatal_settings = CXL_INT_MSI_MSIX,
+ .dcd_settings = 0,
+ };
+ }
+
+ if (cxl_dcd_supported(mds)) {
+ policy->dcd_settings = CXL_INT_MSI_MSIX;
+ size_in += sizeof(policy->dcd_settings);
+ }
mbox_cmd = (struct cxl_mbox_cmd) {
.opcode = CXL_MBOX_OP_SET_EVT_INT_POLICY,
.payload_in = policy,
- .size_in = sizeof(*policy),
+ .size_in = size_in,
};
rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
@@ -748,6 +759,30 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
return 0;
}
+static int cxl_irqsetup(struct cxl_memdev_state *mds,
+ struct cxl_event_interrupt_policy *policy,
+ bool native_cxl)
+{
+ struct cxl_dev_state *cxlds = &mds->cxlds;
+ int rc;
+
+ if (native_cxl) {
+ rc = cxl_event_irqsetup(mds, policy);
+ if (rc)
+ return rc;
+ }
+
+ if (cxl_dcd_supported(mds)) {
+ rc = cxl_event_req_irq(cxlds, policy->dcd_settings);
+ if (rc) {
+ dev_err(cxlds->dev, "Failed to get interrupt for DCD event log\n");
+ cxl_disable_dcd(mds);
+ }
+ }
+
+ return 0;
+}
+
static bool cxl_event_int_is_fw(u8 setting)
{
u8 mode = FIELD_GET(CXLDEV_EVENT_INT_MODE_MASK, setting);
@@ -773,18 +808,26 @@ static bool cxl_event_validate_mem_policy(struct cxl_memdev_state *mds,
static int cxl_event_config(struct pci_host_bridge *host_bridge,
struct cxl_memdev_state *mds, bool irq_avail)
{
- struct cxl_event_interrupt_policy policy;
+ struct cxl_event_interrupt_policy policy = { 0 };
+ bool native_cxl = host_bridge->native_cxl_error;
int rc;
/*
* When BIOS maintains CXL error reporting control, it will process
* event records. Only one agent can do so.
+ *
+ * If BIOS has control of events and DCD is not supported skip event
+ * configuration.
*/
- if (!host_bridge->native_cxl_error)
+ if (!native_cxl && !cxl_dcd_supported(mds))
return 0;
if (!irq_avail) {
dev_info(mds->cxlds.dev, "No interrupt support, disable event processing.\n");
+ if (cxl_dcd_supported(mds)) {
+ dev_info(mds->cxlds.dev, "DCD requires interrupts, disable DCD\n");
+ cxl_disable_dcd(mds);
+ }
return 0;
}
@@ -792,10 +835,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
if (rc)
return rc;
- if (!cxl_event_validate_mem_policy(mds, &policy))
+ if (native_cxl && !cxl_event_validate_mem_policy(mds, &policy))
return -EBUSY;
- rc = cxl_event_config_msgnums(mds, &policy);
+ rc = cxl_event_config_msgnums(mds, &policy, native_cxl);
if (rc)
return rc;
@@ -803,12 +846,16 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
if (rc)
return rc;
- rc = cxl_event_irqsetup(mds, &policy);
+ rc = cxl_irqsetup(mds, &policy, native_cxl);
if (rc)
return rc;
cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
+ dev_dbg(mds->cxlds.dev, "Event config : %s DCD %s\n",
+ native_cxl ? "OS" : "BIOS",
+ cxl_dcd_supported(mds) ? "supported" : "not supported");
+
return 0;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 13/21] cxl/core: Return endpoint decoder information from region search
2024-12-11 3:42 [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (11 preceding siblings ...)
2024-12-11 3:42 ` [PATCH v8 12/21] cxl/mem: Configure dynamic capacity interrupts Ira Weiny
@ 2024-12-11 3:42 ` Ira Weiny
2024-12-11 3:42 ` [PATCH v8 14/21] cxl/extent: Process DCD events and realize region extents Ira Weiny
` (7 subsequent siblings)
20 siblings, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
cxl_dpa_to_region() finds the region from a <DPA, device> tuple.
The search involves finding the device endpoint decoder as well.
Dynamic capacity extent processing uses the endpoint decoder HPA
information to calculate the HPA offset. In addition, well behaved
extents should be contained within an endpoint decoder.
Return the endpoint decoder found to be used in subsequent DCD code.
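With the new signature a DCD caller retrieves both in one lookup, while
existing callers pass NULL, as the updated call sites below show. A
sketch (the dev_dbg() is illustrative only):

  struct cxl_endpoint_decoder *cxled;
  struct cxl_region *cxlr;

  cxlr = cxl_dpa_to_region(cxlmd, dpa, &cxled);
  if (cxlr)
          dev_dbg(dev, "dpa %#llx -> %s via %s\n", dpa,
                  dev_name(&cxlr->dev), dev_name(&cxled->cxld.dev));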
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/cxl/core/core.h | 6 ++++--
drivers/cxl/core/mbox.c | 2 +-
drivers/cxl/core/memdev.c | 4 ++--
drivers/cxl/core/region.c | 8 +++++++-
4 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 03ab7e66102e1e1fa378b9afb1c6b3e8235e8ed4..cada2647966d91bf3997a1c3f1252c100f7d0b30 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -39,7 +39,8 @@ void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
int cxl_region_init(void);
void cxl_region_exit(void);
int cxl_get_poison_by_endpoint(struct cxl_port *port);
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa);
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+ struct cxl_endpoint_decoder **cxled);
u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
u64 dpa);
@@ -50,7 +51,8 @@ static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
return ULLONG_MAX;
}
static inline
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+ struct cxl_endpoint_decoder **cxled)
{
return NULL;
}
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 56c4389e0031e15bc66056b8a73f4159864f6c4e..6305cce453c0e6fdef1a7ddf3444f6794831f9d0 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -916,7 +916,7 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
guard(rwsem_read)(&cxl_dpa_rwsem);
dpa = le64_to_cpu(evt->media_hdr.phys_addr) & CXL_DPA_MASK;
- cxlr = cxl_dpa_to_region(cxlmd, dpa);
+ cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
if (cxlr)
hpa = cxl_dpa_to_hpa(cxlr, cxlmd, dpa);
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 56cdf09d3affb81969755769a8803f6bded7a4ce..e0dbf2a8398adb47e7b9c4261b77fa77dcde7463 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -313,7 +313,7 @@ int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa)
if (rc)
goto out;
- cxlr = cxl_dpa_to_region(cxlmd, dpa);
+ cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
if (cxlr)
dev_warn_once(cxl_mbox->host,
"poison inject dpa:%#llx region: %s\n", dpa,
@@ -377,7 +377,7 @@ int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa)
if (rc)
goto out;
- cxlr = cxl_dpa_to_region(cxlmd, dpa);
+ cxlr = cxl_dpa_to_region(cxlmd, dpa, NULL);
if (cxlr)
dev_warn_once(cxl_mbox->host,
"poison clear dpa:%#llx region: %s\n", dpa,
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index a393c46871235e33b3f077951f191178be48f449..5154d00d2ee2026041d93bb4b20c9e0bb97f6449 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2828,6 +2828,7 @@ int cxl_get_poison_by_endpoint(struct cxl_port *port)
struct cxl_dpa_to_region_context {
struct cxl_region *cxlr;
u64 dpa;
+ struct cxl_endpoint_decoder *cxled;
};
static int __cxl_dpa_to_region(struct device *dev, void *arg)
@@ -2861,11 +2862,13 @@ static int __cxl_dpa_to_region(struct device *dev, void *arg)
dev_name(dev));
ctx->cxlr = cxlr;
+ ctx->cxled = cxled;
return 1;
}
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
+struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
+ struct cxl_endpoint_decoder **cxled)
{
struct cxl_dpa_to_region_context ctx;
struct cxl_port *port;
@@ -2877,6 +2880,9 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
if (port && is_cxl_endpoint(port) && cxl_num_decoders_committed(port))
device_for_each_child(&port->dev, &ctx, __cxl_dpa_to_region);
+ if (cxled)
+ *cxled = ctx.cxled;
+
return ctx.cxlr;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 14/21] cxl/extent: Process DCD events and realize region extents
2024-12-11 3:42 [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (12 preceding siblings ...)
2024-12-11 3:42 ` [PATCH v8 13/21] cxl/core: Return endpoint decoder information from region search Ira Weiny
@ 2024-12-11 3:42 ` Ira Weiny
2024-12-11 3:42 ` [PATCH v8 15/21] cxl/region/extent: Expose region extent information in sysfs Ira Weiny
` (6 subsequent siblings)
20 siblings, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
A dynamic capacity device (DCD) sends events to signal the host for
changes in the availability of Dynamic Capacity (DC) memory. These
events contain extents describing a DPA range and meta data for memory
to be added or removed. Events may be sent from the device at any time.
Three types of events can be signaled: Add, Release, and Force Release.
On add, the host may accept or reject the memory being offered. If no
region exists, or the extent is invalid, the extent should be rejected.
Add extent events may be grouped by a 'more' bit which indicates those
extents should be processed as a group.
On remove, the host can delay the response until it is safely no longer
using the memory. If no region exists the release can be sent
immediately. The host may also release extents (or partial extents) at
any time. Thus the 'more' bit grouping of release events is of less
value and can be ignored in favor of sending multiple release capacity
responses for groups of release events.
Force removal is intended as a mechanism between the FM and the device,
for use only when the host is unresponsive, out of sync, or
otherwise broken. Purposely ignore force removal events.
Regions are made up of one or more devices which may be surfacing memory
to the host. Once all devices in a region have surfaced an extent, the
region can expose a corresponding extent for the user to consume.
Without interleaving, a device extent forms a 1:1 relationship with the
region extent. Immediately surface a region extent upon getting a
device extent.
Per the specification the device is allowed to offer or remove extents
at any time. However, anticipated use cases can expect extents to be
offered, accepted, and removed in well-defined chunks.
Simplify extent tracking with the following restrictions (see the
sketch after this list).
1) Flag for removal any extent which overlaps a requested
release range.
2) Refuse the offer of extents which overlap already accepted
memory ranges.
3) Accept again a range which has already been accepted by the
host. Accepting duplicates serves three purposes. First, it
simplifies the code if the device gets out of sync with the
host, since it is safe to acknowledge the extent again.
Second, it simplifies processing of existing extents if the
extent list changes while it is being read. Third, duplicates
for a given region which are seen during a race between the
hardware surfacing an extent and the cxl dax driver scanning
for existing extents will be ignored.
NOTE: Processing existing extents is done in a later patch.
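A sketch of the resulting accept policy (extents_contain() appears in
the extent.c hunk below; the overlap wrapper is assumed to follow the
same pattern as match_overlaps()):

  if (extents_contain(cxlr_dax, cxled, &new_range))
          return 0;       /* duplicate of accepted memory; accept again */

  if (extents_overlap(cxlr_dax, cxled, &new_range))
          return -EINVAL; /* overlaps accepted memory; reject the offer */

  /* otherwise surface a new region extent 1:1 with the device extent */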
Management of the region extent devices must be synchronized with
potential uses of the memory within the DAX layer. Create region extent
devices as children of the cxl_dax_region device such that the DAX
region driver can co-drive them and synchronize with the DAX layer.
Synchronization and management is handled in a subsequent patch.
Tags are not yet supported within the DAX layer. To maintain
compatibility with legacy DAX/region processing, only tags with a value
of 0 are allowed. This treats existing DAX devices as having a 0 tag,
which is the most logical default.
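Enforcement of the zero tag presumably amounts to a null-uuid check in
the extent acceptance path; a sketch under that assumption (not code
from this series):

  /* reject tagged extents until DAX-layer tag support lands */
  if (!uuid_is_null(&region_extent->tag))
          return -EOPNOTSUPP;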
Process DCD events and create region devices.
Based on an original patch by Navneet Singh.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[Ming: Fix setting the more flag]
[iweiny: Fix export symbol]
[iweiny: convert range prints to %pra]
---
drivers/cxl/core/Makefile | 2 +-
drivers/cxl/core/core.h | 13 ++
drivers/cxl/core/extent.c | 365 ++++++++++++++++++++++++++++++++++++++++++++++
drivers/cxl/core/mbox.c | 288 +++++++++++++++++++++++++++++++++++-
drivers/cxl/core/region.c | 3 +
drivers/cxl/cxl.h | 53 ++++++-
drivers/cxl/cxlmem.h | 27 ++++
include/cxl/event.h | 32 ++++
tools/testing/cxl/Kbuild | 3 +-
9 files changed, 782 insertions(+), 4 deletions(-)
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 9259bcc6773c804ccace2478c9f6f09267b48c9d..3b812515e72536aee5cd305e1ffabfd5a8bd296c 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -15,4 +15,4 @@ cxl_core-y += hdm.o
cxl_core-y += pmu.o
cxl_core-y += cdat.o
cxl_core-$(CONFIG_TRACING) += trace.o
-cxl_core-$(CONFIG_CXL_REGION) += region.o
+cxl_core-$(CONFIG_CXL_REGION) += region.o extent.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index cada2647966d91bf3997a1c3f1252c100f7d0b30..943869e8dd7da0f8e0b9970f323392006048ac41 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -44,12 +44,24 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
u64 dpa);
+int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
+int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent);
#else
static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
const struct cxl_memdev *cxlmd, u64 dpa)
{
return ULLONG_MAX;
}
+static inline int cxl_add_extent(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent)
+{
+ return 0;
+}
+static inline int cxl_rm_extent(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent)
+{
+ return 0;
+}
static inline
struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa,
struct cxl_endpoint_decoder **cxled)
@@ -128,5 +140,6 @@ int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
bool cxl_need_node_perf_attrs_update(int nid);
int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
struct access_coordinate *c);
+void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range);
#endif /* __CXL_CORE_H__ */
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
new file mode 100644
index 0000000000000000000000000000000000000000..a45ff84727b0f8c2567f0d2dd8b5c261b23695e3
--- /dev/null
+++ b/drivers/cxl/core/extent.c
@@ -0,0 +1,365 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2024 Intel Corporation. All rights reserved. */
+
+#include <linux/device.h>
+#include <cxl.h>
+
+#include "core.h"
+
+static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
+ struct cxled_extent *ed_extent)
+{
+ struct cxl_memdev_state *mds = cxled_to_mds(cxled);
+ struct device *dev = &cxled->cxld.dev;
+
+ dev_dbg(dev, "Remove extent %pra (%pU)\n",
+ &ed_extent->dpa_range, ed_extent->tag);
+ memdev_release_extent(mds, &ed_extent->dpa_range);
+ kfree(ed_extent);
+}
+
+static void free_region_extent(struct region_extent *region_extent)
+{
+ struct cxled_extent *ed_extent;
+ unsigned long index;
+
+ /*
+ * Remove from each endpoint decoder the extent which backs this region
+ * extent
+ */
+ xa_for_each(®ion_extent->decoder_extents, index, ed_extent)
+ cxled_release_extent(ed_extent->cxled, ed_extent);
+ xa_destroy(®ion_extent->decoder_extents);
+ ida_free(®ion_extent->cxlr_dax->extent_ida, region_extent->dev.id);
+ kfree(region_extent);
+}
+
+static void region_extent_release(struct device *dev)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+
+ free_region_extent(region_extent);
+}
+
+static const struct device_type region_extent_type = {
+ .name = "extent",
+ .release = region_extent_release,
+};
+
+bool is_region_extent(struct device *dev)
+{
+ return dev->type == &region_extent_type;
+}
+EXPORT_SYMBOL_NS_GPL(is_region_extent, "CXL");
+
+static void region_extent_unregister(void *ext)
+{
+ struct region_extent *region_extent = ext;
+
+ dev_dbg(&region_extent->dev, "DAX region rm extent HPA %pra\n",
+ &region_extent->hpa_range);
+ device_unregister(&region_extent->dev);
+}
+
+static void region_rm_extent(struct region_extent *region_extent)
+{
+ struct device *region_dev = region_extent->dev.parent;
+
+ devm_release_action(region_dev, region_extent_unregister, region_extent);
+}
+
+static struct region_extent *
+alloc_region_extent(struct cxl_dax_region *cxlr_dax, struct range *hpa_range, u8 *tag)
+{
+ int id;
+
+ struct region_extent *region_extent __free(kfree) =
+ kzalloc(sizeof(*region_extent), GFP_KERNEL);
+ if (!region_extent)
+ return ERR_PTR(-ENOMEM);
+
+ id = ida_alloc(&cxlr_dax->extent_ida, GFP_KERNEL);
+ if (id < 0)
+ return ERR_PTR(-ENOMEM);
+
+ region_extent->hpa_range = *hpa_range;
+ region_extent->cxlr_dax = cxlr_dax;
+ import_uuid(&region_extent->tag, tag);
+ region_extent->dev.id = id;
+ xa_init(&region_extent->decoder_extents);
+ return no_free_ptr(region_extent);
+}
+
+static int online_region_extent(struct region_extent *region_extent)
+{
+ struct cxl_dax_region *cxlr_dax = region_extent->cxlr_dax;
+ struct device *dev = &region_extent->dev;
+ int rc;
+
+ device_initialize(dev);
+ device_set_pm_not_required(dev);
+ dev->parent = &cxlr_dax->dev;
+ dev->type = &region_extent_type;
+ rc = dev_set_name(dev, "extent%d.%d", cxlr_dax->cxlr->id, dev->id);
+ if (rc)
+ goto err;
+
+ rc = device_add(dev);
+ if (rc)
+ goto err;
+
+ dev_dbg(dev, "region extent HPA %pra\n", ®ion_extent->hpa_range);
+ return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
+ region_extent);
+
+err:
+ dev_err(&cxlr_dax->dev, "Failed to initialize region extent HPA %pra\n",
+ &region_extent->hpa_range);
+
+ put_device(dev);
+ return rc;
+}
+
+struct match_data {
+ struct cxl_endpoint_decoder *cxled;
+ struct range *new_range;
+};
+
+static int match_contains(struct device *dev, void *data)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+ struct match_data *md = data;
+ struct cxled_extent *entry;
+ unsigned long index;
+
+ if (!region_extent)
+ return 0;
+
+ xa_for_each(&region_extent->decoder_extents, index, entry) {
+ if (md->cxled == entry->cxled &&
+ range_contains(&entry->dpa_range, md->new_range))
+ return 1;
+ }
+ return 0;
+}
+
+static bool extents_contain(struct cxl_dax_region *cxlr_dax,
+ struct cxl_endpoint_decoder *cxled,
+ struct range *new_range)
+{
+ struct match_data md = {
+ .cxled = cxled,
+ .new_range = new_range,
+ };
+
+ struct device *extent_device __free(put_device)
+ = device_find_child(&cxlr_dax->dev, &md, match_contains);
+ if (!extent_device)
+ return false;
+
+ return true;
+}
+
+static int match_overlaps(struct device *dev, void *data)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+ struct match_data *md = data;
+ struct cxled_extent *entry;
+ unsigned long index;
+
+ if (!region_extent)
+ return 0;
+
+ xa_for_each(&region_extent->decoder_extents, index, entry) {
+ if (md->cxled == entry->cxled &&
+ range_overlaps(&entry->dpa_range, md->new_range))
+ return 1;
+ }
+
+ return 0;
+}
+
+static bool extents_overlap(struct cxl_dax_region *cxlr_dax,
+ struct cxl_endpoint_decoder *cxled,
+ struct range *new_range)
+{
+ struct match_data md = {
+ .cxled = cxled,
+ .new_range = new_range,
+ };
+
+ struct device *extent_device __free(put_device)
+ = device_find_child(&cxlr_dax->dev, &md, match_overlaps);
+ if (!extent_device)
+ return false;
+
+ return true;
+}
+
+static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
+ struct cxl_dax_region *cxlr_dax,
+ struct range *dpa_range,
+ struct range *hpa_range)
+{
+ resource_size_t dpa_offset, hpa;
+
+ dpa_offset = dpa_range->start - cxled->dpa_res->start;
+ hpa = cxled->cxld.hpa_range.start + dpa_offset;
+
+ hpa_range->start = hpa - cxlr_dax->hpa_range.start;
+ hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
+}
+
+static int cxlr_rm_extent(struct device *dev, void *data)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+ struct range *region_hpa_range = data;
+
+ if (!region_extent)
+ return 0;
+
+ /*
+ * Any extent which 'touches' the released range is removed.
+ */
+ if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) {
+ dev_dbg(dev, "Remove region extent HPA %pra\n",
+ &region_extent->hpa_range);
+ region_rm_extent(region_extent);
+ }
+ return 0;
+}
+
+int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
+{
+ u64 start_dpa = le64_to_cpu(extent->start_dpa);
+ struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+ struct cxl_endpoint_decoder *cxled;
+ struct range hpa_range, dpa_range;
+ struct cxl_region *cxlr;
+
+ dpa_range = (struct range) {
+ .start = start_dpa,
+ .end = start_dpa + le64_to_cpu(extent->length) - 1,
+ };
+
+ guard(rwsem_read)(&cxl_region_rwsem);
+ cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
+ if (!cxlr) {
+ /*
+ * Finding no region here can happen for a few reasons:
+ *
+ * 1) Extents were accepted and the host crashed/rebooted
+ * leaving them in an accepted state. On reboot the host
+ * has not yet created a region to own them.
+ *
+ * 2) Region destruction won the race with the device releasing
+ * all the extents. Here the release will be a duplicate of
+ * the one sent via region destruction.
+ *
+ * 3) The device is confused and releasing extents for which no
+ * region ever existed.
+ *
+ * In all these cases make sure the device knows we are not
+ * using this extent.
+ */
+ memdev_release_extent(mds, &dpa_range);
+ return -ENXIO;
+ }
+
+ calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
+
+ /* Remove region extents which overlap */
+ return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
+ cxlr_rm_extent);
+}
+
+static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
+ struct cxl_endpoint_decoder *cxled,
+ struct cxled_extent *ed_extent)
+{
+ struct region_extent *region_extent;
+ struct range hpa_range;
+ int rc;
+
+ calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range);
+
+ region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag);
+ if (IS_ERR(region_extent))
+ return PTR_ERR(region_extent);
+
+ rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent,
+ ed_extent, GFP_KERNEL);
+ if (rc) {
+ free_region_extent(region_extent);
+ return rc;
+ }
+
+ /* device model handles freeing region_extent */
+ return online_region_extent(region_extent);
+}
+
+/* Callers are expected to ensure cxled has been attached to a region */
+int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
+{
+ u64 start_dpa = le64_to_cpu(extent->start_dpa);
+ struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+ struct cxl_endpoint_decoder *cxled;
+ struct range ed_range, ext_range;
+ struct cxl_dax_region *cxlr_dax;
+ struct cxled_extent *ed_extent;
+ struct cxl_region *cxlr;
+ struct device *dev;
+
+ ext_range = (struct range) {
+ .start = start_dpa,
+ .end = start_dpa + le64_to_cpu(extent->length) - 1,
+ };
+
+ guard(rwsem_read)(&cxl_region_rwsem);
+ cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
+ if (!cxlr)
+ return -ENXIO;
+
+ cxlr_dax = cxled->cxld.region->cxlr_dax;
+ dev = &cxled->cxld.dev;
+ ed_range = (struct range) {
+ .start = cxled->dpa_res->start,
+ .end = cxled->dpa_res->end,
+ };
+
+ dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %pra\n",
+ cxled->dpa_res, &ext_range);
+
+ if (!range_contains(&ed_range, &ext_range)) {
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra (%pU) is not fully in ED %pra\n",
+ &ext_range, extent->tag, &ed_range);
+ return -ENXIO;
+ }
+
+ /*
+ * Allowing duplicates or extents which are already in an accepted
+ * range simplifies extent processing, especially when dealing with the
+ * cxl dax driver scanning for existing extents.
+ */
+ if (extents_contain(cxlr_dax, cxled, &ext_range)) {
+ dev_warn_ratelimited(dev, "Extent %pra exists; accept again\n",
+ &ext_range);
+ return 0;
+ }
+
+ if (extents_overlap(cxlr_dax, cxled, &ext_range))
+ return -ENXIO;
+
+ ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL);
+ if (!ed_extent)
+ return -ENOMEM;
+
+ ed_extent->cxled = cxled;
+ ed_extent->dpa_range = ext_range;
+ memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN);
+
+ dev_dbg(dev, "Add extent %pra (%pU)\n", &ed_extent->dpa_range, ed_extent->tag);
+
+ return cxlr_add_extent(cxlr_dax, cxled, ed_extent);
+}
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 6305cce453c0e6fdef1a7ddf3444f6794831f9d0..5bce54da3bcfb934c0b7b0609fa7a961ee4854b7 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -889,6 +889,55 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
}
EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, "CXL");
+static u8 zero_tag[CXL_EXTENT_TAG_LEN] = { 0 };
+
+static int cxl_validate_extent(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent)
+{
+ u64 start = le64_to_cpu(extent->start_dpa);
+ u64 length = le64_to_cpu(extent->length);
+ struct device *dev = mds->cxlds.dev;
+
+ struct range ext_range = (struct range){
+ .start = start,
+ .end = start + length - 1,
+ };
+
+ if (le16_to_cpu(extent->shared_extn_seq) != 0) {
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra (%pU) can not be shared\n",
+ &ext_range, extent->tag);
+ return -ENXIO;
+ }
+
+ if (memcmp(extent->tag, zero_tag, CXL_EXTENT_TAG_LEN)) {
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra (%pU); tags not supported\n",
+ &ext_range, extent->tag);
+ return -ENXIO;
+ }
+
+ /* Extents must not cross DC region boundaries */
+ for (int i = 0; i < mds->nr_dc_region; i++) {
+ struct cxl_dc_region_info *dcr = &mds->dc_region[i];
+ struct range region_range = (struct range) {
+ .start = dcr->base,
+ .end = dcr->base + dcr->decode_len - 1,
+ };
+
+ if (range_contains(®ion_range, &ext_range)) {
+ dev_dbg(dev, "DC extent DPA %pra (DCR:%d:%#llx)(%pU)\n",
+ &ext_range, i, start - dcr->base, extent->tag);
+ return 0;
+ }
+ }
+
+ dev_err_ratelimited(dev,
+ "DC extent DPA %pra (%pU) is not in any DC region\n",
+ &ext_range, extent->tag);
+ return -ENXIO;
+}
+
void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
enum cxl_event_log_type type,
enum cxl_event_type event_type,
@@ -1017,6 +1066,221 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
return rc;
}
+static int send_one_response(struct cxl_mailbox *cxl_mbox,
+ struct cxl_mbox_dc_response *response,
+ int opcode, u32 extent_list_size, u8 flags)
+{
+ struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = opcode,
+ .size_in = struct_size(response, extent_list, extent_list_size),
+ .payload_in = response,
+ };
+
+ response->extent_list_size = cpu_to_le32(extent_list_size);
+ response->flags = flags;
+ return cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+}
+
+static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
+ struct xarray *extent_array, int cnt)
+{
+ struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+ struct cxl_mbox_dc_response *p;
+ struct cxl_extent *extent;
+ unsigned long index;
+ u32 pl_index;
+
+ size_t pl_size = struct_size(p, extent_list, cnt);
+ u32 max_extents = cnt;
+
+ /* May need to set the MORE bit and split across multiple responses. */
+ if (pl_size > cxl_mbox->payload_size) {
+ max_extents = (cxl_mbox->payload_size - sizeof(*p)) /
+ sizeof(struct updated_extent_list);
+ pl_size = struct_size(p, extent_list, max_extents);
+ }
+
+ struct cxl_mbox_dc_response *response __free(kfree) =
+ kzalloc(pl_size, GFP_KERNEL);
+ if (!response)
+ return -ENOMEM;
+
+ if (cnt == 0)
+ return send_one_response(cxl_mbox, response, opcode, 0, 0);
+
+ pl_index = 0;
+ xa_for_each(extent_array, index, extent) {
+ response->extent_list[pl_index].dpa_start = extent->start_dpa;
+ response->extent_list[pl_index].length = extent->length;
+ pl_index++;
+
+ if (pl_index == max_extents) {
+ u8 flags = 0;
+ int rc;
+
+ if (pl_index < cnt)
+ flags |= CXL_DCD_EVENT_MORE;
+ rc = send_one_response(cxl_mbox, response, opcode,
+ pl_index, flags);
+ if (rc)
+ return rc;
+ cnt -= pl_index;
+ pl_index = 0;
+ }
+ }
+
+ if (!pl_index) /* nothing more to do */
+ return 0;
+ return send_one_response(cxl_mbox, response, opcode, pl_index, 0);
+}
+
+void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
+{
+ struct device *dev = mds->cxlds.dev;
+ struct xarray extent_list;
+
+ struct cxl_extent extent = {
+ .start_dpa = cpu_to_le64(range->start),
+ .length = cpu_to_le64(range_len(range)),
+ };
+
+ dev_dbg(dev, "Release response dpa %pra\n", &range);
+
+ xa_init(&extent_list);
+ if (xa_insert(&extent_list, 0, &extent, GFP_KERNEL)) {
+ dev_dbg(dev, "Failed to release %pra\n", &range);
+ goto destroy;
+ }
+
+ if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))
+ dev_dbg(dev, "Failed to release %pra\n", &range);
+
+destroy:
+ xa_destroy(&extent_list);
+}
+
+static int validate_add_extent(struct cxl_memdev_state *mds,
+ struct cxl_extent *extent)
+{
+ int rc;
+
+ rc = cxl_validate_extent(mds, extent);
+ if (rc)
+ return rc;
+
+ return cxl_add_extent(mds, extent);
+}
+
+static int cxl_add_pending(struct cxl_memdev_state *mds)
+{
+ struct device *dev = mds->cxlds.dev;
+ struct cxl_extent *extent;
+ unsigned long cnt = 0;
+ unsigned long index;
+ int rc;
+
+ xa_for_each(&mds->pending_extents, index, extent) {
+ if (validate_add_extent(mds, extent)) {
+ /*
+ * Any extents which are to be rejected are omitted from
+ * the response. An empty response means all are
+ * rejected.
+ */
+ dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n",
+ le64_to_cpu(extent->start_dpa),
+ le64_to_cpu(extent->length));
+ xa_erase(&mds->pending_extents, index);
+ kfree(extent);
+ continue;
+ }
+ cnt++;
+ }
+ rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
+ &mds->pending_extents, cnt);
+ xa_for_each(&mds->pending_extents, index, extent) {
+ xa_erase(&mds->pending_extents, index);
+ kfree(extent);
+ }
+ return rc;
+}
+
+static int handle_add_event(struct cxl_memdev_state *mds,
+ struct cxl_event_dcd *event)
+{
+ struct device *dev = mds->cxlds.dev;
+ struct cxl_extent *extent;
+
+ extent = kmemdup(&event->extent, sizeof(*extent), GFP_KERNEL);
+ if (!extent)
+ return -ENOMEM;
+
+ if (xa_insert(&mds->pending_extents, (unsigned long)extent, extent,
+ GFP_KERNEL)) {
+ kfree(extent);
+ return -ENOMEM;
+ }
+
+ if (event->flags & CXL_DCD_EVENT_MORE) {
+ dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
+ return 0;
+ }
+
+ /* extents are removed and freed in cxl_add_pending() */
+ return cxl_add_pending(mds);
+}
+
+static char *cxl_dcd_evt_type_str(u8 type)
+{
+ switch (type) {
+ case DCD_ADD_CAPACITY:
+ return "add";
+ case DCD_RELEASE_CAPACITY:
+ return "release";
+ case DCD_FORCED_CAPACITY_RELEASE:
+ return "force release";
+ default:
+ break;
+ }
+
+ return "<unknown>";
+}
+
+static void cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
+ struct cxl_event_record_raw *raw_rec)
+{
+ struct cxl_event_dcd *event = &raw_rec->event.dcd;
+ struct cxl_extent *extent = &event->extent;
+ struct device *dev = mds->cxlds.dev;
+ uuid_t *id = &raw_rec->id;
+ int rc;
+
+ if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID))
+ return;
+
+ dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n",
+ cxl_dcd_evt_type_str(event->event_type),
+ le64_to_cpu(extent->start_dpa), le64_to_cpu(extent->length));
+
+ switch (event->event_type) {
+ case DCD_ADD_CAPACITY:
+ rc = handle_add_event(mds, event);
+ break;
+ case DCD_RELEASE_CAPACITY:
+ rc = cxl_rm_extent(mds, &event->extent);
+ break;
+ case DCD_FORCED_CAPACITY_RELEASE:
+ dev_err_ratelimited(dev, "Forced release event ignored.\n");
+ rc = 0;
+ break;
+ default:
+ rc = -EINVAL;
+ break;
+ }
+
+ if (rc)
+ dev_err_ratelimited(dev, "dcd event failed: %d\n", rc);
+}
+
static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
enum cxl_event_log_type type)
{
@@ -1053,9 +1317,13 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
if (!nr_rec)
break;
- for (i = 0; i < nr_rec; i++)
+ for (i = 0; i < nr_rec; i++) {
__cxl_event_trace_record(cxlmd, type,
&payload->records[i]);
+ if (type == CXL_EVENT_TYPE_DCD)
+ cxl_handle_dcd_event_records(mds,
+ &payload->records[i]);
+ }
if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
trace_cxl_overflow(cxlmd, type, payload);
@@ -1087,6 +1355,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
{
dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status);
+ if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD))
+ cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
if (status & CXLDEV_EVENT_STATUS_FATAL)
cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_FATAL);
if (status & CXLDEV_EVENT_STATUS_FAIL)
@@ -1632,9 +1902,21 @@ int cxl_mailbox_init(struct cxl_mailbox *cxl_mbox, struct device *host)
}
EXPORT_SYMBOL_NS_GPL(cxl_mailbox_init, "CXL");
+static void clear_pending_extents(void *_mds)
+{
+ struct cxl_memdev_state *mds = _mds;
+ struct cxl_extent *extent;
+ unsigned long index;
+
+ xa_for_each(&mds->pending_extents, index, extent)
+ kfree(extent);
+ xa_destroy(&mds->pending_extents);
+}
+
struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
{
struct cxl_memdev_state *mds;
+ int rc;
mds = devm_kzalloc(dev, sizeof(*mds), GFP_KERNEL);
if (!mds) {
@@ -1651,6 +1933,10 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID;
for (int i = 0; i < CXL_MAX_DC_REGION; i++)
mds->dc_perf[i].qos_class = CXL_QOS_CLASS_INVALID;
+ xa_init(&mds->pending_extents);
+ rc = devm_add_action_or_reset(dev, clear_pending_extents, mds);
+ if (rc)
+ return ERR_PTR(rc);
return mds;
}
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 5154d00d2ee2026041d93bb4b20c9e0bb97f6449..608c90ac2507b2dc4a50daa66c382939bd7b2c74 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3037,6 +3037,7 @@ static void cxl_dax_region_release(struct device *dev)
{
struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+ ida_destroy(&cxlr_dax->extent_ida);
kfree(cxlr_dax);
}
@@ -3090,6 +3091,8 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
dev = &cxlr_dax->dev;
cxlr_dax->cxlr = cxlr;
+ cxlr->cxlr_dax = cxlr_dax;
+ ida_init(&cxlr_dax->extent_ida);
device_initialize(dev);
lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
device_set_pm_not_required(dev);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 79660c87e6be533a1d55311896f9a3c5514648f8..d5c4f248909b56219249b8f0273d7b9b97b01754 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -11,6 +11,8 @@
#include <linux/log2.h>
#include <linux/node.h>
#include <linux/io.h>
+#include <linux/xarray.h>
+#include <cxl/event.h>
extern const struct nvdimm_security_ops *cxl_security_ops;
@@ -169,11 +171,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
#define CXLDEV_EVENT_STATUS_WARN BIT(1)
#define CXLDEV_EVENT_STATUS_FAIL BIT(2)
#define CXLDEV_EVENT_STATUS_FATAL BIT(3)
+#define CXLDEV_EVENT_STATUS_DCD BIT(4)
#define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO | \
CXLDEV_EVENT_STATUS_WARN | \
CXLDEV_EVENT_STATUS_FAIL | \
- CXLDEV_EVENT_STATUS_FATAL)
+ CXLDEV_EVENT_STATUS_FATAL | \
+ CXLDEV_EVENT_STATUS_DCD)
/* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
#define CXLDEV_EVENT_INT_MODE_MASK GENMASK(1, 0)
@@ -452,6 +456,18 @@ enum cxl_decoder_state {
CXL_DECODER_STATE_AUTO,
};
+/**
+ * struct cxled_extent - Extent within an endpoint decoder
+ * @cxled: Reference to the endpoint decoder
+ * @dpa_range: DPA range this extent covers within the decoder
+ * @tag: Tag from device for this extent
+ */
+struct cxled_extent {
+ struct cxl_endpoint_decoder *cxled;
+ struct range dpa_range;
+ u8 tag[CXL_EXTENT_TAG_LEN];
+};
+
/**
* struct cxl_endpoint_decoder - Endpoint / SPA to DPA decoder
* @cxld: base cxl_decoder_object
@@ -577,6 +593,7 @@ struct cxl_region_params {
* @type: Endpoint decoder target type
* @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
* @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
+ * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
* @flags: Region state flags
* @params: active + config params for the region
* @coord: QoS access coordinates for the region
@@ -590,6 +607,7 @@ struct cxl_region {
enum cxl_decoder_type type;
struct cxl_nvdimm_bridge *cxl_nvb;
struct cxl_pmem_region *cxlr_pmem;
+ struct cxl_dax_region *cxlr_dax;
unsigned long flags;
struct cxl_region_params params;
struct access_coordinate coord[ACCESS_COORDINATE_MAX];
@@ -630,12 +648,45 @@ struct cxl_pmem_region {
struct cxl_pmem_region_mapping mapping[];
};
+/* See CXL 3.1 8.2.9.2.1.6 */
+enum dc_event {
+ DCD_ADD_CAPACITY,
+ DCD_RELEASE_CAPACITY,
+ DCD_FORCED_CAPACITY_RELEASE,
+ DCD_REGION_CONFIGURATION_UPDATED,
+};
+
struct cxl_dax_region {
struct device dev;
struct cxl_region *cxlr;
struct range hpa_range;
+ struct ida extent_ida;
};
+/**
+ * struct region_extent - CXL DAX region extent
+ * @dev: device representing this extent
+ * @cxlr_dax: back reference to parent region device
+ * @hpa_range: HPA range of this extent
+ * @tag: tag of the extent
+ * @decoder_extents: Endpoint decoder extents which make up this region extent
+ */
+struct region_extent {
+ struct device dev;
+ struct cxl_dax_region *cxlr_dax;
+ struct range hpa_range;
+ uuid_t tag;
+ struct xarray decoder_extents;
+};
+
+bool is_region_extent(struct device *dev);
+static inline struct region_extent *to_region_extent(struct device *dev)
+{
+ if (!is_region_extent(dev))
+ return NULL;
+ return container_of(dev, struct region_extent, dev);
+}
+
/**
* struct cxl_port - logical collection of upstream port devices and
* downstream port devices to construct a CXL memory
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 863899b295b719b57638ee060e494e5cf2d639fd..73dee28bbd803a8f78686e833f8ef3492ca94e66 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -7,6 +7,7 @@
#include <linux/cdev.h>
#include <linux/uuid.h>
#include <linux/node.h>
+#include <linux/xarray.h>
#include <cxl/event.h>
#include <cxl/mailbox.h>
#include "cxl.h"
@@ -506,6 +507,7 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
* @pmem_perf: performance data entry matched to PMEM partition
* @nr_dc_region: number of DC regions implemented in the memory device
* @dc_region: array containing info about the DC regions
+ * @pending_extents: array of extents pending during more bit processing
* @event: event log driver state
* @poison: poison driver state info
* @security: security driver state info
@@ -538,6 +540,7 @@ struct cxl_memdev_state {
u8 nr_dc_region;
struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
struct cxl_dpa_perf dc_perf[CXL_MAX_DC_REGION];
+ struct xarray pending_extents;
struct cxl_event_state event;
struct cxl_poison_state poison;
@@ -609,6 +612,21 @@ enum cxl_opcode {
UUID_INIT(0x5e1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \
0x40, 0x3d, 0x86)
+/*
+ * Add Dynamic Capacity Response
+ * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
+ */
+struct cxl_mbox_dc_response {
+ __le32 extent_list_size;
+ u8 flags;
+ u8 reserved[3];
+ struct updated_extent_list {
+ __le64 dpa_start;
+ __le64 length;
+ u8 reserved[8];
+ } __packed extent_list[];
+} __packed;
+
struct cxl_mbox_get_supported_logs {
__le16 entries;
u8 rsvd[6];
@@ -671,6 +689,14 @@ struct cxl_mbox_identify {
UUID_INIT(0xfe927475, 0xdd59, 0x4339, 0xa5, 0x86, 0x79, 0xba, 0xb1, \
0x13, 0xb7, 0x74)
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.1 section 8.2.9.2.1; Table 8-43
+ */
+#define CXL_EVENT_DC_EVENT_UUID \
+ UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \
+ 0x10, 0x1a, 0x2a)
+
/*
* Get Event Records output payload
* CXL rev 3.0 section 8.2.9.2.2; Table 8-50
@@ -696,6 +722,7 @@ enum cxl_event_log_type {
CXL_EVENT_TYPE_WARN,
CXL_EVENT_TYPE_FAIL,
CXL_EVENT_TYPE_FATAL,
+ CXL_EVENT_TYPE_DCD,
CXL_EVENT_TYPE_MAX
};
diff --git a/include/cxl/event.h b/include/cxl/event.h
index 0bea1afbd747c4937b15703b581c569e7fa45ae4..eeda8059d81abef2fbf28cd3f3a6e516c9710229 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -96,11 +96,43 @@ struct cxl_event_mem_module {
u8 reserved[0x3d];
} __packed;
+/*
+ * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
+ */
+#define CXL_EXTENT_TAG_LEN 0x10
+struct cxl_extent {
+ __le64 start_dpa;
+ __le64 length;
+ u8 tag[CXL_EXTENT_TAG_LEN];
+ __le16 shared_extn_seq;
+ u8 reserved[0x6];
+} __packed;
+
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
+ */
+#define CXL_DCD_EVENT_MORE BIT(0)
+struct cxl_event_dcd {
+ struct cxl_event_record_hdr hdr;
+ u8 event_type;
+ u8 validity_flags;
+ __le16 host_id;
+ u8 region_index;
+ u8 flags;
+ u8 reserved1[0x2];
+ struct cxl_extent extent;
+ u8 reserved2[0x18];
+ __le32 num_avail_extents;
+ __le32 num_avail_tags;
+} __packed;
+
union cxl_event {
struct cxl_event_generic generic;
struct cxl_event_gen_media gen_media;
struct cxl_event_dram dram;
struct cxl_event_mem_module mem_module;
+ struct cxl_event_dcd dcd;
/* dram & gen_media event header */
struct cxl_event_media_hdr media_hdr;
} __packed;
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index b1256fee3567fc7743812ee14bc46e09b7c8ba9b..bfa19587fd763ed552c2b9aa1a6e8981b6aa1c40 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -62,7 +62,8 @@ cxl_core-y += $(CXL_CORE_SRC)/hdm.o
cxl_core-y += $(CXL_CORE_SRC)/pmu.o
cxl_core-y += $(CXL_CORE_SRC)/cdat.o
cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
-cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o
+cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o \
+ $(CXL_CORE_SRC)/extent.o
cxl_core-y += config_check.o
cxl_core-y += cxl_core_test.o
cxl_core-y += cxl_core_exports.o
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 15/21] cxl/region/extent: Expose region extent information in sysfs
2024-12-11 3:42 [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (13 preceding siblings ...)
2024-12-11 3:42 ` [PATCH v8 14/21] cxl/extent: Process DCD events and realize region extents Ira Weiny
@ 2024-12-11 3:42 ` Ira Weiny
2024-12-11 3:42 ` [PATCH v8 16/21] dax/bus: Factor out dev dax resize logic Ira Weiny
` (5 subsequent siblings)
20 siblings, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening
Extent information can help the user coordinate memory usage with the
external orchestrator and FM.
Expose the details of region extents by creating the following
sysfs entries.
/sys/bus/cxl/devices/dax_regionX/extentX.Y
/sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
/sys/bus/cxl/devices/dax_regionX/extentX.Y/length
/sys/bus/cxl/devices/dax_regionX/extentX.Y/tag
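For example, a tool coordinating with the orchestrator can read these
attributes directly. A minimal user space sketch (the region and extent
numbers, and the helper itself, are illustrative only):

  /* Sketch: read one extent attribute into buf */
  #include <stdio.h>

  static int read_extent_attr(const char *attr, char *buf, size_t len)
  {
          char path[128];
          FILE *f;

          /* dax_region0/extent0.0 is just an example instance */
          snprintf(path, sizeof(path),
                   "/sys/bus/cxl/devices/dax_region0/extent0.0/%s", attr);
          f = fopen(path, "r");
          if (!f)
                  return -1;
          if (!fgets(buf, len, f)) {
                  fclose(f);
                  return -1;
          }
          fclose(f);
          return 0;
  }

Reading offset, length, and tag this way lets user space correlate
extents with the mappings a DAX device reports.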
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Documentation/ABI/testing/sysfs-bus-cxl | 33 +++++++++++++++++++
drivers/cxl/core/extent.c | 58 +++++++++++++++++++++++++++++++++
2 files changed, 91 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index aeff248ea368cf49c9977fcaf43ab4def978e896..ee2ef4ea33e17cbc65e1252753f46f6d0dce1aee 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -632,3 +632,36 @@ Description:
See Documentation/ABI/stable/sysfs-devices-node. access0 provides
the number to the closest initiator and access1 provides the
number to the closest CPU.
+
+What: /sys/bus/cxl/devices/dax_regionX/extentX.Y/offset
+Date: December, 2024
+KernelVersion: v6.13
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) [For Dynamic Capacity regions only] Extent offset within
+ the region. Users can use the extent information to create DAX
+ devices on specific extents. This is done by creating and
+ destroying DAX devices in specific sequences and looking at the
+ mappings created.
+
+What: /sys/bus/cxl/devices/dax_regionX/extentX.Y/length
+Date: December, 2024
+KernelVersion: v6.13
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) [For Dynamic Capacity regions only] Extent length within
+ the region. Users can use the extent information to create DAX
+ devices on specific extents. This is done by creating and
+ destroying DAX devices in specific sequences and looking at the
+ mappings created.
+
+What: /sys/bus/cxl/devices/dax_regionX/extentX.Y/tag
+Date: December, 2024
+KernelVersion: v6.13
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) [For Dynamic Capacity regions only] UUID tag of the
+ extent. Users can use the extent information to create DAX
+ devices on specific extents. This is done by creating and
+ destroying DAX devices in specific sequences and looking at the
+ mappings created.
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index a45ff84727b0f8c2567f0d2dd8b5c261b23695e3..0ebdbe983d094de89579527459cd75e3e7e2b6c7 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -6,6 +6,63 @@
#include "core.h"
+static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+
+ return sysfs_emit(buf, "%#llx\n", region_extent->hpa_range.start);
+}
+static DEVICE_ATTR_RO(offset);
+
+static ssize_t length_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+ u64 length = range_len(&region_extent->hpa_range);
+
+ return sysfs_emit(buf, "%#llx\n", length);
+}
+static DEVICE_ATTR_RO(length);
+
+static ssize_t tag_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct region_extent *region_extent = to_region_extent(dev);
+
+ return sysfs_emit(buf, "%pUb\n", ®ion_extent->tag);
+}
+static DEVICE_ATTR_RO(tag);
+
+static struct attribute *region_extent_attrs[] = {
+ &dev_attr_offset.attr,
+ &dev_attr_length.attr,
+ &dev_attr_tag.attr,
+ NULL
+};
+
+static uuid_t empty_tag = { 0 };
+
+static umode_t region_extent_visible(struct kobject *kobj,
+ struct attribute *a, int n)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct region_extent *region_extent = to_region_extent(dev);
+
+ if (a == &dev_attr_tag.attr &&
+ uuid_equal(&region_extent->tag, &empty_tag))
+ return 0;
+
+ return a->mode;
+}
+
+static const struct attribute_group region_extent_attribute_group = {
+ .attrs = region_extent_attrs,
+ .is_visible = region_extent_visible,
+};
+
+__ATTRIBUTE_GROUPS(region_extent_attribute);
+
static void cxled_release_extent(struct cxl_endpoint_decoder *cxled,
struct cxled_extent *ed_extent)
{
@@ -44,6 +101,7 @@ static void region_extent_release(struct device *dev)
static const struct device_type region_extent_type = {
.name = "extent",
.release = region_extent_release,
+ .groups = region_extent_attribute_groups,
};
bool is_region_extent(struct device *dev)
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 16/21] dax/bus: Factor out dev dax resize logic
2024-12-11 3:42 [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (14 preceding siblings ...)
2024-12-11 3:42 ` [PATCH v8 15/21] cxl/region/extent: Expose region extent information in sysfs Ira Weiny
@ 2024-12-11 3:42 ` Ira Weiny
2024-12-11 3:42 ` [PATCH v8 17/21] dax/region: Create resources on sparse DAX regions Ira Weiny
` (4 subsequent siblings)
20 siblings, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening
Dynamic Capacity regions must limit dev dax resources to those areas
which have extents backing real memory. Such DAX regions are dubbed
'sparse' regions. In order to manage where memory is available, four
alternatives were considered:
1) Create a single region resource child on region creation which
reserves the entire region. Then as extents are added punch holes in
this reservation. This requires new resource manipulation to punch
the holes and still requires an additional iteration over the extent
areas which may already have dev dax resources in use.
2) Maintain an ordered xarray of extents which can be queried while
processing the resize logic. The issue is that existing region->res
children may artificially limit the allocation size sent to
alloc_dev_dax_range(). I.e., the resource children can't be used
directly in the resize logic to find where space in the region is.
This also poses the problem of managing the available size in two
places.
3) Maintain a separate resource tree with extents. This option is the
same as 2) but with a different data structure. Ideally there would
be a unified representation of the resource tree, not two places to
look for space.
4) Create region resource children for each extent. Manage the dax dev
resize logic in the same way as before but use a region child
(extent) resource as the parent to find space within each extent.
Option 4 can leverage the existing resize algorithm to find space within
the extents. It manages the available space in a single resource
tree, which makes finding space less complicated.
In preparation for this change, factor out the dev_dax_resize logic.
For static regions use dax_region->res as the parent to find space for
the dax ranges. Future patches will use the same algorithm with
individual extent resources as the parent.
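At its core the resize algorithm below is a walk over a parent
resource's sorted children looking for free gaps. A minimal sketch of
that idea, assuming children are kept sorted by start address (the
kernel code additionally adjusts an existing allocation in place when
possible):

  #include <linux/ioport.h>

  /* Sketch only: report the first free gap inside @parent */
  static resource_size_t first_free_gap(struct resource *parent,
                                        resource_size_t *start)
  {
          struct resource *res = parent->child;
          resource_size_t pos = parent->start;

          /* advance past each allocated child until a hole appears */
          for (; res; pos = res->end + 1, res = res->sibling)
                  if (res->start > pos)
                          break;

          if (pos > parent->end)
                  return 0;       /* parent is fully allocated */

          *start = pos;
          return (res ? res->start - 1 : parent->end) - pos + 1;
  }

With &dax_region->res as @parent this mirrors the static region case;
a later patch passes individual extent resources instead.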
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/dax/bus.c | 130 +++++++++++++++++++++++++++++++++---------------------
1 file changed, 80 insertions(+), 50 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index d8cb5195a227c0f6194cb210510e006327e1b35b..c25942a3d1255cb5e5bf8d213e62933281ff3e4f 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -844,11 +844,9 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
return 0;
}
-static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
- resource_size_t size)
+static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
+ u64 start, resource_size_t size)
{
- struct dax_region *dax_region = dev_dax->region;
- struct resource *res = &dax_region->res;
struct device *dev = &dev_dax->dev;
struct dev_dax_range *ranges;
unsigned long pgoff = 0;
@@ -866,14 +864,14 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
return 0;
}
- alloc = __request_region(res, start, size, dev_name(dev), 0);
+ alloc = __request_region(parent, start, size, dev_name(dev), 0);
if (!alloc)
return -ENOMEM;
ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
* (dev_dax->nr_range + 1), GFP_KERNEL);
if (!ranges) {
- __release_region(res, alloc->start, resource_size(alloc));
+ __release_region(parent, alloc->start, resource_size(alloc));
return -ENOMEM;
}
@@ -1026,50 +1024,45 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
return true;
}
-static ssize_t dev_dax_resize(struct dax_region *dax_region,
- struct dev_dax *dev_dax, resource_size_t size)
+/**
+ * dev_dax_resize_static - Expand the device into the unused portion of the
+ * region. This may involve adjusting the end of an existing resource, or
+ * allocating a new resource.
+ *
+ * @parent: parent resource to allocate this range in
+ * @dev_dax: DAX device to be expanded
+ * @to_alloc: amount of space to alloc; must be <= space available in @parent
+ *
+ * Return the amount of space allocated or -ERRNO on failure
+ */
+static ssize_t dev_dax_resize_static(struct resource *parent,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc)
{
- resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
- resource_size_t dev_size = dev_dax_size(dev_dax);
- struct resource *region_res = &dax_region->res;
- struct device *dev = &dev_dax->dev;
struct resource *res, *first;
- resource_size_t alloc = 0;
int rc;
- if (dev->driver)
- return -EBUSY;
- if (size == dev_size)
- return 0;
- if (size > dev_size && size - dev_size > avail)
- return -ENOSPC;
- if (size < dev_size)
- return dev_dax_shrink(dev_dax, size);
-
- to_alloc = size - dev_size;
- if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
- "resize of %pa misaligned\n", &to_alloc))
- return -ENXIO;
-
- /*
- * Expand the device into the unused portion of the region. This
- * may involve adjusting the end of an existing resource, or
- * allocating a new resource.
- */
-retry:
- first = region_res->child;
- if (!first)
- return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
+ first = parent->child;
+ if (!first) {
+ rc = alloc_dev_dax_range(parent, dev_dax,
+ parent->start, to_alloc);
+ if (rc)
+ return rc;
+ return to_alloc;
+ }
- rc = -ENOSPC;
for (res = first; res; res = res->sibling) {
struct resource *next = res->sibling;
+ resource_size_t alloc;
/* space at the beginning of the region */
- if (res == first && res->start > dax_region->res.start) {
- alloc = min(res->start - dax_region->res.start, to_alloc);
- rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc);
- break;
+ if (res == first && res->start > parent->start) {
+ alloc = min(res->start - parent->start, to_alloc);
+ rc = alloc_dev_dax_range(parent, dev_dax,
+ parent->start, alloc);
+ if (rc)
+ return rc;
+ return alloc;
}
alloc = 0;
@@ -1078,21 +1071,56 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
alloc = min(next->start - (res->end + 1), to_alloc);
/* space at the end of the region */
- if (!alloc && !next && res->end < region_res->end)
- alloc = min(region_res->end - res->end, to_alloc);
+ if (!alloc && !next && res->end < parent->end)
+ alloc = min(parent->end - res->end, to_alloc);
if (!alloc)
continue;
if (adjust_ok(dev_dax, res)) {
rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
- break;
+ if (rc)
+ return rc;
+ return alloc;
}
- rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc);
- break;
+ rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
+ if (rc)
+ return rc;
+ return alloc;
}
- if (rc)
- return rc;
+
+ /* available was already calculated and should never be an issue */
+ dev_WARN_ONCE(&dev_dax->dev, 1, "space not found?");
+ return 0;
+}
+
+static ssize_t dev_dax_resize(struct dax_region *dax_region,
+ struct dev_dax *dev_dax, resource_size_t size)
+{
+ resource_size_t avail = dax_region_avail_size(dax_region);
+ resource_size_t dev_size = dev_dax_size(dev_dax);
+ struct device *dev = &dev_dax->dev;
+ resource_size_t to_alloc;
+ resource_size_t alloc;
+
+ if (dev->driver)
+ return -EBUSY;
+ if (size == dev_size)
+ return 0;
+ if (size > dev_size && size - dev_size > avail)
+ return -ENOSPC;
+ if (size < dev_size)
+ return dev_dax_shrink(dev_dax, size);
+
+ to_alloc = size - dev_size;
+ if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
+ "resize of %pa misaligned\n", &to_alloc))
+ return -ENXIO;
+
+retry:
+ alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+ if (alloc <= 0)
+ return alloc;
to_alloc -= alloc;
if (to_alloc)
goto retry;
@@ -1198,7 +1226,8 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
to_alloc = range_len(&r);
if (alloc_is_aligned(dev_dax, to_alloc))
- rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
+ rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
+ to_alloc);
up_write(&dax_dev_rwsem);
up_write(&dax_region_rwsem);
@@ -1466,7 +1495,8 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
device_initialize(dev);
dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
- rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size);
+ rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
+ data->size);
if (rc)
goto err_range;
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 17/21] dax/region: Create resources on sparse DAX regions
2024-12-11 3:42 [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
` (15 preceding siblings ...)
2024-12-11 3:42 ` [PATCH v8 16/21] dax/bus: Factor out dev dax resize logic Ira Weiny
@ 2024-12-11 3:42 ` Ira Weiny
2024-12-11 3:42 ` [PATCH v8 18/21] cxl/region: Read existing extents on region creation Ira Weiny
` (3 subsequent siblings)
20 siblings, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening
DAX regions which map dynamic capacity partitions require that memory be
allowed to come and go. Recall that sparse regions were created for
this purpose. Now that extents can be realized within DAX regions, the
DAX region driver can start tracking sub-resource information.
The tight relationship between DAX region operations and extent
operations requires memory changes to be controlled synchronously with
the user of the region. Synchronize through the dax_region_rwsem and by
having the region driver drive both the region device as well as the
extent sub-devices.
Recall that requests to remove extents can happen at any time and that a
host is not obligated to release the memory until it is no longer in
use. If an extent is not in use, allow a release response.
When extents are eligible for release, no mappings exist, but data may
reside in caches and may not yet have been written to the device. Call
cxl_region_invalidate_memregion() to write back data to the device prior
to signaling the release complete.
Speculative writes after a release may dirty the cache such that a read
from a newly surfaced extent may not come from the device. Call
cxl_region_invalidate_memregion() prior to bringing a new extent online
to ensure the cache is marked invalid.
While these invalidate calls are inefficient, they are the best we can
do to ensure cache consistency without back invalidation. Furthermore,
with sufficiently large extents and workloads, this should occur
infrequently enough to not have much impact.
The DAX layer has no need for the details of the CXL memory extent
devices. Expose extents to the DAX layer as device children of the DAX
region device. A single callback from the driver helps the DAX layer
determine if the child device is an extent. The DAX layer also
registers a devres function to automatically clean up when the device is
removed from the region.
There is a race between extents being surfaced and the dax_cxl driver
being loaded. The driver must therefore scan for any existing extents
while still under the device lock.
Respond to extent notifications. Manage the DAX region resource tree
based on the extents' lifetime. Return the status of remove
notifications to the lower layers so they can manage the hardware
appropriately.
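A sketch of the shape such a notify implementation can take (the
function names here are illustrative, not the actual dax_cxl code,
which follows in the diff):

  #include <linux/errno.h>
  #include "cxl.h"   /* struct cxl_notify_data, enum dc_event */

  static int example_notify(struct device *dev,
                            struct cxl_notify_data *notify_data)
  {
          switch (notify_data->event) {
          case DCD_ADD_CAPACITY:
                  /* surface notify_data->region_extent to the DAX layer */
                  return 0;
          case DCD_RELEASE_CAPACITY:
                  /* returning -EBUSY here keeps the extent alive */
                  return 0;
          default:
                  return -ENXIO;
          }
  }

A driver opts in by setting the new .notify member of its cxl_driver;
cxlr_notify_extent() treats a missing callback as a successful no-op.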
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: convert range prints to %pra]
---
drivers/cxl/core/core.h | 2 +
drivers/cxl/core/extent.c | 83 ++++++++++++++--
drivers/cxl/core/region.c | 2 +-
drivers/cxl/cxl.h | 6 ++
drivers/dax/bus.c | 246 +++++++++++++++++++++++++++++++++++++++++-----
drivers/dax/bus.h | 3 +-
drivers/dax/cxl.c | 61 +++++++++++-
drivers/dax/dax-private.h | 40 ++++++++
drivers/dax/hmem/hmem.c | 2 +-
drivers/dax/pmem.c | 2 +-
include/linux/ioport.h | 3 +
11 files changed, 411 insertions(+), 39 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 943869e8dd7da0f8e0b9970f323392006048ac41..fb49d00e53861a252eb47db7a82415d724da6701 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -21,6 +21,8 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
return container_of(cxlds, struct cxl_memdev_state, cxlds);
}
+int cxl_region_invalidate_memregion(struct cxl_region *cxlr);
+
#ifdef CONFIG_CXL_REGION
extern struct device_attribute dev_attr_create_pmem_region;
extern struct device_attribute dev_attr_create_ram_region;
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 0ebdbe983d094de89579527459cd75e3e7e2b6c7..ac35720597866b6f967a34d96ef4e73263f22e87 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -116,6 +116,12 @@ static void region_extent_unregister(void *ext)
dev_dbg(&region_extent->dev, "DAX region rm extent HPA %pra\n",
&region_extent->hpa_range);
+ /*
+ * Extent is not in use or an error has occurred. No mappings
+ * exist at this point. Write and invalidate caches to ensure
+ * the device has all data prior to final release.
+ */
+ cxl_region_invalidate_memregion(region_extent->cxlr_dax->cxlr);
device_unregister(&region_extent->dev);
}
@@ -268,20 +274,65 @@ static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
}
+static int cxlr_notify_extent(struct cxl_region *cxlr, enum dc_event event,
+ struct region_extent *region_extent)
+{
+ struct device *dev = &cxlr->cxlr_dax->dev;
+ struct cxl_notify_data notify_data;
+ struct cxl_driver *driver;
+
+ dev_dbg(dev, "Trying notify: type %d HPA %pra\n", event,
+ &region_extent->hpa_range);
+
+ guard(device)(dev);
+
+ /*
+ * The lack of a driver means the notification cannot be delivered; no
+ * user space coordination was possible.
+ */
+ if (!dev->driver)
+ return 0;
+ driver = to_cxl_drv(dev->driver);
+ if (!driver->notify)
+ return 0;
+
+ notify_data = (struct cxl_notify_data) {
+ .event = event,
+ .region_extent = region_extent,
+ };
+
+ dev_dbg(dev, "Notify: type %d HPA %pra\n", event,
+ &region_extent->hpa_range);
+ return driver->notify(dev, &notify_data);
+}
+
+struct rm_data {
+ struct cxl_region *cxlr;
+ struct range *range;
+};
+
static int cxlr_rm_extent(struct device *dev, void *data)
{
struct region_extent *region_extent = to_region_extent(dev);
- struct range *region_hpa_range = data;
+ struct rm_data *rm_data = data;
+ int rc;
if (!region_extent)
return 0;
/*
- * Any extent which 'touches' the released range is removed.
+ * Any extent which 'touches' the released range is attempted to be
+ * removed.
*/
- if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) {
+ if (range_overlaps(rm_data->range, &region_extent->hpa_range)) {
+ struct cxl_region *cxlr = rm_data->cxlr;
+
dev_dbg(dev, "Remove region extent HPA %pra\n",
&region_extent->hpa_range);
+ rc = cxlr_notify_extent(cxlr, DCD_RELEASE_CAPACITY, region_extent);
+ if (rc == -EBUSY)
+ return 0;
+
region_rm_extent(region_extent);
}
return 0;
@@ -326,8 +377,13 @@ int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
+ struct rm_data rm_data = {
+ .cxlr = cxlr,
+ .range = &hpa_range,
+ };
+
/* Remove region extents which overlap */
- return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
+ return device_for_each_child(&cxlr->cxlr_dax->dev, &rm_data,
cxlr_rm_extent);
}
@@ -352,8 +408,23 @@ static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
return rc;
}
- /* device model handles freeing region_extent */
- return online_region_extent(region_extent);
+ /* Ensure caches are clean prior to onlining */
+ cxl_region_invalidate_memregion(cxlr_dax->cxlr);
+
+ rc = online_region_extent(region_extent);
+ /* on error the device model has already freed region_extent */
+ if (rc)
+ return rc;
+
+ rc = cxlr_notify_extent(cxlr_dax->cxlr, DCD_ADD_CAPACITY, region_extent);
+ /*
+ * The region device was briefly live but DAX layer ensures it was not
+ * used
+ */
+ if (rc)
+ region_rm_extent(region_extent);
+
+ return rc;
}
/* Callers are expected to ensure cxled has been attached to a region */
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 608c90ac2507b2dc4a50daa66c382939bd7b2c74..f7e47a82fa2bd1b245081428fc515fb464993aa5 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -223,7 +223,7 @@ static struct cxl_region_ref *cxl_rr_load(struct cxl_port *port,
return xa_load(&port->regions, (unsigned long)cxlr);
}
-static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
+int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
{
if (!cpu_cache_has_invalidate_memregion()) {
if (IS_ENABLED(CONFIG_CXL_REGION_INVALIDATION_TEST)) {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index d5c4f248909b56219249b8f0273d7b9b97b01754..e54977ae1e8062d0ff3d0974561b1076236d1b9f 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -928,10 +928,16 @@ bool is_cxl_region(struct device *dev);
extern struct bus_type cxl_bus_type;
+struct cxl_notify_data {
+ enum dc_event event;
+ struct region_extent *region_extent;
+};
+
struct cxl_driver {
const char *name;
int (*probe)(struct device *dev);
void (*remove)(struct device *dev);
+ int (*notify)(struct device *dev, struct cxl_notify_data *notify_data);
struct device_driver drv;
int id;
};
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index c25942a3d1255cb5e5bf8d213e62933281ff3e4f..a54961fc393d71eda4a26f871597c6ffbb2023f8 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -183,6 +183,93 @@ static bool is_sparse(struct dax_region *dax_region)
return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
}
+static void __dax_release_resource(struct dax_resource *dax_resource)
+{
+ struct dax_region *dax_region = dax_resource->region;
+
+ lockdep_assert_held_write(&dax_region_rwsem);
+ dev_dbg(dax_region->dev, "Extent release resource %pr\n",
+ dax_resource->res);
+ if (dax_resource->res)
+ __release_region(&dax_region->res, dax_resource->res->start,
+ resource_size(dax_resource->res));
+ dax_resource->res = NULL;
+}
+
+static void dax_release_resource(void *res)
+{
+ struct dax_resource *dax_resource = res;
+
+ guard(rwsem_write)(&dax_region_rwsem);
+ __dax_release_resource(dax_resource);
+ kfree(dax_resource);
+}
+
+int dax_region_add_resource(struct dax_region *dax_region,
+ struct device *device,
+ resource_size_t start, resource_size_t length)
+{
+ struct resource *new_resource;
+ int rc;
+
+ struct dax_resource *dax_resource __free(kfree) =
+ kzalloc(sizeof(*dax_resource), GFP_KERNEL);
+ if (!dax_resource)
+ return -ENOMEM;
+
+ guard(rwsem_write)(&dax_region_rwsem);
+
+ dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
+ new_resource = __request_region(&dax_region->res, start, length, "extent", 0);
+ if (!new_resource) {
+ dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
+ &start, &length);
+ return -ENOSPC;
+ }
+
+ dev_dbg(dax_region->dev, "add resource %pr\n", new_resource);
+ dax_resource->region = dax_region;
+ dax_resource->res = new_resource;
+
+ /*
+ * open code devm_add_action_or_reset() to avoid recursive write lock
+ * of dax_region_rwsem in the error case.
+ */
+ rc = devm_add_action(device, dax_release_resource, dax_resource);
+ if (rc) {
+ __dax_release_resource(dax_resource);
+ return rc;
+ }
+
+ dev_set_drvdata(device, no_free_ptr(dax_resource));
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_region_add_resource);
+
+int dax_region_rm_resource(struct dax_region *dax_region,
+ struct device *dev)
+{
+ struct dax_resource *dax_resource;
+
+ guard(rwsem_write)(&dax_region_rwsem);
+
+ dax_resource = dev_get_drvdata(dev);
+ if (!dax_resource)
+ return 0;
+
+ if (dax_resource->use_cnt)
+ return -EBUSY;
+
+ /*
+ * release the resource under dax_region_rwsem to avoid races with
+ * users trying to use the extent
+ */
+ __dax_release_resource(dax_resource);
+ dev_set_drvdata(dev, NULL);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_region_rm_resource);
+
bool static_dev_dax(struct dev_dax *dev_dax)
{
return is_static(dev_dax->region);
@@ -296,19 +383,41 @@ static ssize_t region_align_show(struct device *dev,
static struct device_attribute dev_attr_region_align =
__ATTR(align, 0400, region_align_show, NULL);
+resource_size_t
+dax_avail_size(struct resource *dax_resource)
+{
+ resource_size_t rc;
+ struct resource *used_res;
+
+ rc = resource_size(dax_resource);
+ for_each_child_resource(dax_resource, used_res)
+ rc -= resource_size(used_res);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(dax_avail_size);
+
#define for_each_dax_region_resource(dax_region, res) \
for (res = (dax_region)->res.child; res; res = res->sibling)
static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
{
- resource_size_t size = resource_size(&dax_region->res);
+ resource_size_t size;
struct resource *res;
lockdep_assert_held(&dax_region_rwsem);
- if (is_sparse(dax_region))
- return 0;
+ if (is_sparse(dax_region)) {
+ /*
+ * Children of a sparse region represent available space, not
+ * used space.
+ */
+ size = 0;
+ for_each_dax_region_resource(dax_region, res)
+ size += dax_avail_size(res);
+ return size;
+ }
+ size = resource_size(&dax_region->res);
for_each_dax_region_resource(dax_region, res)
size -= resource_size(res);
return size;
@@ -449,15 +558,26 @@ EXPORT_SYMBOL_GPL(kill_dev_dax);
static void trim_dev_dax_range(struct dev_dax *dev_dax)
{
int i = dev_dax->nr_range - 1;
- struct range *range = &dev_dax->ranges[i].range;
+ struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+ struct range *range = &dev_range->range;
struct dax_region *dax_region = dev_dax->region;
+ struct resource *res = &dax_region->res;
lockdep_assert_held_write(&dax_region_rwsem);
dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
(unsigned long long)range->start,
(unsigned long long)range->end);
- __release_region(&dax_region->res, range->start, range_len(range));
+ if (dev_range->dax_resource) {
+ res = dev_range->dax_resource->res;
+ dev_dbg(&dev_dax->dev, "Trim sparse extent %pr\n", res);
+ }
+
+ __release_region(res, range->start, range_len(range));
+
+ if (dev_range->dax_resource)
+ dev_range->dax_resource->use_cnt--;
+
if (--dev_dax->nr_range == 0) {
kfree(dev_dax->ranges);
dev_dax->ranges = NULL;
@@ -640,7 +760,7 @@ static void dax_region_unregister(void *region)
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct range *range, int target_node, unsigned int align,
- unsigned long flags)
+ unsigned long flags, struct dax_sparse_ops *sparse_ops)
{
struct dax_region *dax_region;
@@ -658,12 +778,16 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
|| !IS_ALIGNED(range_len(range), align))
return NULL;
+ if (!sparse_ops && (flags & IORESOURCE_DAX_SPARSE_CAP))
+ return NULL;
+
dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
if (!dax_region)
return NULL;
dev_set_drvdata(parent, dax_region);
kref_init(&dax_region->kref);
+ dax_region->sparse_ops = sparse_ops;
dax_region->id = region_id;
dax_region->align = align;
dax_region->dev = parent;
@@ -845,7 +969,8 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
}
static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
- u64 start, resource_size_t size)
+ u64 start, resource_size_t size,
+ struct dax_resource *dax_resource)
{
struct device *dev = &dev_dax->dev;
struct dev_dax_range *ranges;
@@ -884,6 +1009,7 @@ static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
.start = alloc->start,
.end = alloc->end,
},
+ .dax_resource = dax_resource,
};
dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
@@ -966,7 +1092,8 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
int i;
for (i = dev_dax->nr_range - 1; i >= 0; i--) {
- struct range *range = &dev_dax->ranges[i].range;
+ struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+ struct range *range = &dev_range->range;
struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
struct resource *adjust = NULL, *res;
resource_size_t shrink;
@@ -982,12 +1109,21 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
continue;
}
- for_each_dax_region_resource(dax_region, res)
- if (strcmp(res->name, dev_name(dev)) == 0
- && res->start == range->start) {
- adjust = res;
- break;
- }
+ if (dev_range->dax_resource) {
+ for_each_child_resource(dev_range->dax_resource->res, res)
+ if (strcmp(res->name, dev_name(dev)) == 0
+ && res->start == range->start) {
+ adjust = res;
+ break;
+ }
+ } else {
+ for_each_dax_region_resource(dax_region, res)
+ if (strcmp(res->name, dev_name(dev)) == 0
+ && res->start == range->start) {
+ adjust = res;
+ break;
+ }
+ }
if (dev_WARN_ONCE(dev, !adjust || i != dev_dax->nr_range - 1,
"failed to find matching resource\n"))
@@ -1025,19 +1161,21 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
}
/**
- * dev_dax_resize_static - Expand the device into the unused portion of the
- * region. This may involve adjusting the end of an existing resource, or
- * allocating a new resource.
+ * __dev_dax_resize - Expand the device into the unused portion of the region.
+ * This may involve adjusting the end of an existing resource, or allocating a
+ * new resource.
*
* @parent: parent resource to allocate this range in
* @dev_dax: DAX device to be expanded
* @to_alloc: amount of space to alloc; must be <= space available in @parent
+ * @dax_resource: if sparse, the parent resource
*
* Return the amount of space allocated or -ERRNO on failure
*/
-static ssize_t dev_dax_resize_static(struct resource *parent,
- struct dev_dax *dev_dax,
- resource_size_t to_alloc)
+static ssize_t __dev_dax_resize(struct resource *parent,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc,
+ struct dax_resource *dax_resource)
{
struct resource *res, *first;
int rc;
@@ -1045,7 +1183,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
first = parent->child;
if (!first) {
rc = alloc_dev_dax_range(parent, dev_dax,
- parent->start, to_alloc);
+ parent->start, to_alloc,
+ dax_resource);
if (rc)
return rc;
return to_alloc;
@@ -1059,7 +1198,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
if (res == first && res->start > parent->start) {
alloc = min(res->start - parent->start, to_alloc);
rc = alloc_dev_dax_range(parent, dev_dax,
- parent->start, alloc);
+ parent->start, alloc,
+ dax_resource);
if (rc)
return rc;
return alloc;
@@ -1083,7 +1223,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
return rc;
return alloc;
}
- rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
+ rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc,
+ dax_resource);
if (rc)
return rc;
return alloc;
@@ -1094,6 +1235,51 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
return 0;
}
+static ssize_t dev_dax_resize_static(struct dax_region *dax_region,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc)
+{
+ return __dev_dax_resize(&dax_region->res, dev_dax, to_alloc, NULL);
+}
+
+static int find_free_extent(struct device *dev, void *data)
+{
+ struct dax_region *dax_region = data;
+ struct dax_resource *dax_resource;
+
+ if (!dax_region->sparse_ops->is_extent(dev))
+ return 0;
+
+ dax_resource = dev_get_drvdata(dev);
+ if (!dax_resource || !dax_avail_size(dax_resource->res))
+ return 0;
+ return 1;
+}
+
+static ssize_t dev_dax_resize_sparse(struct dax_region *dax_region,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc)
+{
+ struct dax_resource *dax_resource;
+ ssize_t alloc;
+
+ struct device *extent_dev __free(put_device) =
+ device_find_child(dax_region->dev, dax_region,
+ find_free_extent);
+ if (!extent_dev)
+ return 0;
+
+ dax_resource = dev_get_drvdata(extent_dev);
+ if (!dax_resource)
+ return 0;
+
+ to_alloc = min(dax_avail_size(dax_resource->res), to_alloc);
+ alloc = __dev_dax_resize(dax_resource->res, dev_dax, to_alloc, dax_resource);
+ if (alloc > 0)
+ dax_resource->use_cnt++;
+ return alloc;
+}
+
static ssize_t dev_dax_resize(struct dax_region *dax_region,
struct dev_dax *dev_dax, resource_size_t size)
{
@@ -1118,7 +1304,10 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
return -ENXIO;
retry:
- alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+ if (is_sparse(dax_region))
+ alloc = dev_dax_resize_sparse(dax_region, dev_dax, to_alloc);
+ else
+ alloc = dev_dax_resize_static(dax_region, dev_dax, to_alloc);
if (alloc <= 0)
return alloc;
to_alloc -= alloc;
@@ -1227,7 +1416,7 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
to_alloc = range_len(&r);
if (alloc_is_aligned(dev_dax, to_alloc))
rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
- to_alloc);
+ to_alloc, NULL);
up_write(&dax_dev_rwsem);
up_write(&dax_region_rwsem);
@@ -1466,6 +1655,11 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
struct device *dev;
int rc;
+ if (is_sparse(dax_region) && data->size) {
+ dev_err(parent, "Sparse DAX region devices must be created initially with 0 size\n");
+ return ERR_PTR(-EINVAL);
+ }
+
dev_dax = kzalloc(sizeof(*dev_dax), GFP_KERNEL);
if (!dev_dax)
return ERR_PTR(-ENOMEM);
@@ -1496,7 +1690,7 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
- data->size);
+ data->size, NULL);
if (rc)
goto err_range;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 783bfeef42cc6c4d74f24e0a69dac5598eaf1664..ae5029ea6047c5c640a504e1bb3d815a75498a3a 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -9,6 +9,7 @@ struct dev_dax;
struct resource;
struct dax_device;
struct dax_region;
+struct dax_sparse_ops;
/* dax bus specific ioresource flags */
#define IORESOURCE_DAX_STATIC BIT(0)
@@ -17,7 +18,7 @@ struct dax_region;
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct range *range, int target_node, unsigned int align,
- unsigned long flags);
+ unsigned long flags, struct dax_sparse_ops *sparse_ops);
struct dev_dax_data {
struct dax_region *dax_region;
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index b4d1ca9b4e9b5105404c6d342522ad73d9fbf8a9..50c945a047ecf2411bd6cbd7f959d032a3e5f1b1 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -5,6 +5,57 @@
#include "../cxl/cxl.h"
#include "bus.h"
+#include "dax-private.h"
+
+static int __cxl_dax_add_resource(struct dax_region *dax_region,
+ struct region_extent *region_extent)
+{
+ struct device *dev = &region_extent->dev;
+ resource_size_t start, length;
+
+ start = dax_region->res.start + region_extent->hpa_range.start;
+ length = range_len(&region_extent->hpa_range);
+ return dax_region_add_resource(dax_region, dev, start, length);
+}
+
+static int cxl_dax_add_resource(struct device *dev, void *data)
+{
+ struct dax_region *dax_region = data;
+ struct region_extent *region_extent;
+
+ region_extent = to_region_extent(dev);
+ if (!region_extent)
+ return 0;
+
+ dev_dbg(dax_region->dev, "Adding resource HPA %pra\n",
+ &region_extent->hpa_range);
+
+ return __cxl_dax_add_resource(dax_region, region_extent);
+}
+
+static int cxl_dax_region_notify(struct device *dev,
+ struct cxl_notify_data *notify_data)
+{
+ struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+ struct dax_region *dax_region = dev_get_drvdata(dev);
+ struct region_extent *region_extent = notify_data->region_extent;
+
+ switch (notify_data->event) {
+ case DCD_ADD_CAPACITY:
+ return __cxl_dax_add_resource(dax_region, region_extent);
+ case DCD_RELEASE_CAPACITY:
+ return dax_region_rm_resource(dax_region, &region_extent->dev);
+ case DCD_FORCED_CAPACITY_RELEASE:
+ default:
+ dev_err(&cxlr_dax->dev, "Unknown DC event %d\n",
+ notify_data->event);
+ return -ENXIO;
+ }
+}
+
+static struct dax_sparse_ops sparse_ops = {
+ .is_extent = is_region_extent,
+};
static int cxl_dax_region_probe(struct device *dev)
{
@@ -24,15 +75,18 @@ static int cxl_dax_region_probe(struct device *dev)
flags |= IORESOURCE_DAX_SPARSE_CAP;
dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
- PMD_SIZE, flags);
+ PMD_SIZE, flags, &sparse_ops);
if (!dax_region)
return -ENOMEM;
- if (cxlr->mode == CXL_REGION_DC)
+ if (cxlr->mode == CXL_REGION_DC) {
+ device_for_each_child(&cxlr_dax->dev, dax_region,
+ cxl_dax_add_resource);
/* Add empty seed dax device */
dev_size = 0;
- else
+ } else {
dev_size = range_len(&cxlr_dax->hpa_range);
+ }
data = (struct dev_dax_data) {
.dax_region = dax_region,
@@ -47,6 +101,7 @@ static int cxl_dax_region_probe(struct device *dev)
static struct cxl_driver cxl_dax_region_driver = {
.name = "cxl_dax_region",
.probe = cxl_dax_region_probe,
+ .notify = cxl_dax_region_notify,
.id = CXL_DEVICE_DAX_REGION,
.drv = {
.suppress_bind_attrs = true,
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 0867115aeef2e1b2d4c88b5c38b6648a404b1060..39fb587561f802b813c1763293820307520d6adf 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -16,6 +16,14 @@ struct inode *dax_inode(struct dax_device *dax_dev);
int dax_bus_init(void);
void dax_bus_exit(void);
+/**
+ * struct dax_sparse_ops - Operations for sparse regions
+ * @is_extent: return if the device is an extent
+ */
+struct dax_sparse_ops {
+ bool (*is_extent)(struct device *dev);
+};
+
/**
* struct dax_region - mapping infrastructure for dax devices
* @id: kernel-wide unique region for a memory range
@@ -27,6 +35,7 @@ void dax_bus_exit(void);
* @res: resource tree to track instance allocations
* @seed: allow userspace to find the first unbound seed device
* @youngest: allow userspace to find the most recently created device
+ * @sparse_ops: operations required for sparse regions
*/
struct dax_region {
int id;
@@ -38,6 +47,7 @@ struct dax_region {
struct resource res;
struct device *seed;
struct device *youngest;
+ struct dax_sparse_ops *sparse_ops;
};
/**
@@ -57,11 +67,13 @@ struct dax_mapping {
* @pgoff: page offset
* @range: resource-span
* @mapping: reference to the dax_mapping for this range
+ * @dax_resource: if not NULL, the sparse dax resource containing this range
*/
struct dev_dax_range {
unsigned long pgoff;
struct range range;
struct dax_mapping *mapping;
+ struct dax_resource *dax_resource;
};
/**
@@ -100,6 +112,34 @@ struct dev_dax {
*/
void run_dax(struct dax_device *dax_dev);
+/**
+ * struct dax_resource - For sparse regions; an active resource
+ * @region: dax_region this resource is in
+ * @res: resource
+ * @use_cnt: count the number of uses of this resource
+ *
+ * Changes to the dax_region and the dax_resources within it are protected by
+ * dax_region_rwsem
+ *
+ * dax_resource objects are not intended to be used outside the dax layer.
+ */
+struct dax_resource {
+ struct dax_region *region;
+ struct resource *res;
+ unsigned int use_cnt;
+};
+
+/*
+ * Similar to run_dax(), dax_region_{add,rm}_resource() and dax_avail_size() are
+ * exported but are not intended to be generic operations outside the dax
+ * subsystem. They are only generic between the dax layer and the dax drivers.
+ */
+int dax_region_add_resource(struct dax_region *dax_region, struct device *dev,
+ resource_size_t start, resource_size_t length);
+int dax_region_rm_resource(struct dax_region *dax_region,
+ struct device *dev);
+resource_size_t dax_avail_size(struct resource *dax_resource);
+
static inline struct dev_dax *to_dev_dax(struct device *dev)
{
return container_of(dev, struct dev_dax, dev);
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 5e7c53f18491622408adeab9d354ea869dbc71de..0eea65052874edc983690e1fe071ae2f7bc6aa7e 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -28,7 +28,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
mri = dev->platform_data;
dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
- mri->target_node, PMD_SIZE, flags);
+ mri->target_node, PMD_SIZE, flags, NULL);
if (!dax_region)
return -ENOMEM;
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index c8ebf4e281f2405034065014ecdb830afda66906..f927e855f240007276612674448c155d89494746 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -54,7 +54,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
range.start += offset;
dax_region = alloc_dax_region(dev, region_id, &range,
nd_region->target_node, le32_to_cpu(pfn_sb->align),
- IORESOURCE_DAX_STATIC);
+ IORESOURCE_DAX_STATIC, NULL);
if (!dax_region)
return ERR_PTR(-ENOMEM);
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 5385349f0b8a68cb390bc7c1270b11223d5667f8..ff44c03a95670e4700e148081639ef7aa91ddcd8 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -27,6 +27,9 @@ struct resource {
struct resource *parent, *sibling, *child;
};
+#define for_each_child_resource(parent, res) \
+ for (res = (parent)->child; res; res = res->sibling)
+
/*
* IO resources have these defined flags.
*
--
2.47.1
* [PATCH v8 18/21] cxl/region: Read existing extents on region creation
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening
Dynamic capacity device extents may be left in an accepted state on a
device due to an unexpected host crash. In this case it is expected
that the creation of a new region on top of a DC partition can read
those extents and surface them for continued use.
Once all endpoint decoders are part of a region and the region is being
realized, a read of the device's extent list can reveal these
previously accepted extents.
CXL r3.1 specifies the mailbox call Get Dynamic Capacity Extent List for
this purpose. The call returns all the extents for all dynamic capacity
partitions. If the fabric manager is adding extents to any DCD
partition, the extent list for the recovered region may change. In this
case the query must retry. Upon retry the query could encounter extents
which were accepted on a previous list query. Adding such extents is
ignored without error because they fall entirely within a previously
accepted extent. Instead, warn in this case to allow for differentiating
bad devices from this normal condition.
Latch any errors so they are bubbled up and the user is notified, even
if individual errors are rate limited or otherwise ignored.
The scan for existing extents races with the dax_cxl driver. This is
synchronized through the region device lock. Extents which are found
after the driver has loaded will surface through the normal notification
path while extents seen prior to the driver are read during driver load.
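In condensed form, the retry and error latching described above amounts
to the following sketch (the helper names are hypothetical; the real
code is __cxl_process_extent_list() in the mbox.c hunk below):

  /* Hypothetical sketch; not the actual implementation */
  static int read_extents_sketch(void)
  {
      u32 gen, start_gen = 0, read = 0, total = 0;
      int rc, latched_rc = 0;
      bool first = true;

      do {
          /* one Get Dynamic Capacity Extent List mailbox call */
          rc = get_extent_chunk(&read, &total, &gen);
          if (rc)
              return rc;

          if (first) {
              start_gen = gen;
              first = false;
          } else if (gen != start_gen) {
              return -EAGAIN; /* list changed underneath us; retry */
          }

          rc = add_chunk_extents(); /* validate/add returned extents */
          if (rc)
              latched_rc = rc; /* latch the error; keep reading */
      } while (read < total);

      return latched_rc; /* first failure still reaches the user */
  }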
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/cxl/core/core.h | 1 +
drivers/cxl/core/mbox.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++
drivers/cxl/core/region.c | 25 +++++++++++
drivers/cxl/cxlmem.h | 21 +++++++++
4 files changed, 155 insertions(+)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index fb49d00e53861a252eb47db7a82415d724da6701..884f7bc459e2147e0aaa9ab4544dbff4a4cdee8f 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -21,6 +21,7 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
return container_of(cxlds, struct cxl_memdev_state, cxlds);
}
+int cxl_process_extent_list(struct cxl_endpoint_decoder *cxled);
int cxl_region_invalidate_memregion(struct cxl_region *cxlr);
#ifdef CONFIG_CXL_REGION
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 5bce54da3bcfb934c0b7b0609fa7a961ee4854b7..e6789b59be1b361c1cfcb8d0e5b1ef64cd6555fd 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1692,6 +1692,114 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
}
EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, "CXL");
+/* Return -EAGAIN if the extent list changes while reading */
+static int __cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
+{
+ u32 current_index, total_read, total_expected, initial_gen_num;
+ struct cxl_memdev_state *mds = cxled_to_mds(cxled);
+ struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+ struct device *dev = mds->cxlds.dev;
+ struct cxl_mbox_cmd mbox_cmd;
+ u32 max_extent_count;
+ int latched_rc = 0;
+ bool first = true;
+
+ struct cxl_mbox_get_extent_out *extents __free(kvfree) =
+ kvmalloc(cxl_mbox->payload_size, GFP_KERNEL);
+ if (!extents)
+ return -ENOMEM;
+
+ total_read = 0;
+ current_index = 0;
+ total_expected = 0;
+ max_extent_count = (cxl_mbox->payload_size - sizeof(*extents)) /
+ sizeof(struct cxl_extent);
+ do {
+ struct cxl_mbox_get_extent_in get_extent;
+ u32 nr_returned, current_total, current_gen_num;
+ int rc;
+
+ get_extent = (struct cxl_mbox_get_extent_in) {
+ .extent_cnt = cpu_to_le32(max(max_extent_count,
+ total_expected - current_index)),
+ .start_extent_index = cpu_to_le32(current_index),
+ };
+
+ mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
+ .payload_in = &get_extent,
+ .size_in = sizeof(get_extent),
+ .size_out = cxl_mbox->payload_size,
+ .payload_out = extents,
+ .min_out = 1,
+ };
+
+ rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+ if (rc < 0)
+ return rc;
+
+ /* Save initial data */
+ if (first) {
+ total_expected = le32_to_cpu(extents->total_extent_count);
+ initial_gen_num = le32_to_cpu(extents->generation_num);
+ first = false;
+ }
+
+ nr_returned = le32_to_cpu(extents->returned_extent_count);
+ total_read += nr_returned;
+ current_total = le32_to_cpu(extents->total_extent_count);
+ current_gen_num = le32_to_cpu(extents->generation_num);
+
+ dev_dbg(dev, "Got extent list %d-%d of %d generation Num:%d\n",
+ current_index, total_read - 1, current_total, current_gen_num);
+
+ if (current_gen_num != initial_gen_num || total_expected != current_total) {
+ dev_warn(dev, "Extent list change detected; gen %u != %u : cnt %u != %u\n",
+ current_gen_num, initial_gen_num,
+ total_expected, current_total);
+ return -EAGAIN;
+ }
+
+ for (int i = 0; i < nr_returned; i++) {
+ struct cxl_extent *extent = &extents->extent[i];
+
+ dev_dbg(dev, "Processing extent %d/%d\n",
+ current_index + i, total_expected);
+
+ rc = validate_add_extent(mds, extent);
+ if (rc)
+ latched_rc = rc;
+ }
+
+ current_index += nr_returned;
+ } while (total_expected > total_read);
+
+ return latched_rc;
+}
+
+/**
+ * cxl_process_extent_list() - Read existing extents
+ * @cxled: Endpoint decoder which is part of a region
+ *
+ * Issue the Get Dynamic Capacity Extent List command to the device
+ * and add existing extents if found.
+ *
+ * A retry count of 10 is somewhat arbitrary; however, extent changes should
+ * be relatively rare while bringing up a region, so 10 should be plenty.
+ */
+#define CXL_READ_EXTENT_LIST_RETRY 10
+int cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
+{
+ int retry = CXL_READ_EXTENT_LIST_RETRY;
+ int rc;
+
+ do {
+ rc = __cxl_process_extent_list(cxled);
+ } while (rc == -EAGAIN && retry--);
+
+ return rc;
+}
+
static int add_dpa_res(struct device *dev, struct resource *parent,
struct resource *res, resource_size_t start,
resource_size_t size, const char *type)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index f7e47a82fa2bd1b245081428fc515fb464993aa5..ca22ac14218191fde314dbbcef3d945ea62c67c9 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3191,6 +3191,26 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
return rc;
}
+static int cxlr_add_existing_extents(struct cxl_region *cxlr)
+{
+ struct cxl_region_params *p = &cxlr->params;
+ int i, latched_rc = 0;
+
+ for (i = 0; i < p->nr_targets; i++) {
+ struct device *dev = &p->targets[i]->cxld.dev;
+ int rc;
+
+ rc = cxl_process_extent_list(p->targets[i]);
+ if (rc) {
+ dev_err(dev, "Existing extent processing failed %d\n",
+ rc);
+ latched_rc = rc;
+ }
+ }
+
+ return latched_rc;
+}
+
static void cxlr_dax_unregister(void *_cxlr_dax)
{
struct cxl_dax_region *cxlr_dax = _cxlr_dax;
@@ -3225,6 +3245,11 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
dev_name(dev));
+ if (cxlr->mode == CXL_REGION_DC) {
+ rc = cxlr_add_existing_extents(cxlr);
+ if (rc)
+ dev_err(&cxlr->dev, "Existing extent processing failed %d\n",
+ rc);
+ }
+
return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
cxlr_dax);
err:
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 73dee28bbd803a8f78686e833f8ef3492ca94e66..e7b9bd5bb4a96b0cdeb4bcf9c3b7ca1499d1cddd 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -627,6 +627,27 @@ struct cxl_mbox_dc_response {
} __packed extent_list[];
} __packed;
+/*
+ * Get Dynamic Capacity Extent List; Input Payload
+ * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
+ */
+struct cxl_mbox_get_extent_in {
+ __le32 extent_cnt;
+ __le32 start_extent_index;
+} __packed;
+
+/*
+ * Get Dynamic Capacity Extent List; Output Payload
+ * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
+ */
+struct cxl_mbox_get_extent_out {
+ __le32 returned_extent_count;
+ __le32 total_extent_count;
+ __le32 generation_num;
+ u8 rsvd[4];
+ struct cxl_extent extent[];
+} __packed;
+
struct cxl_mbox_get_supported_logs {
__le16 entries;
u8 rsvd[6];
--
2.47.1
* [PATCH v8 19/21] cxl/mem: Trace Dynamic capacity Event Record
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening
CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Records.
User space can use trace events for debugging of DC capacity changes.
Add DC trace points to the trace log.
Based on an original patch by Navneet Singh.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
drivers/cxl/core/mbox.c | 4 +++
drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index e6789b59be1b361c1cfcb8d0e5b1ef64cd6555fd..133ef4dbe2a320e17d425ee280750b9757357f68 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -991,6 +991,10 @@ static void __cxl_event_trace_record(const struct cxl_memdev *cxlmd,
ev_type = CXL_CPER_EVENT_DRAM;
else if (uuid_equal(uuid, &CXL_EVENT_MEM_MODULE_UUID))
ev_type = CXL_CPER_EVENT_MEM_MODULE;
+ else if (uuid_equal(uuid, &CXL_EVENT_DC_EVENT_UUID)) {
+ trace_cxl_dynamic_capacity(cxlmd, type, &record->event.dcd);
+ return;
+ }
cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
}
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index 8389a94adb1a681827209db46360d3d57c6672ce..ea819ea04a41a42636c1f612682a796a40ef5950 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -742,6 +742,71 @@ TRACE_EVENT(cxl_poison,
)
);
+/*
+ * Dynamic Capacity Event Record - DER
+ *
+ * CXL rev 3.1 section 8.2.9.2.1.6 Table 8-50
+ */
+
+#define CXL_DC_ADD_CAPACITY 0x00
+#define CXL_DC_REL_CAPACITY 0x01
+#define CXL_DC_FORCED_REL_CAPACITY 0x02
+#define CXL_DC_REG_CONF_UPDATED 0x03
+#define show_dc_evt_type(type) __print_symbolic(type, \
+ { CXL_DC_ADD_CAPACITY, "Add capacity"}, \
+ { CXL_DC_REL_CAPACITY, "Release capacity"}, \
+ { CXL_DC_FORCED_REL_CAPACITY, "Forced capacity release"}, \
+ { CXL_DC_REG_CONF_UPDATED, "Region Configuration Updated" } \
+)
+
+TRACE_EVENT(cxl_dynamic_capacity,
+
+ TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
+ struct cxl_event_dcd *rec),
+
+ TP_ARGS(cxlmd, log, rec),
+
+ TP_STRUCT__entry(
+ CXL_EVT_TP_entry
+
+ /* Dynamic capacity Event */
+ __field(u8, event_type)
+ __field(u16, hostid)
+ __field(u8, region_id)
+ __field(u64, dpa_start)
+ __field(u64, length)
+ __array(u8, tag, CXL_EXTENT_TAG_LEN)
+ __field(u16, sh_extent_seq)
+ ),
+
+ TP_fast_assign(
+ CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
+
+ /* Dynamic_capacity Event */
+ __entry->event_type = rec->event_type;
+
+ /* DCD event record data */
+ __entry->hostid = le16_to_cpu(rec->host_id);
+ __entry->region_id = rec->region_index;
+ __entry->dpa_start = le64_to_cpu(rec->extent.start_dpa);
+ __entry->length = le64_to_cpu(rec->extent.length);
+ memcpy(__entry->tag, &rec->extent.tag, CXL_EXTENT_TAG_LEN);
+ __entry->sh_extent_seq = le16_to_cpu(rec->extent.shared_extn_seq);
+ ),
+
+ CXL_EVT_TP_printk("event_type='%s' host_id='%d' region_id='%d' " \
+ "starting_dpa=%llx length=%llx tag=%pU " \
+ "shared_extent_sequence=%d",
+ show_dc_evt_type(__entry->event_type),
+ __entry->hostid,
+ __entry->region_id,
+ __entry->dpa_start,
+ __entry->length,
+ __entry->tag,
+ __entry->sh_extent_seq
+ )
+);
+
#endif /* _CXL_EVENTS_H */
#define TRACE_INCLUDE_FILE trace
--
2.47.1
* [PATCH v8 20/21] tools/testing/cxl: Make event logs dynamic
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening
The event log test was created with static arrays as an easy way to mock
events. Dynamic Capacity Device (DCD) test support requires that events
be generated dynamically when extents are created or destroyed.
The current event log test has specific checks for the number of events
seen, including log overflow.
Modify mock event logs to be dynamically allocated. Adjust array size
and mock event entry data to match the output expected by the existing
event test.
Use the static event data to create the dynamic events in the new logs
without inventing complex event injection for the previous tests.
Simplify log processing by using the event log array index as the
handle. Add a lock to manage the concurrency required when user space
is allowed to control DCD extents.
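For reference, the 1-based wrapping handle arithmetic the dynamic logs
rely on reduces to the following sketch (a restatement of
event_inc_handle() in the diff below; names and the array size here are
illustrative):

  #define EVENT_ARRAY_SIZE 17 /* CNT_MAX + 1; slot 0 is never used */

  /* Advance a 1-based handle, wrapping around and skipping 0 */
  static u16 next_handle(u16 handle)
  {
      handle = (handle + 1) % EVENT_ARRAY_SIZE;
      return handle ? handle : 1;
  }

  /*
   * As in any ring buffer one slot stays empty: the log is full when
   * advancing last_handle would collide with current_handle.
   */
  static bool log_full(u16 current_handle, u16 last_handle)
  {
      return next_handle(last_handle) == current_handle;
  }

mes_add_event() below performs exactly this full check before storing a
new record, freeing the event and counting an overflow instead.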
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
tools/testing/cxl/test/mem.c | 268 ++++++++++++++++++++++++++-----------------
1 file changed, 162 insertions(+), 106 deletions(-)
diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index 347c1e7b37bdfa9c70c6b469ad987baa48ca35e4..33a16baea7c346030e60035fe4c21a0048736e75 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -126,18 +126,26 @@ static struct {
#define PASS_TRY_LIMIT 3
-#define CXL_TEST_EVENT_CNT_MAX 15
+#define CXL_TEST_EVENT_CNT_MAX 16
+/* 1 extra slot to accommodate that handles can't be 0 */
+#define CXL_TEST_EVENT_ARRAY_SIZE (CXL_TEST_EVENT_CNT_MAX + 1)
/* Set a number of events to return at a time for simulation. */
#define CXL_TEST_EVENT_RET_MAX 4
+/*
+ * @last_handle: last handle (index) to have an entry stored
+ * @current_handle: current handle (index) to be returned to the user on get_event
+ * @nr_overflow: number of events added past the log size
+ * @lock: protect these state variables
+ * @events: array of pending events to be returned.
+ */
struct mock_event_log {
- u16 clear_idx;
- u16 cur_idx;
- u16 nr_events;
+ u16 last_handle;
+ u16 current_handle;
u16 nr_overflow;
- u16 overflow_reset;
- struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX];
+ rwlock_t lock;
+ struct cxl_event_record_raw *events[CXL_TEST_EVENT_ARRAY_SIZE];
};
struct mock_event_store {
@@ -172,56 +180,65 @@ static struct mock_event_log *event_find_log(struct device *dev, int log_type)
return &mdata->mes.mock_logs[log_type];
}
-static struct cxl_event_record_raw *event_get_current(struct mock_event_log *log)
-{
- return log->events[log->cur_idx];
-}
-
-static void event_reset_log(struct mock_event_log *log)
-{
- log->cur_idx = 0;
- log->clear_idx = 0;
- log->nr_overflow = log->overflow_reset;
-}
-
/* Handle can never be 0 use 1 based indexing for handle */
-static u16 event_get_clear_handle(struct mock_event_log *log)
+static u16 event_inc_handle(u16 handle)
{
- return log->clear_idx + 1;
+ handle = (handle + 1) % CXL_TEST_EVENT_ARRAY_SIZE;
+ if (handle == 0)
+ handle = 1;
+ return handle;
}
-/* Handle can never be 0 use 1 based indexing for handle */
-static __le16 event_get_cur_event_handle(struct mock_event_log *log)
-{
- u16 cur_handle = log->cur_idx + 1;
-
- return cpu_to_le16(cur_handle);
-}
-
-static bool event_log_empty(struct mock_event_log *log)
-{
- return log->cur_idx == log->nr_events;
-}
-
-static void mes_add_event(struct mock_event_store *mes,
+/* Add the event or free it on overflow */
+static void mes_add_event(struct cxl_mockmem_data *mdata,
enum cxl_event_log_type log_type,
struct cxl_event_record_raw *event)
{
+ struct device *dev = mdata->mds->cxlds.dev;
struct mock_event_log *log;
if (WARN_ON(log_type >= CXL_EVENT_TYPE_MAX))
return;
- log = &mes->mock_logs[log_type];
+ log = &mdata->mes.mock_logs[log_type];
+
+ guard(write_lock)(&log->lock);
- if ((log->nr_events + 1) > CXL_TEST_EVENT_CNT_MAX) {
+ dev_dbg(dev, "Add log %d cur %d last %d\n",
+ log_type, log->current_handle, log->last_handle);
+
+ /* Check next buffer */
+ if (event_inc_handle(log->last_handle) == log->current_handle) {
log->nr_overflow++;
- log->overflow_reset = log->nr_overflow;
+ dev_dbg(dev, "Overflowing log %d nr %d\n",
+ log_type, log->nr_overflow);
+ devm_kfree(dev, event);
return;
}
- log->events[log->nr_events] = event;
- log->nr_events++;
+ dev_dbg(dev, "Log %d; handle %u\n", log_type, log->last_handle);
+ event->event.generic.hdr.handle = cpu_to_le16(log->last_handle);
+ log->events[log->last_handle] = event;
+ log->last_handle = event_inc_handle(log->last_handle);
+}
+
+static void mes_del_event(struct device *dev,
+ struct mock_event_log *log,
+ u16 handle)
+{
+ struct cxl_event_record_raw *record;
+
+ lockdep_assert(lockdep_is_held(&log->lock));
+
+ dev_dbg(dev, "Clearing event %u; record %u\n",
+ handle, log->current_handle);
+ record = log->events[handle];
+ if (!record)
+ dev_err(dev, "Mock event index %u empty?\n", handle);
+
+ log->events[handle] = NULL;
+ log->current_handle = event_inc_handle(log->current_handle);
+ devm_kfree(dev, record);
}
/*
@@ -234,7 +251,7 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
{
struct cxl_get_event_payload *pl;
struct mock_event_log *log;
- u16 nr_overflow;
+ u16 handle;
u8 log_type;
int i;
@@ -255,29 +272,38 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
memset(cmd->payload_out, 0, struct_size(pl, records, 0));
log = event_find_log(dev, log_type);
- if (!log || event_log_empty(log))
+ if (!log)
return 0;
pl = cmd->payload_out;
- for (i = 0; i < ret_limit && !event_log_empty(log); i++) {
- memcpy(&pl->records[i], event_get_current(log),
- sizeof(pl->records[i]));
- pl->records[i].event.generic.hdr.handle =
- event_get_cur_event_handle(log);
- log->cur_idx++;
+ guard(read_lock)(&log->lock);
+
+ handle = log->current_handle;
+ dev_dbg(dev, "Get log %d handle %u last %u\n",
+ log_type, handle, log->last_handle);
+ for (i = 0; i < ret_limit && handle != log->last_handle;
+ i++, handle = event_inc_handle(handle)) {
+ struct cxl_event_record_raw *cur;
+
+ cur = log->events[handle];
+ dev_dbg(dev, "Sending event log %d handle %d idx %u\n",
+ log_type, le16_to_cpu(cur->event.generic.hdr.handle),
+ handle);
+ memcpy(&pl->records[i], cur, sizeof(pl->records[i]));
+ pl->records[i].event.generic.hdr.handle = cpu_to_le16(handle);
}
cmd->size_out = struct_size(pl, records, i);
pl->record_count = cpu_to_le16(i);
- if (!event_log_empty(log))
+ if (handle != log->last_handle)
pl->flags |= CXL_GET_EVENT_FLAG_MORE_RECORDS;
if (log->nr_overflow) {
u64 ns;
pl->flags |= CXL_GET_EVENT_FLAG_OVERFLOW;
- pl->overflow_err_count = cpu_to_le16(nr_overflow);
+ pl->overflow_err_count = cpu_to_le16(log->nr_overflow);
ns = ktime_get_real_ns();
ns -= 5000000000; /* 5s ago */
pl->first_overflow_timestamp = cpu_to_le64(ns);
@@ -292,8 +318,8 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
{
struct cxl_mbox_clear_event_payload *pl = cmd->payload_in;
- struct mock_event_log *log;
u8 log_type = pl->event_log;
+ struct mock_event_log *log;
u16 handle;
int nr;
@@ -304,23 +330,20 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
if (!log)
return 0; /* No mock data in this log */
- /*
- * This check is technically not invalid per the specification AFAICS.
- * (The host could 'guess' handles and clear them in order).
- * However, this is not good behavior for the host so test it.
- */
- if (log->clear_idx + pl->nr_recs > log->cur_idx) {
- dev_err(dev,
- "Attempting to clear more events than returned!\n");
- return -EINVAL;
- }
+ guard(write_lock)(&log->lock);
/* Check handle order prior to clearing events */
- for (nr = 0, handle = event_get_clear_handle(log);
- nr < pl->nr_recs;
- nr++, handle++) {
+ handle = log->current_handle;
+ for (nr = 0; nr < pl->nr_recs && handle != log->last_handle;
+ nr++, handle = event_inc_handle(handle)) {
+
+ dev_dbg(dev, "Checking clear of %d handle %u plhandle %u\n",
+ log_type, handle,
+ le16_to_cpu(pl->handles[nr]));
+
if (handle != le16_to_cpu(pl->handles[nr])) {
- dev_err(dev, "Clearing events out of order\n");
+ dev_err(dev, "Clearing events out of order %u %u\n",
+ handle, le16_to_cpu(pl->handles[nr]));
return -EINVAL;
}
}
@@ -329,25 +352,12 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
log->nr_overflow = 0;
/* Clear events */
- log->clear_idx += pl->nr_recs;
- return 0;
-}
-
-static void cxl_mock_event_trigger(struct device *dev)
-{
- struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
- struct mock_event_store *mes = &mdata->mes;
- int i;
+ for (nr = 0; nr < pl->nr_recs; nr++)
+ mes_del_event(dev, log, le16_to_cpu(pl->handles[nr]));
+ dev_dbg(dev, "Delete log %d cur %d last %d\n",
+ log_type, log->current_handle, log->last_handle);
- for (i = CXL_EVENT_TYPE_INFO; i < CXL_EVENT_TYPE_MAX; i++) {
- struct mock_event_log *log;
-
- log = event_find_log(dev, i);
- if (log)
- event_reset_log(log);
- }
-
- cxl_mem_get_event_records(mdata->mds, mes->ev_status);
+ return 0;
}
struct cxl_event_record_raw maint_needed = {
@@ -476,8 +486,27 @@ static int mock_set_timestamp(struct cxl_dev_state *cxlds,
return 0;
}
-static void cxl_mock_add_event_logs(struct mock_event_store *mes)
+/* Create a dynamically allocated event out of a statically defined event. */
+static void add_event_from_static(struct cxl_mockmem_data *mdata,
+ enum cxl_event_log_type log_type,
+ struct cxl_event_record_raw *raw)
+{
+ struct device *dev = mdata->mds->cxlds.dev;
+ struct cxl_event_record_raw *rec;
+
+ rec = devm_kmemdup(dev, raw, sizeof(*rec), GFP_KERNEL);
+ if (!rec) {
+ dev_err(dev, "Failed to alloc event for log\n");
+ return;
+ }
+ mes_add_event(mdata, log_type, rec);
+}
+
+static void cxl_mock_add_event_logs(struct cxl_mockmem_data *mdata)
{
+ struct mock_event_store *mes = &mdata->mes;
+ struct device *dev = mdata->mds->cxlds.dev;
+
put_unaligned_le16(CXL_GMER_VALID_CHANNEL | CXL_GMER_VALID_RANK,
&gen_media.rec.media_hdr.validity_flags);
@@ -485,43 +514,60 @@ static void cxl_mock_add_event_logs(struct mock_event_store *mes)
CXL_DER_VALID_BANK | CXL_DER_VALID_COLUMN,
&dram.rec.media_hdr.validity_flags);
- mes_add_event(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
- mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+ dev_dbg(dev, "Generating fake event logs %d\n",
+ CXL_EVENT_TYPE_INFO);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_INFO, &maint_needed);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
(struct cxl_event_record_raw *)&gen_media);
- mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
(struct cxl_event_record_raw *)&mem_module);
mes->ev_status |= CXLDEV_EVENT_STATUS_INFO;
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ dev_dbg(dev, "Generating fake event logs %d\n",
+ CXL_EVENT_TYPE_FAIL);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &maint_needed);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
+ (struct cxl_event_record_raw *)&mem_module);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&dram);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&gen_media);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&mem_module);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&dram);
/* Overflow this log */
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
mes->ev_status |= CXLDEV_EVENT_STATUS_FAIL;
- mes_add_event(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FATAL,
+ dev_dbg(dev, "Generating fake event logs %d\n",
+ CXL_EVENT_TYPE_FATAL);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL,
(struct cxl_event_record_raw *)&dram);
mes->ev_status |= CXLDEV_EVENT_STATUS_FATAL;
}
+static void cxl_mock_event_trigger(struct device *dev)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct mock_event_store *mes = &mdata->mes;
+
+ cxl_mock_add_event_logs(mdata);
+ cxl_mem_get_event_records(mdata->mds, mes->ev_status);
+}
+
static int mock_gsl(struct cxl_mbox_cmd *cmd)
{
if (cmd->size_out < sizeof(mock_gsl_payload))
@@ -1469,6 +1515,14 @@ static int cxl_mock_mailbox_create(struct cxl_dev_state *cxlds)
return 0;
}
+static void init_event_log(struct mock_event_log *log)
+{
+ rwlock_init(&log->lock);
+ /* Handle can never be 0 use 1 based indexing for handle */
+ log->current_handle = 1;
+ log->last_handle = 1;
+}
+
static int cxl_mock_mem_probe(struct platform_device *pdev)
{
struct device *dev = &pdev->dev;
@@ -1541,7 +1595,9 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
if (rc)
return rc;
- cxl_mock_add_event_logs(&mdata->mes);
+ for (int i = 0; i < CXL_EVENT_TYPE_MAX; i++)
+ init_event_log(&mdata->mes.mock_logs[i]);
+ cxl_mock_add_event_logs(mdata);
cxlmd = devm_cxl_add_memdev(&pdev->dev, cxlds);
if (IS_ERR(cxlmd))
--
2.47.1
* [PATCH v8 21/21] tools/testing/cxl: Add DC Regions to mock mem data
From: Ira Weiny @ 2024-12-11 3:42 UTC (permalink / raw)
To: Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening
cxl_test provides a good way to ensure quick smoke and regression
testing. The complexity of Dynamic Capacity (DC) extent processing as
well as the complexity of the new sparse DAX regions can mostly be
tested through cxl_test. This includes management of sparse regions and
DAX devices on those regions; the management of extent device lifetimes;
and the processing of DCD events.
The only missing functionality from this test is actual interrupt
processing.
Mock memory devices can easily mock DC information and manage fake
extent data.
Define mock_dc_region information within the mock memory data. Add
sysfs entries on the mock device to inject and delete extents.
The inject format is <start>:<length>:<tag>:<more_flag>
The delete format is <start>:<length>
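For illustration, a store() handler could parse the inject string along
these lines (a minimal sketch; the helper is hypothetical and the actual
attribute implementation may differ):

  /* Hypothetical parse of "<start>:<length>:<tag>:<more_flag>" */
  static int parse_inject_sketch(char *buf, u64 *start, u64 *length,
                                 char *tag, bool *more)
  {
      char *tok;

      tok = strsep(&buf, ":");
      if (!tok || kstrtou64(tok, 0, start))
          return -EINVAL;

      tok = strsep(&buf, ":");
      if (!tok || kstrtou64(tok, 0, length))
          return -EINVAL;

      tok = strsep(&buf, ":");
      if (!tok)
          return -EINVAL;
      strscpy(tag, tok, CXL_EXTENT_TAG_LEN);

      /* trailing newline handling omitted for brevity */
      tok = strsep(&buf, ":");
      if (!tok)
          return -EINVAL;
      *more = !strcmp(tok, "1");

      return 0;
  }

The delete attribute would only need the first two fields.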
Directly call the event irq callback to simulate interrupts and process
the test extents.
Add DC mailbox commands to the CEL and implement those commands.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes:
[iweiny: expand test realism to allow the host to reject extents properly]
---
tools/testing/cxl/test/mem.c | 751 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 751 insertions(+)
diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index 33a16baea7c346030e60035fe4c21a0048736e75..ce18ec8f723cdeb6d84e793743bc35ed037b7367 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -20,6 +20,7 @@
#define FW_SLOTS 3
#define DEV_SIZE SZ_2G
#define EFFECT(x) (1U << x)
+#define BASE_DYNAMIC_CAP_DPA DEV_SIZE
#define MOCK_INJECT_DEV_MAX 8
#define MOCK_INJECT_TEST_MAX 128
@@ -97,6 +98,22 @@ static struct cxl_cel_entry mock_cel[] = {
EFFECT(SECURITY_CHANGE_IMMEDIATE) |
EFFECT(BACKGROUND_OP)),
},
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_CONFIG),
+ .effect = CXL_CMD_EFFECT_NONE,
+ },
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_EXTENT_LIST),
+ .effect = CXL_CMD_EFFECT_NONE,
+ },
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_ADD_DC_RESPONSE),
+ .effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+ },
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_RELEASE_DC),
+ .effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+ },
};
/* See CXL 2.0 Table 181 Get Health Info Output Payload */
@@ -153,6 +170,7 @@ struct mock_event_store {
u32 ev_status;
};
+#define NUM_MOCK_DC_REGIONS 2
struct cxl_mockmem_data {
void *lsa;
void *fw;
@@ -169,6 +187,18 @@ struct cxl_mockmem_data {
u8 event_buf[SZ_4K];
u64 timestamp;
unsigned long sanitize_timeout;
+ struct cxl_dc_region_config dc_regions[NUM_MOCK_DC_REGIONS];
+ u32 dc_ext_generation;
+ struct mutex ext_lock;
+ /*
+ * Extents are in 1 of 3 states
+ * FM (sysfs added but not sent to the host yet)
+ * sent (sent to the host but not accepted)
+ * accepted (by the host)
+ */
+ struct xarray dc_fm_extents;
+ struct xarray dc_sent_extents;
+ struct xarray dc_accepted_exts;
};
static struct mock_event_log *event_find_log(struct device *dev, int log_type)
@@ -568,6 +598,251 @@ static void cxl_mock_event_trigger(struct device *dev)
cxl_mem_get_event_records(mdata->mds, mes->ev_status);
}
+struct cxl_extent_data {
+ u64 dpa_start;
+ u64 length;
+ u8 tag[CXL_EXTENT_TAG_LEN];
+ bool shared;
+};
+
+static int __devm_add_extent(struct device *dev, struct xarray *array,
+ u64 start, u64 length, const char *tag,
+ bool shared)
+{
+ struct cxl_extent_data *extent;
+
+ extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
+ if (!extent)
+ return -ENOMEM;
+
+ extent->dpa_start = start;
+ extent->length = length;
+ memcpy(extent->tag, tag, min(sizeof(extent->tag), strlen(tag)));
+ extent->shared = shared;
+
+ if (xa_insert(array, start, extent, GFP_KERNEL)) {
+ devm_kfree(dev, extent);
+ dev_err(dev, "Failed xarry insert %#llx\n", start);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int devm_add_fm_extent(struct device *dev, u64 start, u64 length,
+ const char *tag, bool shared)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+
+ guard(mutex)(&mdata->ext_lock);
+ return __devm_add_extent(dev, &mdata->dc_fm_extents, start, length,
+ tag, shared);
+}
+
+/* It is known that ext and the new range are not equal */
+static struct cxl_extent_data *
+split_ext(struct device *dev, struct xarray *array,
+ struct cxl_extent_data *ext, u64 start, u64 length)
+{
+ u64 new_start, new_length;
+
+ if (ext->dpa_start == start) {
+ new_start = start + length;
+ new_length = (ext->dpa_start + ext->length) - new_start;
+
+ if (__devm_add_extent(dev, array, new_start, new_length,
+ ext->tag, false))
+ return NULL;
+
+ ext = xa_erase(array, ext->dpa_start);
+ if (__devm_add_extent(dev, array, start, length, ext->tag,
+ false))
+ return NULL;
+
+ return xa_load(array, start);
+ }
+
+ /* ext->dpa_start != start */
+
+ if (__devm_add_extent(dev, array, start, length, ext->tag, false))
+ return NULL;
+
+ new_start = ext->dpa_start;
+ new_length = start - ext->dpa_start;
+
+ ext = xa_erase(array, ext->dpa_start);
+ if (__devm_add_extent(dev, array, new_start, new_length, ext->tag,
+ false))
+ return NULL;
+
+ return xa_load(array, start);
+}
+
+/*
+ * Do not handle extents which are not inside a single extent sent to
+ * the host.
+ */
+static struct cxl_extent_data *
+find_create_ext(struct device *dev, struct xarray *array, u64 start, u64 length)
+{
+ struct cxl_extent_data *ext;
+ unsigned long index;
+
+ xa_for_each(array, index, ext) {
+ u64 end = start + length;
+
+ /* skip extents which do not contain start */
+ if (start < ext->dpa_start ||
+ (ext->dpa_start + ext->length) <= start)
+ continue;
+
+ if (end <= ext->dpa_start ||
+ (ext->dpa_start + ext->length) < end) {
+ dev_err(dev, "Invalid range %#llx-%#llx\n", start,
+ end);
+ return NULL;
+ }
+
+ break;
+ }
+
+ if (!ext)
+ return NULL;
+
+ if (start == ext->dpa_start && length == ext->length)
+ return ext;
+
+ return split_ext(dev, array, ext, start, length);
+}
+
+static int dc_accept_extent(struct device *dev, u64 start, u64 length)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *ext;
+
+ dev_dbg(dev, "Host accepting extent %#llx\n", start);
+ mdata->dc_ext_generation++;
+
+ lockdep_assert_held(&mdata->ext_lock);
+ ext = find_create_ext(dev, &mdata->dc_sent_extents, start, length);
+ if (!ext) {
+ dev_err(dev, "Extent %#llx-%#llx not found\n",
+ start, start + length);
+ return -ENOENT;
+ }
+ ext = xa_erase(&mdata->dc_sent_extents, ext->dpa_start);
+ return xa_insert(&mdata->dc_accepted_exts, start, ext, GFP_KERNEL);
+}
+
+static void release_dc_ext(void *md)
+{
+ struct cxl_mockmem_data *mdata = md;
+
+ xa_destroy(&mdata->dc_fm_extents);
+ xa_destroy(&mdata->dc_sent_extents);
+ xa_destroy(&mdata->dc_accepted_exts);
+}
+
+/* Pretend to have some previous accepted extents */
+struct pre_ext_info {
+ u64 offset;
+ u64 length;
+} pre_ext_info[] = {
+ {
+ .offset = SZ_128M,
+ .length = SZ_64M,
+ },
+ {
+ .offset = SZ_256M,
+ .length = SZ_64M,
+ },
+};
+
+static int devm_add_sent_extent(struct device *dev, u64 start, u64 length,
+ const char *tag, bool shared)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+
+ lockdep_assert_held(&mdata->ext_lock);
+ return __devm_add_extent(dev, &mdata->dc_sent_extents, start, length,
+ tag, shared);
+}
+
+static int inject_prev_extents(struct device *dev, u64 base_dpa)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ int rc;
+
+ dev_dbg(dev, "Adding %ld pre-extents for testing\n",
+ ARRAY_SIZE(pre_ext_info));
+
+ guard(mutex)(&mdata->ext_lock);
+ for (int i = 0; i < ARRAY_SIZE(pre_ext_info); i++) {
+ u64 ext_dpa = base_dpa + pre_ext_info[i].offset;
+ u64 ext_len = pre_ext_info[i].length;
+
+ dev_dbg(dev, "Adding pre-extent DPA:%#llx LEN:%#llx\n",
+ ext_dpa, ext_len);
+
+ rc = devm_add_sent_extent(dev, ext_dpa, ext_len, "", false);
+ if (rc) {
+ dev_err(dev, "Failed to add pre-extent DPA:%#llx LEN:%#llx; %d\n",
+ ext_dpa, ext_len, rc);
+ return rc;
+ }
+
+ rc = dc_accept_extent(dev, ext_dpa, ext_len);
+ if (rc)
+ return rc;
+ }
+ return 0;
+}
+
+static int cxl_mock_dc_region_setup(struct device *dev)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ u64 base_dpa = BASE_DYNAMIC_CAP_DPA;
+ u32 dsmad_handle = 0xFADE;
+ u64 decode_length = SZ_512M;
+ u64 block_size = SZ_512;
+ u64 length = SZ_512M;
+ int rc;
+
+ mutex_init(&mdata->ext_lock);
+ xa_init(&mdata->dc_fm_extents);
+ xa_init(&mdata->dc_sent_extents);
+ xa_init(&mdata->dc_accepted_exts);
+
+ rc = devm_add_action_or_reset(dev, release_dc_ext, mdata);
+ if (rc)
+ return rc;
+
+ for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
+ struct cxl_dc_region_config *conf = &mdata->dc_regions[i];
+
+ dev_dbg(dev, "Creating DC region DC%d DPA:%#llx LEN:%#llx\n",
+ i, base_dpa, length);
+
+ conf->region_base = cpu_to_le64(base_dpa);
+ conf->region_decode_length = cpu_to_le64(decode_length /
+ CXL_CAPACITY_MULTIPLIER);
+ conf->region_length = cpu_to_le64(length);
+ conf->region_block_size = cpu_to_le64(block_size);
+ conf->region_dsmad_handle = cpu_to_le32(dsmad_handle);
+ dsmad_handle++;
+
+ rc = inject_prev_extents(dev, base_dpa);
+ if (rc) {
+ dev_err(dev, "Failed to add pre-extents for DC%d\n", i);
+ return rc;
+ }
+
+ base_dpa += decode_length;
+ }
+
+ return 0;
+}
+
static int mock_gsl(struct cxl_mbox_cmd *cmd)
{
if (cmd->size_out < sizeof(mock_gsl_payload))
@@ -1383,6 +1658,192 @@ static int mock_activate_fw(struct cxl_mockmem_data *mdata,
return -EINVAL;
}
+static int mock_get_dc_config(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mbox_get_dc_config_in *dc_config = cmd->payload_in;
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ u8 region_requested, region_start_idx, region_ret_cnt;
+ struct cxl_mbox_get_dc_config_out *resp;
+ int i;
+
+ region_requested = min(dc_config->region_count, NUM_MOCK_DC_REGIONS);
+
+ if (cmd->size_out < struct_size(resp, region, region_requested))
+ return -EINVAL;
+
+ memset(cmd->payload_out, 0, cmd->size_out);
+ resp = cmd->payload_out;
+
+ region_start_idx = dc_config->start_region_index;
+ region_ret_cnt = 0;
+ for (i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
+ if (i >= region_start_idx) {
+ memcpy(&resp->region[region_ret_cnt],
+ &mdata->dc_regions[i],
+ sizeof(resp->region[region_ret_cnt]));
+ region_ret_cnt++;
+ }
+ }
+ resp->avail_region_count = NUM_MOCK_DC_REGIONS;
+ resp->regions_returned = region_ret_cnt;
+
+ dev_dbg(dev, "Returning %d dc regions\n", region_ret_cnt);
+ return 0;
+}
+
+static int mock_get_dc_extent_list(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mbox_get_extent_out *resp = cmd->payload_out;
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_mbox_get_extent_in *get = cmd->payload_in;
+ u32 total_avail = 0, total_ret = 0;
+ struct cxl_extent_data *ext;
+ u32 ext_count, start_idx, ext_idx = 0;
+ unsigned long i;
+
+ ext_count = le32_to_cpu(get->extent_cnt);
+ start_idx = le32_to_cpu(get->start_extent_index);
+
+ memset(resp, 0, sizeof(*resp));
+
+ guard(mutex)(&mdata->ext_lock);
+ /*
+ * Total available needs to be calculated and returned regardless of
+ * how many can actually be returned.
+ */
+ xa_for_each(&mdata->dc_accepted_exts, i, ext)
+ total_avail++;
+
+ if (start_idx > total_avail)
+ return -EINVAL;
+
+ xa_for_each(&mdata->dc_accepted_exts, i, ext) {
+ if (total_ret >= ext_count)
+ break;
+
+ /* skip extents prior to the requested start index */
+ if (ext_idx++ < start_idx)
+ continue;
+
+ resp->extent[total_ret].start_dpa =
+ cpu_to_le64(ext->dpa_start);
+ resp->extent[total_ret].length =
+ cpu_to_le64(ext->length);
+ memcpy(&resp->extent[total_ret].tag, ext->tag,
+ sizeof(resp->extent[total_ret].tag));
+ total_ret++;
+ }
+
+ resp->returned_extent_count = cpu_to_le32(total_ret);
+ resp->total_extent_count = cpu_to_le32(total_avail);
+ resp->generation_num = cpu_to_le32(mdata->dc_ext_generation);
+
+ dev_dbg(dev, "Returning %d extents of %d total\n",
+ total_ret, total_avail);
+
+ return 0;
+}
+
+static void dc_clear_sent(struct device *dev)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *ext;
+ unsigned long index;
+
+ lockdep_assert_held(&mdata->ext_lock);
+
+ /* Any extents not accepted must be cleared */
+ xa_for_each(&mdata->dc_sent_extents, index, ext) {
+ dev_dbg(dev, "Host rejected extent %#llx\n", ext->dpa_start);
+ xa_erase(&mdata->dc_sent_extents, ext->dpa_start);
+ }
+}
+
+static int mock_add_dc_response(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_mbox_dc_response *req = cmd->payload_in;
+ u32 list_size = le32_to_cpu(req->extent_list_size);
+
+ guard(mutex)(&mdata->ext_lock);
+ for (int i = 0; i < list_size; i++) {
+ u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+ u64 length = le64_to_cpu(req->extent_list[i].length);
+ int rc;
+
+ rc = dc_accept_extent(dev, start, length);
+ if (rc)
+ return rc;
+ }
+
+ dc_clear_sent(dev);
+ return 0;
+}
+
+static void dc_delete_extent(struct device *dev, unsigned long long start,
+ unsigned long long length)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ unsigned long long end = start + length;
+ struct cxl_extent_data *ext;
+ unsigned long index;
+
+ dev_dbg(dev, "Deleting extent at %#llx len:%#llx\n", start, length);
+
+ guard(mutex)(&mdata->ext_lock);
+ xa_for_each(&mdata->dc_fm_extents, index, ext) {
+ u64 extent_end = ext->dpa_start + ext->length;
+
+ /*
+ * Any extent which 'touches' the released delete range will be
+ * removed.
+ */
+ if (ext->dpa_start < end && start < extent_end)
+ xa_erase(&mdata->dc_fm_extents, ext->dpa_start);
+ }
+
+ /*
+ * If the extent was accepted let it be for the host to drop
+ * later.
+ */
+}
+
+static int release_accepted_extent(struct device *dev, u64 start, u64 length)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *ext;
+
+ guard(mutex)(&mdata->ext_lock);
+ ext = find_create_ext(dev, &mdata->dc_accepted_exts, start, length);
+ if (!ext) {
+ dev_err(dev, "Extent %#llx not in accepted state\n", start);
+ return -EINVAL;
+ }
+ xa_erase(&mdata->dc_accepted_exts, ext->dpa_start);
+ mdata->dc_ext_generation++;
+
+ return 0;
+}
+
+static int mock_dc_release(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mbox_dc_response *req = cmd->payload_in;
+ u32 list_size = le32_to_cpu(req->extent_list_size);
+
+ for (int i = 0; i < list_size; i++) {
+ u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+ u64 length = le64_to_cpu(req->extent_list[i].length);
+
+ dev_dbg(dev, "Extent %#llx released by host\n", start);
+ release_accepted_extent(dev, start, length);
+ }
+
+ return 0;
+}
+
static int cxl_mock_mbox_send(struct cxl_mailbox *cxl_mbox,
struct cxl_mbox_cmd *cmd)
{
@@ -1468,6 +1929,18 @@ static int cxl_mock_mbox_send(struct cxl_mailbox *cxl_mbox,
case CXL_MBOX_OP_ACTIVATE_FW:
rc = mock_activate_fw(mdata, cmd);
break;
+ case CXL_MBOX_OP_GET_DC_CONFIG:
+ rc = mock_get_dc_config(dev, cmd);
+ break;
+ case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+ rc = mock_get_dc_extent_list(dev, cmd);
+ break;
+ case CXL_MBOX_OP_ADD_DC_RESPONSE:
+ rc = mock_add_dc_response(dev, cmd);
+ break;
+ case CXL_MBOX_OP_RELEASE_DC:
+ rc = mock_dc_release(dev, cmd);
+ break;
default:
break;
}
@@ -1538,6 +2011,10 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
return -ENOMEM;
dev_set_drvdata(dev, mdata);
+ rc = cxl_mock_dc_region_setup(dev);
+ if (rc)
+ return rc;
+
mdata->lsa = vmalloc(LSA_SIZE);
if (!mdata->lsa)
return -ENOMEM;
@@ -1591,6 +2068,10 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
if (rc)
return rc;
+ rc = cxl_dev_dynamic_capacity_identify(mds);
+ if (rc)
+ return rc;
+
rc = cxl_mem_create_range_info(mds);
if (rc)
return rc;
@@ -1706,11 +2187,281 @@ static ssize_t sanitize_timeout_store(struct device *dev,
static DEVICE_ATTR_RW(sanitize_timeout);
+/* Return false if the proposed extent would break the test code */
+static bool new_extent_valid(struct device *dev, size_t new_start,
+ size_t new_len)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *extent;
+ size_t new_end, i;
+
+ if (!new_len)
+ return false;
+
+ new_end = new_start + new_len;
+
+ dev_dbg(dev, "New extent %zx-%zx\n", new_start, new_end);
+
+ guard(mutex)(&mdata->ext_lock);
+ dev_dbg(dev, "Checking extents starts...\n");
+ xa_for_each(&mdata->dc_fm_extents, i, extent) {
+ if (extent->dpa_start == new_start)
+ return false;
+ }
+
+ dev_dbg(dev, "Checking sent extents starts...\n");
+ xa_for_each(&mdata->dc_sent_extents, i, extent) {
+ if (extent->dpa_start == new_start)
+ return false;
+ }
+
+ dev_dbg(dev, "Checking accepted extents starts...\n");
+ xa_for_each(&mdata->dc_accepted_exts, i, extent) {
+ if (extent->dpa_start == new_start)
+ return false;
+ }
+
+ return true;
+}
+
+struct cxl_test_dcd {
+ uuid_t id;
+ struct cxl_event_dcd rec;
+} __packed;
+
+static struct cxl_test_dcd dcd_event_rec_template = {
+ .id = CXL_EVENT_DC_EVENT_UUID,
+ .rec = {
+ .hdr = {
+ .length = sizeof(struct cxl_test_dcd),
+ },
+ },
+};
+
+static int log_dc_event(struct cxl_mockmem_data *mdata, enum dc_event type,
+ u64 start, u64 length, const char *tag_str, bool more)
+{
+ struct device *dev = mdata->mds->cxlds.dev;
+ struct cxl_test_dcd *dcd_event;
+
+ dev_dbg(dev, "mock device log event %d\n", type);
+
+ dcd_event = devm_kmemdup(dev, &dcd_event_rec_template,
+ sizeof(*dcd_event), GFP_KERNEL);
+ if (!dcd_event)
+ return -ENOMEM;
+
+ dcd_event->rec.flags = 0;
+ if (more)
+ dcd_event->rec.flags |= CXL_DCD_EVENT_MORE;
+ dcd_event->rec.event_type = type;
+ dcd_event->rec.extent.start_dpa = cpu_to_le64(start);
+ dcd_event->rec.extent.length = cpu_to_le64(length);
+ memcpy(dcd_event->rec.extent.tag, tag_str,
+ min(sizeof(dcd_event->rec.extent.tag),
+ strlen(tag_str)));
+
+ mes_add_event(mdata, CXL_EVENT_TYPE_DCD,
+ (struct cxl_event_record_raw *)dcd_event);
+
+ /* Fake the irq */
+ cxl_mem_get_event_records(mdata->mds, CXLDEV_EVENT_STATUS_DCD);
+
+ return 0;
+}
+
+static void mark_extent_sent(struct device *dev, unsigned long long start)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_extent_data *ext;
+
+ guard(mutex)(&mdata->ext_lock);
+ /* Move the extent from the FM list to the sent list */
+ ext = xa_erase(&mdata->dc_fm_extents, start);
+ if (!ext) {
+ dev_err(dev, "Extent %#llx not found in FM list\n", start);
+ return;
+ }
+ if (xa_insert(&mdata->dc_sent_extents, ext->dpa_start, ext, GFP_KERNEL))
+ dev_err(dev, "Failed to mark extent %#llx sent\n", ext->dpa_start);
+}
+
+/*
+ * Format <start>:<length>:<tag>:<more_flag>
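+ * e.g. "0x100000000:0x40000000:tag0:0" (illustrative values only)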
+ *
+ * start and length must be a multiple of the configured region block size.
+ * Tag can be any string up to 16 bytes.
+ *
+ * Extents must not overlap any existing extent.
+ *
+ * If the more flag is set, an additional extent without the more flag is
+ * expected to follow to complete the test transaction with the host.
+ */
+static ssize_t __dc_inject_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count,
+ bool shared)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ unsigned long long start, length, more;
+ char *len_str, *tag_str, *more_str;
+ size_t buf_len = count;
+ int rc;
+
+ char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+ if (!start_str)
+ return -ENOMEM;
+
+ len_str = strnchr(start_str, buf_len, ':');
+ if (!len_str) {
+ dev_err(dev, "Extent failed to find len_str: %s\n", start_str);
+ return -EINVAL;
+ }
+
+ *len_str = '\0';
+ len_str += 1;
+ buf_len -= strlen(start_str) + 1;
+
+ tag_str = strnchr(len_str, buf_len, ':');
+ if (!tag_str) {
+ dev_err(dev, "Extent failed to find tag_str: %s\n", len_str);
+ return -EINVAL;
+ }
+ *tag_str = '\0';
+ tag_str += 1;
+ buf_len -= strlen(len_str) + 1;
+
+ more_str = strnchr(tag_str, buf_len, ':');
+ if (!more_str) {
+ dev_err(dev, "Extent failed to find more_str: %s\n", tag_str);
+ return -EINVAL;
+ }
+ *more_str = '\0';
+ more_str += 1;
+
+ if (kstrtoull(start_str, 0, &start)) {
+ dev_err(dev, "Extent failed to parse start: %s\n", start_str);
+ return -EINVAL;
+ }
+
+ if (kstrtoull(len_str, 0, &length)) {
+ dev_err(dev, "Extent failed to parse length: %s\n", len_str);
+ return -EINVAL;
+ }
+
+ if (kstrtoull(more_str, 0, &more)) {
+ dev_err(dev, "Extent failed to parse more: %s\n", more_str);
+ return -EINVAL;
+ }
+
+ if (!new_extent_valid(dev, start, length))
+ return -EINVAL;
+
+ rc = devm_add_fm_extent(dev, start, length, tag_str, shared);
+ if (rc) {
+ dev_err(dev, "Failed to add extent DPA:%#llx LEN:%#llx; %d\n",
+ start, length, rc);
+ return rc;
+ }
+
+ mark_extent_sent(dev, start);
+ rc = log_dc_event(mdata, DCD_ADD_CAPACITY, start, length, tag_str, more);
+ if (rc) {
+ dev_err(dev, "Failed to add event %d\n", rc);
+ return rc;
+ }
+
+ return count;
+}
+
+static ssize_t dc_inject_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_inject_extent_store(dev, attr, buf, count, false);
+}
+static DEVICE_ATTR_WO(dc_inject_extent);
+
+static ssize_t dc_inject_shared_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_inject_extent_store(dev, attr, buf, count, true);
+}
+static DEVICE_ATTR_WO(dc_inject_shared_extent);
+
+static ssize_t __dc_del_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count,
+ enum dc_event type)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ unsigned long long start, length;
+ char *len_str;
+ int rc;
+
+ char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+ if (!start_str)
+ return -ENOMEM;
+
+ len_str = strnchr(start_str, count, ':');
+ if (!len_str) {
+ dev_err(dev, "Failed to find len_str: %s\n", start_str);
+ return -EINVAL;
+ }
+ *len_str = '\0';
+ len_str += 1;
+
+ if (kstrtoull(start_str, 0, &start)) {
+ dev_err(dev, "Failed to parse start: %s\n", start_str);
+ return -EINVAL;
+ }
+
+ if (kstrtoull(len_str, 0, &length)) {
+ dev_err(dev, "Failed to parse length: %s\n", len_str);
+ return -EINVAL;
+ }
+
+ dc_delete_extent(dev, start, length);
+
+ if (type == DCD_FORCED_CAPACITY_RELEASE)
+ dev_dbg(dev, "Forcing delete of extent %#llx len:%#llx\n",
+ start, length);
+
+ rc = log_dc_event(mdata, type, start, length, "", false);
+ if (rc) {
+ dev_err(dev, "Failed to add event %d\n", rc);
+ return rc;
+ }
+
+ return count;
+}
+
+/*
+ * Format <start>:<length>
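+ * e.g. "0x100000000:0x40000000" (illustrative values only)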
+ */
+static ssize_t dc_del_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_del_extent_store(dev, attr, buf, count,
+ DCD_RELEASE_CAPACITY);
+}
+static DEVICE_ATTR_WO(dc_del_extent);
+
+static ssize_t dc_force_del_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_del_extent_store(dev, attr, buf, count,
+ DCD_FORCED_CAPACITY_RELEASE);
+}
+static DEVICE_ATTR_WO(dc_force_del_extent);
+
static struct attribute *cxl_mock_mem_attrs[] = {
&dev_attr_security_lock.attr,
&dev_attr_event_trigger.attr,
&dev_attr_fw_buf_checksum.attr,
&dev_attr_sanitize_timeout.attr,
+ &dev_attr_dc_inject_extent.attr,
+ &dev_attr_dc_inject_shared_extent.attr,
+ &dev_attr_dc_del_extent.attr,
+ &dev_attr_dc_force_del_extent.attr,
NULL
};
ATTRIBUTE_GROUPS(cxl_mock_mem);
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v8 01/21] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
2024-12-11 3:42 ` [PATCH v8 01/21] cxl/mbox: Flag " Ira Weiny
@ 2025-01-03 22:57 ` Dan Williams
2025-01-07 1:10 ` Ira Weiny
0 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2025-01-03 22:57 UTC (permalink / raw)
To: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
Ira Weiny wrote:
> Per the CXL 3.1 specification software must check the Command Effects
> Log (CEL) for dynamic capacity command support.
>
> Detect support for the DCD commands while reading the CEL, including:
>
> Get DC Config
> Get DC Extent List
> Add DC Response
> Release DC
>
> Based on an original patch by Navneet Singh.
>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
> Reviewed-by: Li Ming <ming.li@zohomail.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
> drivers/cxl/core/mbox.c | 33 +++++++++++++++++++++++++++++++++
> drivers/cxl/cxlmem.h | 15 +++++++++++++++
> 2 files changed, 48 insertions(+)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 548564c770c02c0a4571a00ae3f6de8f63183183..599934d066518341eb6ea9fc3319cd7098cbc2f3 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -164,6 +164,34 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
> }
> }
>
> +static bool cxl_is_dcd_command(u16 opcode)
> +{
> +#define CXL_MBOX_OP_DCD_CMDS 0x48
> +
> + return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
> +}
> +
> +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
> + u16 opcode)
> +{
> + switch (opcode) {
> + case CXL_MBOX_OP_GET_DC_CONFIG:
> + set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> + break;
> + case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> + set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
> + break;
> + case CXL_MBOX_OP_ADD_DC_RESPONSE:
> + set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
> + break;
> + case CXL_MBOX_OP_RELEASE_DC:
> + set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
> + break;
> + default:
> + break;
> + }
> +}
> +
> static bool cxl_is_poison_command(u16 opcode)
> {
> #define CXL_MBOX_OP_POISON_CMDS 0x43
> @@ -751,6 +779,11 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> enabled++;
> }
>
> + if (cxl_is_dcd_command(opcode)) {
> + cxl_set_dcd_cmd_enabled(mds, opcode);
> + enabled++;
> + }
> +
> dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
> enabled ? "enabled" : "unsupported by driver");
> }
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 2a25d1957ddb9772b8d4dca92534ba76a909f8b3..e8907c403edbd83c8a36b8d013c6bc3391207ee6 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -239,6 +239,15 @@ struct cxl_event_state {
> struct mutex log_lock;
> };
>
> +/* Device enabled DCD commands */
> +enum dcd_cmd_enabled_bits {
> + CXL_DCD_ENABLED_GET_CONFIG,
> + CXL_DCD_ENABLED_GET_EXTENT_LIST,
> + CXL_DCD_ENABLED_ADD_RESPONSE,
> + CXL_DCD_ENABLED_RELEASE,
> + CXL_DCD_ENABLED_MAX
> +};
> +
> /* Device enabled poison commands */
> enum poison_cmd_enabled_bits {
> CXL_POISON_ENABLED_LIST,
> @@ -461,6 +470,7 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
> * @lsa_size: Size of Label Storage Area
> * (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
> * @firmware_version: Firmware version for the memory device.
> + * @dcd_cmds: List of DCD commands implemented by memory device
> * @enabled_cmds: Hardware commands found enabled in CEL.
> * @exclusive_cmds: Commands that are kernel-internal only
> * @total_bytes: sum of all possible capacities
> @@ -485,6 +495,7 @@ struct cxl_memdev_state {
> struct cxl_dev_state cxlds;
> size_t lsa_size;
> char firmware_version[0x10];
> + DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
Can you clarify why cxl_memdev_state needs this bitmap? In the case of
'security' and 'poison' functionality there is a subset of functionality
that can be enabled if some of the commands are missing. Like poison
listing is still possible even if poison injection is missing. In the
case of DCD it is all or nothing.
In short, I do not think the cxl_memdev_state object needs to track
anything more than a single "DCD capable" flag, and cxl_walk_cel() can
check for all commands locally without carrying that bitmap around
indefinitely.
Something simple like:
cxl_walk_cel()
	...
	for (...) {
		if (cxl_is_dcd_command(opcode))
			set_bit(opcode & 0xf, &dcd_commands);
	}
	if (dcd_commands == 0xf)
		mds->dcd_enabled = true;
	else if (dcd_commands)
		dev_dbg(...);
...otherwise it begs the question why the driver would care about
anything other than "all" dcd commands?
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 01/21] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
2025-01-03 22:57 ` Dan Williams
@ 2025-01-07 1:10 ` Ira Weiny
0 siblings, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2025-01-07 1:10 UTC (permalink / raw)
To: Dan Williams, Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron,
Jonathan Corbet, Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
Dan Williams wrote:
> Ira Weiny wrote:
> > Per the CXL 3.1 specification software must check the Command Effects
> > Log (CEL) for dynamic capacity command support.
> >
> > Detect support for the DCD commands while reading the CEL, including:
> >
> > Get DC Config
> > Get DC Extent List
> > Add DC Response
> > Release DC
> >
> > Based on an original patch by Navneet Singh.
> >
> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Reviewed-by: Fan Ni <fan.ni@samsung.com>
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> > Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
> > Reviewed-by: Li Ming <ming.li@zohomail.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
[snip]
> > /* Device enabled poison commands */
> > enum poison_cmd_enabled_bits {
> > CXL_POISON_ENABLED_LIST,
> > @@ -461,6 +470,7 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
> > * @lsa_size: Size of Label Storage Area
> > * (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
> > * @firmware_version: Firmware version for the memory device.
> > + * @dcd_cmds: List of DCD commands implemented by memory device
> > * @enabled_cmds: Hardware commands found enabled in CEL.
> > * @exclusive_cmds: Commands that are kernel-internal only
> > * @total_bytes: sum of all possible capacities
> > @@ -485,6 +495,7 @@ struct cxl_memdev_state {
> > struct cxl_dev_state cxlds;
> > size_t lsa_size;
> > char firmware_version[0x10];
> > + DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
>
> Can you clarify why cxl_memdev_state needs this bitmap?
Nope. I think you are right that there is no need for partial support.
> In the case of
> 'security' and 'poison' functionality there is a subset of functionality
> that can be enabled if some of the commands are missing. Like poison
> listing is still possible even if poison injection is missing. In the
> case of DCD it is all or nothing.
>
> In short, I do not think the cxl_memdev_state object needs to track
> anything more than a single "DCD capable" flag, and cxl_walk_cel() can
> check for all commands locally without carrying that bitmap around
> indefinitely.
>
> Something simple like:
>
> cxl_walk_cel()
> for (...) {
> if (cxl_is_dcd_command()
> set_bit(opcode & 0xf, &dcd_commands);
> }
> if (dcd_commands == 0xf)
> mds->dcd_enabled = true;
> else if (dcd_commands)
> dev_dbg(...)
>
> ...otherwise it begs the question why the driver would care about
> anything other than "all" dcd commands?
Yea this could be done.
Ira
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 02/21] cxl/mem: Read dynamic capacity configuration from the device
2024-12-11 3:42 ` [PATCH v8 02/21] cxl/mem: Read dynamic capacity configuration from the device Ira Weiny
@ 2025-01-15 2:35 ` Dan Williams
2025-01-15 13:55 ` Alejandro Lucero Palau
2025-01-15 20:32 ` Ira Weiny
0 siblings, 2 replies; 34+ messages in thread
From: Dan Williams @ 2025-01-15 2:35 UTC (permalink / raw)
To: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron, Jonathan Corbet,
Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
Ira Weiny wrote:
> Devices which optionally support Dynamic Capacity (DC) are configured
> via mailbox commands. CXL 3.1 requires the host to issue the Get DC
> Configuration command in order to properly configure DCDs. Without the
> Get DC Configuration command DCD can't be supported.
>
> Implement the DC mailbox commands as specified in CXL 3.1 section
> 8.2.9.9.9 (opcodes 48XXh) to read and store the DCD configuration
> information. Disable DCD if DCD is not supported. Leverage the Get DC
> Configuration command supported bit to indicate if DCD is supported.
>
> Linux has no use for the trailing fields of the Get Dynamic Capacity
> Configuration Output Payload (Total number of supported extents, number
> of available extents, total number of supported tags, and number of
> available tags). Avoid defining those fields to use the more useful
> dynamic C array.
>
> Based on an original patch by Navneet Singh.
>
> Cc: Li Ming <ming.li@zohomail.com>
> Cc: Kees Cook <kees@kernel.org>
> Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
> Cc: linux-hardening@vger.kernel.org
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes:
> [iweiny: fix EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify)]
> [iweiny: limit variable scope in cxl_dev_dynamic_capacity_identify]
> ---
> drivers/cxl/core/mbox.c | 166 +++++++++++++++++++++++++++++++++++++++++++++++-
> drivers/cxl/cxlmem.h | 64 ++++++++++++++++++-
> drivers/cxl/pci.c | 4 ++
> 3 files changed, 232 insertions(+), 2 deletions(-)
>
[snipping the C code to do a data structure review first]
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index e8907c403edbd83c8a36b8d013c6bc3391207ee6..05a0718aea73b3b2a02c608bae198eac7c462523 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -403,6 +403,7 @@ enum cxl_devtype {
> CXL_DEVTYPE_CLASSMEM,
> };
>
> +#define CXL_MAX_DC_REGION 8
Please no, lets not sign up to have the "which cxl 'region' concept are
you referring to?" debate in perpetuity. "DPA partition", "DPA
resource", "DPA capacity" anything but "region".
> /**
> * struct cxl_dpa_perf - DPA performance property entry
> * @dpa_range: range for DPA address
> @@ -434,6 +435,8 @@ struct cxl_dpa_perf {
> * @dpa_res: Overall DPA resource tree for the device
> * @pmem_res: Active Persistent memory capacity configuration
> * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + * region
> * @serial: PCIe Device Serial Number
> * @type: Generic Memory Class device or Vendor Specific Memory device
> * @cxl_mbox: CXL mailbox context
> @@ -449,11 +452,23 @@ struct cxl_dev_state {
> struct resource dpa_res;
> struct resource pmem_res;
> struct resource ram_res;
> + struct resource dc_res[CXL_MAX_DC_REGION];
This is throwing off cargo-cult alarms. The named pmem_res and ram_res
served us well up until the point where DPA partitions grew past 2 types
at well defined locations. I like the array of resources idea, but that
begs the question why not put all partition information into an array?
This would also head off complications later on in this series where the
DPA capacity reservation and allocation flows have "dc" sidecars bolted
on rather than general semantics like "allocating from partition index N
means that all partitions indices less than N need to be skipped and
marked reserved".
> u64 serial;
> enum cxl_devtype type;
> struct cxl_mailbox cxl_mbox;
> };
>
> +#define CXL_DC_REGION_STRLEN 8
> +struct cxl_dc_region_info {
> + u64 base;
> + u64 decode_len;
> + u64 len;
Duplicating partition information in multiple places, like
mds->dc_region[X].base and cxlds->dc_res[X].start, feels like an
RFC-quality decision for expediency that needs to reconciled on the way
to upstream.
> + u64 blk_size;
> + u32 dsmad_handle;
> + u8 flags;
> + u8 name[CXL_DC_REGION_STRLEN];
No, lets not entertain:
printk("%s\n", mds->dc_region[index].name);
...when:
printk("dc%d\n", index);
...will do.
> +};
> +
> static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
> {
> return dev_get_drvdata(cxl_mbox->host);
> @@ -473,7 +488,9 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
> * @dcd_cmds: List of DCD commands implemented by memory device
> * @enabled_cmds: Hardware commands found enabled in CEL.
> * @exclusive_cmds: Commands that are kernel-internal only
> - * @total_bytes: sum of all possible capacities
> + * @total_bytes: length of all possible capacities
> + * @static_bytes: length of possible static RAM and PMEM partitions
> + * @dynamic_bytes: length of possible DC partitions (DC Regions)
> * @volatile_only_bytes: hard volatile capacity
> * @persistent_only_bytes: hard persistent capacity
I have regrets that cxl_memdev_state permanently carries runtime storage
for init time variables, lets not continue down that path with DCD
enabling.
> * @partition_align_bytes: alignment size for partition-able capacity
> @@ -483,6 +500,8 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
> * @next_persistent_bytes: persistent capacity change pending device reset
> * @ram_perf: performance data entry matched to RAM partition
> * @pmem_perf: performance data entry matched to PMEM partition
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> * @event: event log driver state
> * @poison: poison driver state info
> * @security: security driver state info
> @@ -499,6 +518,8 @@ struct cxl_memdev_state {
> DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> u64 total_bytes;
> + u64 static_bytes;
> + u64 dynamic_bytes;
> u64 volatile_only_bytes;
> u64 persistent_only_bytes;
> u64 partition_align_bytes;
> @@ -510,6 +531,9 @@ struct cxl_memdev_state {
> struct cxl_dpa_perf ram_perf;
> struct cxl_dpa_perf pmem_perf;
>
> + u8 nr_dc_region;
> + struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
DPA capacity is a generic CXL.mem concern and partition information is
contained cxl_dev_state. Lets find a way to not need partially redundant
data structures across in cxl_memdev_state and cxl_dev_state.
DCD introduces the concept of "decode size vs usable capacity" into the
partition information, but I see no reason to conceptually tie that to
only DCD. Fabio's memory hole patches show that there is already a
memory-hole concept in the CXL arena. DCD is just saying "be prepared for
the concept of DPA partitions with memory holes at the end".
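In data structure terms, something like this sketch (field names
invented here):

	/* partition info where allocatable capacity can trail the decoded span */
	struct cxl_dpa_partition {
		struct range range;	/* full decoded DPA span */
		u64 usable;		/* allocatable bytes, <= span length */
		int mode;		/* ram, pmem, or dc */
	};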
> +
> struct cxl_event_state event;
> struct cxl_poison_state poison;
> struct cxl_security_state security;
> @@ -708,6 +732,32 @@ struct cxl_mbox_set_partition_info {
>
> #define CXL_SET_PARTITION_IMMEDIATE_FLAG BIT(0)
>
> +/* See CXL 3.1 Table 8-163 get dynamic capacity config Input Payload */
> +struct cxl_mbox_get_dc_config_in {
> + u8 region_count;
> + u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.1 Table 8-164 get dynamic capacity config Output Payload */
> +struct cxl_mbox_get_dc_config_out {
> + u8 avail_region_count;
> + u8 regions_returned;
> + u8 rsvd[6];
> + /* See CXL 3.1 Table 8-165 */
> + struct cxl_dc_region_config {
> + __le64 region_base;
> + __le64 region_decode_length;
> + __le64 region_length;
> + __le64 region_block_size;
> + __le32 region_dsmad_handle;
> + u8 flags;
> + u8 rsvd[3];
> + } __packed region[] __counted_by(regions_returned);
Yes, the spec unfortunately uses "region" for this partition info
payload. This would be a good place to say "CXL spec calls this 'region'
but Linux calls it 'partition' not to be confused with the Linux 'struct
cxl_region' or all the other usages of 'region' in the specification".
Linux is not obligated to follow the questionable naming decisions of
specifications.
> + /* Trailing fields unused */
> +} __packed;
> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> +#define CXL_DCD_BLOCK_LINE_SIZE 0x40
> +
> /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> struct cxl_mbox_set_timestamp_in {
> __le64 timestamp;
> @@ -831,6 +881,7 @@ enum {
> int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
> struct cxl_mbox_cmd *cmd);
> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> @@ -844,6 +895,17 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> enum cxl_event_log_type type,
> enum cxl_event_type event_type,
> const uuid_t *uuid, union cxl_event *evt);
> +
> +static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> +{
> + return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +}
> +
> +static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
> +{
> + clear_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +}
This hunk is out of place, and per the last patch, I think it can just be
a flag that does not need a helper.
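i.e. something as simple as (sketch):

	struct cxl_memdev_state {
		/* ... */
		bool dcd_enabled;
	};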
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 02/21] cxl/mem: Read dynamic capacity configuration from the device
2025-01-15 2:35 ` Dan Williams
@ 2025-01-15 13:55 ` Alejandro Lucero Palau
2025-01-15 20:48 ` Ira Weiny
2025-01-16 6:33 ` Dan Williams
2025-01-15 20:32 ` Ira Weiny
1 sibling, 2 replies; 34+ messages in thread
From: Alejandro Lucero Palau @ 2025-01-15 13:55 UTC (permalink / raw)
To: Dan Williams, Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron,
Jonathan Corbet, Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
linux-doc, nvdimm, linux-kernel, linux-hardening, Li Ming
On 1/15/25 02:35, Dan Williams wrote:
> Ira Weiny wrote:
>> Devices which optionally support Dynamic Capacity (DC) are configured
>> via mailbox commands. CXL 3.1 requires the host to issue the Get DC
>> Configuration command in order to properly configure DCDs. Without the
>> Get DC Configuration command DCD can't be supported.
>>
>> Implement the DC mailbox commands as specified in CXL 3.1 section
>> 8.2.9.9.9 (opcodes 48XXh) to read and store the DCD configuration
>> information. Disable DCD if DCD is not supported. Leverage the Get DC
>> Configuration command supported bit to indicate if DCD is supported.
>>
>> Linux has no use for the trailing fields of the Get Dynamic Capacity
>> Configuration Output Payload (Total number of supported extents, number
>> of available extents, total number of supported tags, and number of
>> available tags). Avoid defining those fields to use the more useful
>> dynamic C array.
>>
>> Based on an original patch by Navneet Singh.
>>
>> Cc: Li Ming <ming.li@zohomail.com>
>> Cc: Kees Cook <kees@kernel.org>
>> Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
>> Cc: linux-hardening@vger.kernel.org
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>>
>> ---
>> Changes:
>> [iweiny: fix EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify)]
>> [iweiny: limit variable scope in cxl_dev_dynamic_capacity_identify]
>> ---
>> drivers/cxl/core/mbox.c | 166 +++++++++++++++++++++++++++++++++++++++++++++++-
>> drivers/cxl/cxlmem.h | 64 ++++++++++++++++++-
>> drivers/cxl/pci.c | 4 ++
>> 3 files changed, 232 insertions(+), 2 deletions(-)
>>
> [snipping the C code to do a data structure review first]
>
>> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
>> index e8907c403edbd83c8a36b8d013c6bc3391207ee6..05a0718aea73b3b2a02c608bae198eac7c462523 100644
>> --- a/drivers/cxl/cxlmem.h
>> +++ b/drivers/cxl/cxlmem.h
>> @@ -403,6 +403,7 @@ enum cxl_devtype {
>> CXL_DEVTYPE_CLASSMEM,
>> };
>>
>> +#define CXL_MAX_DC_REGION 8
> Please no, lets not sign up to have the "which cxl 'region' concept are
> you referring to?" debate in perpetuity. "DPA partition", "DPA
> resource", "DPA capacity" anything but "region".
>
>
This next comment is not my main point to discuss in this email
(resources initialization is), but I'll seize on it to give my view
here.
Dan, you say later we (Linux) are not obligated to use "questionable
naming decisions of specifications", but we should not confuse people
either.
Maybe CXL_MAX_DC_HW_REGION would help here, for differentiating it from
the kernel software cxl region construct. I think we will need a CXL
kernel dictionary sooner or later ...
>> /**
>> * struct cxl_dpa_perf - DPA performance property entry
>> * @dpa_range: range for DPA address
>> @@ -434,6 +435,8 @@ struct cxl_dpa_perf {
>> * @dpa_res: Overall DPA resource tree for the device
>> * @pmem_res: Active Persistent memory capacity configuration
>> * @ram_res: Active Volatile memory capacity configuration
>> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
>> + * region
>> * @serial: PCIe Device Serial Number
>> * @type: Generic Memory Class device or Vendor Specific Memory device
>> * @cxl_mbox: CXL mailbox context
>> @@ -449,11 +452,23 @@ struct cxl_dev_state {
>> struct resource dpa_res;
>> struct resource pmem_res;
>> struct resource ram_res;
>> + struct resource dc_res[CXL_MAX_DC_REGION];
> This is throwing off cargo-cult alarms. The named pmem_res and ram_res
> served us well up until the point where DPA partitions grew past 2 types
> at well defined locations. I like the array of resources idea, but that
> begs the question why not put all partition information into an array?
>
> This would also head off complications later on in this series where the
> DPA capacity reservation and allocation flows have "dc" sidecars bolted
> on rather than general semantics like "allocating from partition index N
> means that all partitions indices less than N need to be skipped and
> marked reserved".
I guess this is likely how you want to change the type2 resource
initialization issue, and where I'm afraid these two patchsets are going
to collide.
If that is the case, both are going to miss the next kernel cycle since
it means major changes, but let's discuss it without further delays for
the sake of implementing the accepted changes as soon as possible, and I
guess with a close sync between Ira and me.
BTW, in the case of the Type2, there are more things to discuss which I
do there.
Thank you
>> u64 serial;
>> enum cxl_devtype type;
>> struct cxl_mailbox cxl_mbox;
>> };
>>
>> +#define CXL_DC_REGION_STRLEN 8
>> +struct cxl_dc_region_info {
>> + u64 base;
>> + u64 decode_len;
>> + u64 len;
> Duplicating partition information in multiple places, like
> mds->dc_region[X].base and cxlds->dc_res[X].start, feels like an
> RFC-quality decision for expediency that needs to reconciled on the way
> to upstream.
>
>> + u64 blk_size;
>> + u32 dsmad_handle;
>> + u8 flags;
>> + u8 name[CXL_DC_REGION_STRLEN];
> No, lets not entertain:
>
> printk("%s\n", mds->dc_region[index].name);
>
> ...when:
>
> printk("dc%d\n", index);
>
> ...will do.
>
>> +};
>> +
>> static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
>> {
>> return dev_get_drvdata(cxl_mbox->host);
>> @@ -473,7 +488,9 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
>> * @dcd_cmds: List of DCD commands implemented by memory device
>> * @enabled_cmds: Hardware commands found enabled in CEL.
>> * @exclusive_cmds: Commands that are kernel-internal only
>> - * @total_bytes: sum of all possible capacities
>> + * @total_bytes: length of all possible capacities
>> + * @static_bytes: length of possible static RAM and PMEM partitions
>> + * @dynamic_bytes: length of possible DC partitions (DC Regions)
>> * @volatile_only_bytes: hard volatile capacity
>> * @persistent_only_bytes: hard persistent capacity
> I have regrets that cxl_memdev_state permanently carries runtime storage
> for init time variables, lets not continue down that path with DCD
> enabling.
>
>> * @partition_align_bytes: alignment size for partition-able capacity
>> @@ -483,6 +500,8 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
>> * @next_persistent_bytes: persistent capacity change pending device reset
>> * @ram_perf: performance data entry matched to RAM partition
>> * @pmem_perf: performance data entry matched to PMEM partition
>> + * @nr_dc_region: number of DC regions implemented in the memory device
>> + * @dc_region: array containing info about the DC regions
>> * @event: event log driver state
>> * @poison: poison driver state info
>> * @security: security driver state info
>> @@ -499,6 +518,8 @@ struct cxl_memdev_state {
>> DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
>> DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
>> u64 total_bytes;
>> + u64 static_bytes;
>> + u64 dynamic_bytes;
>> u64 volatile_only_bytes;
>> u64 persistent_only_bytes;
>> u64 partition_align_bytes;
>> @@ -510,6 +531,9 @@ struct cxl_memdev_state {
>> struct cxl_dpa_perf ram_perf;
>> struct cxl_dpa_perf pmem_perf;
>>
>> + u8 nr_dc_region;
>> + struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> DPA capacity is a generic CXL.mem concern and partition information is
> contained cxl_dev_state. Lets find a way to not need partially redundant
> data structures across in cxl_memdev_state and cxl_dev_state.
>
> DCD introduces the concept of "decode size vs usable capacity" into the
> partition information, but I see no reason to conceptually tie that to
> only DCD. Fabio's memory hole patches show that there is already a
> memory-hole concept in the CXL arena. DCD is just saying "be prepared for
> the concept of DPA partitions with memory holes at the end".
>
>> +
>> struct cxl_event_state event;
>> struct cxl_poison_state poison;
>> struct cxl_security_state security;
>> @@ -708,6 +732,32 @@ struct cxl_mbox_set_partition_info {
>>
>> #define CXL_SET_PARTITION_IMMEDIATE_FLAG BIT(0)
>>
>> +/* See CXL 3.1 Table 8-163 get dynamic capacity config Input Payload */
>> +struct cxl_mbox_get_dc_config_in {
>> + u8 region_count;
>> + u8 start_region_index;
>> +} __packed;
>> +
>> +/* See CXL 3.1 Table 8-164 get dynamic capacity config Output Payload */
>> +struct cxl_mbox_get_dc_config_out {
>> + u8 avail_region_count;
>> + u8 regions_returned;
>> + u8 rsvd[6];
>> + /* See CXL 3.1 Table 8-165 */
>> + struct cxl_dc_region_config {
>> + __le64 region_base;
>> + __le64 region_decode_length;
>> + __le64 region_length;
>> + __le64 region_block_size;
>> + __le32 region_dsmad_handle;
>> + u8 flags;
>> + u8 rsvd[3];
>> + } __packed region[] __counted_by(regions_returned);
> Yes, the spec unfortunately uses "region" for this partition info
> payload. This would be a good place to say "CXL spec calls this 'region'
> but Linux calls it 'partition' not to be confused with the Linux 'struct
> cxl_region' or all the other usages of 'region' in the specification".
>
> Linux is not obligated to follow the questionable naming decisions of
> specifications.
>
>> + /* Trailing fields unused */
>> +} __packed;
>> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
>> +#define CXL_DCD_BLOCK_LINE_SIZE 0x40
>> +
>> /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>> struct cxl_mbox_set_timestamp_in {
>> __le64 timestamp;
>> @@ -831,6 +881,7 @@ enum {
>> int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
>> struct cxl_mbox_cmd *cmd);
>> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
>> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
>> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>> int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
>> @@ -844,6 +895,17 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>> enum cxl_event_log_type type,
>> enum cxl_event_type event_type,
>> const uuid_t *uuid, union cxl_event *evt);
>> +
>> +static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
>> +{
>> + return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
>> +}
>> +
>> +static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
>> +{
>> + clear_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
>> +}
> This hunk is out of place, and per the last patch, I think it can just be
> a flag that does not need a helper.
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 02/21] cxl/mem: Read dynamic capacity configuration from the device
2025-01-15 2:35 ` Dan Williams
2025-01-15 13:55 ` Alejandro Lucero Palau
@ 2025-01-15 20:32 ` Ira Weiny
2025-01-15 22:34 ` Dan Williams
1 sibling, 1 reply; 34+ messages in thread
From: Ira Weiny @ 2025-01-15 20:32 UTC (permalink / raw)
To: Dan Williams, Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron,
Jonathan Corbet, Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
Dan Williams wrote:
> Ira Weiny wrote:
[snip]
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index e8907c403edbd83c8a36b8d013c6bc3391207ee6..05a0718aea73b3b2a02c608bae198eac7c462523 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -403,6 +403,7 @@ enum cxl_devtype {
> > CXL_DEVTYPE_CLASSMEM,
> > };
> >
> > +#define CXL_MAX_DC_REGION 8
>
> Please no, lets not sign up to have the "which cxl 'region' concept are
> you referring to?" debate in perpetuity. "DPA partition", "DPA
> resource", "DPA capacity" anything but "region".
>
I'm inclined to agree with Alejandro on this one. I've walked this
tightrope quite a bit with this series. But there are other places where
we have chosen to change the verbiage from the spec and it has made it
difficult for newcomers to correlate the spec with the code.
So I like Alejandro's idea of adding "HW" to the name to indicate that we
are talking about a spec or hardware defined thing.
That said I am open to changing some names where it is clear it is a
software structure. I'll audit the series for that.
> > /**
> > * struct cxl_dpa_perf - DPA performance property entry
> > * @dpa_range: range for DPA address
> > @@ -434,6 +435,8 @@ struct cxl_dpa_perf {
> > * @dpa_res: Overall DPA resource tree for the device
> > * @pmem_res: Active Persistent memory capacity configuration
> > * @ram_res: Active Volatile memory capacity configuration
> > + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> > + * region
> > * @serial: PCIe Device Serial Number
> > * @type: Generic Memory Class device or Vendor Specific Memory device
> > * @cxl_mbox: CXL mailbox context
> > @@ -449,11 +452,23 @@ struct cxl_dev_state {
> > struct resource dpa_res;
> > struct resource pmem_res;
> > struct resource ram_res;
> > + struct resource dc_res[CXL_MAX_DC_REGION];
>
> This is throwing off cargo-cult alarms. The named pmem_res and ram_res
> served us well up until the point where DPA partitions grew past 2 types
> at well defined locations. I like the array of resources idea, but that
> begs the question why not put all partition information into an array?
For me that keeps it clear what is pmem/ram/dc.
>
> This would also head off complications later on in this series where the
> DPA capacity reservation and allocation flows have "dc" sidecars bolted
> on rather than general semantics like "allocating from partition index N
> means that all partitions indices less than N need to be skipped and
> marked reserved".
I assume you are speaking of this patch:
cxl/hdm: Add dynamic capacity size support to endpoint decoders
I took some care to make the skip calculations and tracking generic. But
I think you are correct that more could be done.
We would also need to adjust cxl_dpa_alloc().
I thought there might be other places where the memdev_state would need to
be adjusted. But it looks like those are minor issues.
Overall, I did spend a lot of time making the skip generic and honestly it
is probably worth making it all generic. However, I think it will take
careful review and testing to make sure we don't break things.
>
> > u64 serial;
> > enum cxl_devtype type;
> > struct cxl_mailbox cxl_mbox;
> > };
> >
> > +#define CXL_DC_REGION_STRLEN 8
> > +struct cxl_dc_region_info {
> > + u64 base;
> > + u64 decode_len;
> > + u64 len;
>
> Duplicating partition information in multiple places, like
> mds->dc_region[X].base and cxlds->dc_res[X].start, feels like an
> RFC-quality decision for expediency that needs to reconciled on the way
> to upstream.
I think this was done to follow a pattern of the mds being passed around
rather than creating resources right when partitions are read.
Furthermore this stands to hold this information in CPU endianness rather
than holding an array of region info coming from the hardware.
Let's see how other changes fall out before I go hacking this though.
>
> > + u64 blk_size;
> > + u32 dsmad_handle;
> > + u8 flags;
> > + u8 name[CXL_DC_REGION_STRLEN];
>
> No, lets not entertain:
>
> printk("%s\n", mds->dc_region[index].name);
>
> ...when:
>
> printk("dc%d\n", index);
>
> ...will do.
Actually these buffers provide backing storage for the (struct
resource)dc_res[x].name pointers to point to.
It could be devm_kmalloc()'ed memory but this actually worked ok. ram/pmem
just use static strings.
I could add a comment to that effect. Or I could define a static array
thusly... I think this would go against defining a resource map though.
@@ -1460,6 +1459,10 @@ static int add_dpa_res(struct device *dev, struct resource *parent,
return 0;
}
+static const char * const dc_resource_name[] = {
+ "dc0", "dc1", "dc2", "dc3", "dc4", "dc5", "dc6", "dc7",
+};
+
int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
{
struct cxl_dev_state *cxlds = &mds->cxlds;
@@ -1486,7 +1489,8 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
struct cxl_dc_region_info *dcr = &mds->dc_region[i];
rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
- dcr->base, dcr->decode_len, dcr->name);
+ dcr->base, dcr->decode_len,
+ dc_resource_name[i]);
if (rc)
return rc;
}
>
> > +};
> > +
> > static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
> > {
> > return dev_get_drvdata(cxl_mbox->host);
> > @@ -473,7 +488,9 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
> > * @dcd_cmds: List of DCD commands implemented by memory device
> > * @enabled_cmds: Hardware commands found enabled in CEL.
> > * @exclusive_cmds: Commands that are kernel-internal only
> > - * @total_bytes: sum of all possible capacities
> > + * @total_bytes: length of all possible capacities
> > + * @static_bytes: length of possible static RAM and PMEM partitions
> > + * @dynamic_bytes: length of possible DC partitions (DC Regions)
> > * @volatile_only_bytes: hard volatile capacity
> > * @persistent_only_bytes: hard persistent capacity
>
> I have regrets that cxl_memdev_state permanently carries runtime storage
> for init time variables, lets not continue down that path with DCD
> enabling.
Yea I thought these were used more than they are. I feel like over time
they got used less and less. Perhaps now they can be removed.
>
> > * @partition_align_bytes: alignment size for partition-able capacity
> > @@ -483,6 +500,8 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
> > * @next_persistent_bytes: persistent capacity change pending device reset
> > * @ram_perf: performance data entry matched to RAM partition
> > * @pmem_perf: performance data entry matched to PMEM partition
> > + * @nr_dc_region: number of DC regions implemented in the memory device
> > + * @dc_region: array containing info about the DC regions
> > * @event: event log driver state
> > * @poison: poison driver state info
> > * @security: security driver state info
> > @@ -499,6 +518,8 @@ struct cxl_memdev_state {
> > DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> > DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> > u64 total_bytes;
> > + u64 static_bytes;
> > + u64 dynamic_bytes;
> > u64 volatile_only_bytes;
> > u64 persistent_only_bytes;
> > u64 partition_align_bytes;
> > @@ -510,6 +531,9 @@ struct cxl_memdev_state {
> > struct cxl_dpa_perf ram_perf;
> > struct cxl_dpa_perf pmem_perf;
> >
> > + u8 nr_dc_region;
> > + struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
>
> DPA capacity is a generic CXL.mem concern and partition information is
> contained cxl_dev_state. Lets find a way to not need partially redundant
> data structures across in cxl_memdev_state and cxl_dev_state.
I'll think on it.
>
> DCD introduces the concept of "decode size vs usable capacity" into the
> partition information, but I see no reason to conceptually tie that to
> only DCD. Fabio's memory hole patches show that there is already a
> memory-hole concept in the CXL arena. DCD is just saying "be prepared for
> the concept of DPA partitions with memory holes at the end".
I'm not clear how this relates. ram and pmem partitions can already have
holes at the end if not mapped.
>
> > +
> > struct cxl_event_state event;
> > struct cxl_poison_state poison;
> > struct cxl_security_state security;
> > @@ -708,6 +732,32 @@ struct cxl_mbox_set_partition_info {
> >
> > #define CXL_SET_PARTITION_IMMEDIATE_FLAG BIT(0)
> >
> > +/* See CXL 3.1 Table 8-163 get dynamic capacity config Input Payload */
> > +struct cxl_mbox_get_dc_config_in {
> > + u8 region_count;
> > + u8 start_region_index;
> > +} __packed;
> > +
> > +/* See CXL 3.1 Table 8-164 get dynamic capacity config Output Payload */
> > +struct cxl_mbox_get_dc_config_out {
> > + u8 avail_region_count;
> > + u8 regions_returned;
> > + u8 rsvd[6];
> > + /* See CXL 3.1 Table 8-165 */
> > + struct cxl_dc_region_config {
> > + __le64 region_base;
> > + __le64 region_decode_length;
> > + __le64 region_length;
> > + __le64 region_block_size;
> > + __le32 region_dsmad_handle;
> > + u8 flags;
> > + u8 rsvd[3];
> > + } __packed region[] __counted_by(regions_returned);
>
> Yes, the spec unfortunately uses "region" for this partition info
> payload. This would be a good place to say "CXL spec calls this 'region'
> but Linux calls it 'partition' not to be confused with the Linux 'struct
> cxl_region' or all the other usages of 'region' in the specification".
In this case I totally disagree. This is a structure being filled in by
the hardware and is directly related to the spec. I think I would rather
change
s/cxl_dc_region_info/cxl_dc_partition_info/
And leave this. Which draws a more distinct line between what is
specified in hardware vs a software construct.
>
> Linux is not obligated to follow the questionable naming decisions of
> specifications.
We are not. But as Alejandro says it can be confusing if we don't make
some association to the spec.
What do you think about the HW/SW line I propose above?
>
> > + /* Trailing fields unused */
> > +} __packed;
> > +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> > +#define CXL_DCD_BLOCK_LINE_SIZE 0x40
> > +
> > /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> > struct cxl_mbox_set_timestamp_in {
> > __le64 timestamp;
> > @@ -831,6 +881,7 @@ enum {
> > int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
> > struct cxl_mbox_cmd *cmd);
> > int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> > +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
> > int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> > int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> > int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> > @@ -844,6 +895,17 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> > enum cxl_event_log_type type,
> > enum cxl_event_type event_type,
> > const uuid_t *uuid, union cxl_event *evt);
> > +
> > +static inline bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> > +{
> > + return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> > +}
> > +
> > +static inline void cxl_disable_dcd(struct cxl_memdev_state *mds)
> > +{
> > + clear_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> > +}
>
> This hunk is out of place,
Not sure why they are out of place. This is the first patch they are used
in so they were added here. I could push them back a patch but you have
mentioned before you don't like to see functions defined without a use.
So I structured the series this way.
> and per the last patch, I think it can just be
> a flag that does not need a helper.
Agreed. This has already been changed. But these functions remain
defined here along with where they are used for proper context.
Ira
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 02/21] cxl/mem: Read dynamic capacity configuration from the device
2025-01-15 13:55 ` Alejandro Lucero Palau
@ 2025-01-15 20:48 ` Ira Weiny
2025-01-16 6:33 ` Dan Williams
1 sibling, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2025-01-15 20:48 UTC (permalink / raw)
To: Alejandro Lucero Palau, Dan Williams, Ira Weiny, Dave Jiang,
Fan Ni, Jonathan Cameron, Jonathan Corbet, Andrew Morton,
Kees Cook, Gustavo A. R. Silva
Cc: Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
linux-doc, nvdimm, linux-kernel, linux-hardening, Li Ming
Alejandro Lucero Palau wrote:
>
> On 1/15/25 02:35, Dan Williams wrote:
> > Ira Weiny wrote:
[snip]
> >> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> >> index e8907c403edbd83c8a36b8d013c6bc3391207ee6..05a0718aea73b3b2a02c608bae198eac7c462523 100644
> >> --- a/drivers/cxl/cxlmem.h
> >> +++ b/drivers/cxl/cxlmem.h
> >> @@ -403,6 +403,7 @@ enum cxl_devtype {
> >> CXL_DEVTYPE_CLASSMEM,
> >> };
> >>
> >> +#define CXL_MAX_DC_REGION 8
> > Please no, lets not sign up to have the "which cxl 'region' concept are
> > you referring to?" debate in perpetuity. "DPA partition", "DPA
> > resource", "DPA capacity" anything but "region".
> >
> >
>
> This next comment is not my main point to discuss in this email
> (resources initialization is), but I'll seize on it to give my view
> here.
>
> Dan, you say later we (Linux) are not obligated to use "questionable
> naming decisions of specifications", but we should not confuse people
> either.
>
> Maybe CXL_MAX_DC_HW_REGION would help here, for differentiating it from
> the kernel software cxl region construct. I think we will need a CXL
> kernel dictionary sooner or later ...
I agree. I have had folks confused between spec and code and I'm really trying
to differentiate hardware region vs software partition.
>
> >> /**
> >> * struct cxl_dpa_perf - DPA performance property entry
> >> * @dpa_range: range for DPA address
> >> @@ -434,6 +435,8 @@ struct cxl_dpa_perf {
> >> * @dpa_res: Overall DPA resource tree for the device
> >> * @pmem_res: Active Persistent memory capacity configuration
> >> * @ram_res: Active Volatile memory capacity configuration
> >> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> >> + * region
> >> * @serial: PCIe Device Serial Number
> >> * @type: Generic Memory Class device or Vendor Specific Memory device
> >> * @cxl_mbox: CXL mailbox context
> >> @@ -449,11 +452,23 @@ struct cxl_dev_state {
> >> struct resource dpa_res;
> >> struct resource pmem_res;
> >> struct resource ram_res;
> >> + struct resource dc_res[CXL_MAX_DC_REGION];
> > This is throwing off cargo-cult alarms. The named pmem_res and ram_res
> > served us well up until the point where DPA partitions grew past 2 types
> > at well defined locations. I like the array of resources idea, but that
> > begs the question why not put all partition information into an array?
> >
> > This would also head off complications later on in this series where the
> > DPA capacity reservation and allocation flows have "dc" sidecars bolted
> > on rather than general semantics like "allocating from partition index N
> > means that all partitions indices less than N need to be skipped and
> > marked reserved".
>
>
> I guess this is likely how you want to change the type2 resource
> initialization issue, and where I'm afraid these two patchsets are going
> to collide.
>
> If that is the case, both are going to miss the next kernel cycle since
> it means major changes, but let's discuss it without further delays for
> the sake of implementing the accepted changes as soon as possible, and I
> guess with a close sync between Ira and I.
>
> BTW, in the case of the Type2, there are more things to discuss which I
> do there.
I'm looking at your set again because I think I missed this detail.
After looking into this more I think a singular array of resources could be
done without too much major surgery.
The question for type 2 is what interface does the core export for
accelerators to request these resources? Or do we export a function like
add_dpa_res() and let drivers do that directly?
Dan is concerned about storing duplicate information about the partitions.
For DCD I think it should call add_dpa_res() to create resources on the
fly as I detect partition information from the device. For type 2 they
can call that however/whenever they want.
We can even make this an xarray for complete flexibility with how many
partitions a device can have. Although I'm not sure if the spec allows
for that on type 2. Does it?
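Something like this, purely as an illustration:

	/* partitions tracked in an xarray keyed by partition index */
	struct cxl_dev_state {
		struct resource dpa_res;
		struct xarray dpa_partitions;	/* index -> struct resource * */
		/* ... */
	};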
Ira
[snip]
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 02/21] cxl/mem: Read dynamic capacity configuration from the device
2025-01-15 20:32 ` Ira Weiny
@ 2025-01-15 22:34 ` Dan Williams
2025-01-16 10:32 ` Jonathan Cameron
2025-01-22 18:02 ` Ira Weiny
0 siblings, 2 replies; 34+ messages in thread
From: Dan Williams @ 2025-01-15 22:34 UTC (permalink / raw)
To: Ira Weiny, Dan Williams, Dave Jiang, Fan Ni, Jonathan Cameron,
Jonathan Corbet, Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
Ira Weiny wrote:
> Dan Williams wrote:
> > Ira Weiny wrote:
>
> [snip]
>
> > > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > > index e8907c403edbd83c8a36b8d013c6bc3391207ee6..05a0718aea73b3b2a02c608bae198eac7c462523 100644
> > > --- a/drivers/cxl/cxlmem.h
> > > +++ b/drivers/cxl/cxlmem.h
> > > @@ -403,6 +403,7 @@ enum cxl_devtype {
> > > CXL_DEVTYPE_CLASSMEM,
> > > };
> > >
> > > +#define CXL_MAX_DC_REGION 8
> >
> > Please no, lets not sign up to have the "which cxl 'region' concept are
> > you referring to?" debate in perpetuity. "DPA partition", "DPA
> > resource", "DPA capacity" anything but "region".
> >
>
> I'm inclined to agree with Alejandro on this one. I've walked this
> tightrope quite a bit with this series. But there are other places where
> we have chosen to change the verbiage from the spec and it has made it
> difficult for newcomers to correlate the spec with the code.
>
> So I like Alejandro's idea of adding "HW" to the name to indicate that we
> are talking about a spec or hardware defined thing.
See below, the only people that could potentially be bothered by the
lack of spec terminology matching are the very same people that are
sophisticated enough to have read the spec to know it's a problem.
>
> That said I am open to changing some names where it is clear it is a
> software structure. I'll audit the series for that.
>
> > > u64 serial;
> > > enum cxl_devtype type;
> > > struct cxl_mailbox cxl_mbox;
> > > };
> > >
> > > +#define CXL_DC_REGION_STRLEN 8
> > > +struct cxl_dc_region_info {
> > > + u64 base;
> > > + u64 decode_len;
> > > + u64 len;
> >
> > Duplicating partition information in multiple places, like
> > mds->dc_region[X].base and cxlds->dc_res[X].start, feels like an
> > RFC-quality decision for expediency that needs to reconciled on the way
> > to upstream.
>
> I think this was done to follow a pattern of the mds being passed around
> rather than creating resources right when partitions are read.
>
> Furthermore this stands to hold this information in CPU endianness rather
> than holding an array of region info coming from the hardware.
Yes, the ask is to translate all of this into common information that lives
at the cxl_dev_state level.
>
> Let's see how other changes fall out before I go hacking this though.
>
> >
> > > + u64 blk_size;
> > > + u32 dsmad_handle;
> > > + u8 flags;
> > > + u8 name[CXL_DC_REGION_STRLEN];
> >
> > No, lets not entertain:
> >
> > printk("%s\n", mds->dc_region[index].name);
> >
> > ...when:
> >
> > printk("dc%d\n", index);
> >
> > ...will do.
>
> Actually these arrays provide backing storage for the (struct
> resource)dc_res[x].name pointers to point to.
I missed that specific detail, but I still challenge whether this
precision is needed, especially since it makes the data structure
messier. Given these names are for debug only and multi-partition DCD
devices seem unlikely to ever exist, just use a static shared name for
adding to ->dpa_res.
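A minimal sketch of what I mean, with the field names taken from the
structures quoted above and the loop bounds (nr_partitions) assumed for
illustration:

	/* one static, shared name instead of per-partition name buffers */
	static const char dc_name[] = "dc";

	for (int i = 0; i < nr_partitions; i++) {
		struct resource *res = &cxlds->dc_res[i];

		res->name = dc_name;
		res->start = mds->dc_region[i].base;
		res->end = res->start + mds->dc_region[i].decode_len - 1;
		res->flags = IORESOURCE_MEM;
	}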
>
> >
> > DCD introduces the concept of "decode size vs usable capacity" into the
> > partition information, but I see no reason to conceptually tie that to
> > only DCD. Fabio's memory hole patches show that there is already a
> > memory-hole concept in the CXL arena. DCD is just saying "be prepared for
> > the concept of DPA partitions with memory holes at the end".
>
> I'm not clear how this relates. ram and pmem partitions can already have
> holes at the end if not mapped.
The distinction is "can this DPA capacity be allocated to a region" the
new holes introduced by DCD are cases where the partition size is
greater than the allocatable size. Contrast to ram and pmem the
allocatable size is always identical to the partition size.
> > > +
> > > struct cxl_event_state event;
> > > struct cxl_poison_state poison;
> > > struct cxl_security_state security;
> > > @@ -708,6 +732,32 @@ struct cxl_mbox_set_partition_info {
> > >
> > > #define CXL_SET_PARTITION_IMMEDIATE_FLAG BIT(0)
> > >
> > > +/* See CXL 3.1 Table 8-163 get dynamic capacity config Input Payload */
> > > +struct cxl_mbox_get_dc_config_in {
> > > + u8 region_count;
> > > + u8 start_region_index;
> > > +} __packed;
> > > +
> > > +/* See CXL 3.1 Table 8-164 get dynamic capacity config Output Payload */
> > > +struct cxl_mbox_get_dc_config_out {
> > > + u8 avail_region_count;
> > > + u8 regions_returned;
> > > + u8 rsvd[6];
> > > + /* See CXL 3.1 Table 8-165 */
> > > + struct cxl_dc_region_config {
> > > + __le64 region_base;
> > > + __le64 region_decode_length;
> > > + __le64 region_length;
> > > + __le64 region_block_size;
> > > + __le32 region_dsmad_handle;
> > > + u8 flags;
> > > + u8 rsvd[3];
> > > + } __packed region[] __counted_by(regions_returned);
> >
> > Yes, the spec unfortunately uses "region" for this partition info
> > payload. This would be a good place to say "CXL spec calls this 'region'
> > but Linux calls it 'partition' not to be confused with the Linux 'struct
> > cxl_region' or all the other usages of 'region' in the specification".
>
> In this case I totally disagree. This is a structure being filled in by
> the hardware and is directly related to the spec. I think I would rather
> change
>
> s/cxl_dc_region_info/cxl_dc_partition_info/
>
> And leave this, which draws a more distinct line between what is
> specified by hardware and what is a software construct.
>
> >
> > Linux is not obligated to follow the questionable naming decisions of
> > specifications.
>
> We are not. But as Alejandro says it can be confusing if we don't make
> some association to the spec.
>
> What do you think about the HW/SW line I propose above?
Rename to cxl_dc_partition_info and drop the region_ prefixes, sure.
Otherwise, for this init-time only concern I would much rather deal with
the confusion of:
"why does Linux call this partition when the spec calls it region?":
which only trips up people that already know the difference because they read the
spec. In that case the comment will answer their confusion.
...versus:
"why are there multiple region concepts in the CXL subsystem": which
trips up everyone that greps through the CXL subsystem especially those
that have no intention of ever reading the spec.
* Re: [PATCH v8 02/21] cxl/mem: Read dynamic capacity configuration from the device
2025-01-15 13:55 ` Alejandro Lucero Palau
2025-01-15 20:48 ` Ira Weiny
@ 2025-01-16 6:33 ` Dan Williams
1 sibling, 0 replies; 34+ messages in thread
From: Dan Williams @ 2025-01-16 6:33 UTC (permalink / raw)
To: Alejandro Lucero Palau, Dan Williams, Ira Weiny, Dave Jiang,
Fan Ni, Jonathan Cameron, Jonathan Corbet, Andrew Morton,
Kees Cook, Gustavo A. R. Silva
Cc: Davidlohr Bueso, Alison Schofield, Vishal Verma, linux-cxl,
linux-doc, nvdimm, linux-kernel, linux-hardening, Li Ming
Alejandro Lucero Palau wrote:
>
> On 1/15/25 02:35, Dan Williams wrote:
> > Ira Weiny wrote:
> >> Devices which optionally support Dynamic Capacity (DC) are configured
> >> via mailbox commands. CXL 3.1 requires the host to issue the Get DC
> >> Configuration command in order to properly configure DCDs. Without the
> >> Get DC Configuration command DCD can't be supported.
> >>
> >> Implement the DC mailbox commands as specified in CXL 3.1 section
> >> 8.2.9.9.9 (opcodes 48XXh) to read and store the DCD configuration
> >> information. Disable DCD if DCD is not supported. Leverage the Get DC
> >> Configuration command supported bit to indicate if DCD is supported.
> >>
> >> Linux has no use for the trailing fields of the Get Dynamic Capacity
> >> Configuration Output Payload (Total number of supported extents, number
> >> of available extents, total number of supported tags, and number of
> >> available tags). Avoid defining those fields to use the more useful
> >> dynamic C array.
> >>
> >> Based on an original patch by Navneet Singh.
> >>
> >> Cc: Li Ming <ming.li@zohomail.com>
> >> Cc: Kees Cook <kees@kernel.org>
> >> Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
> >> Cc: linux-hardening@vger.kernel.org
> >> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >>
> >> ---
> >> Changes:
> >> [iweiny: fix EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify)]
> >> [iweiny: limit variable scope in cxl_dev_dynamic_capacity_identify]
> >> ---
> >> drivers/cxl/core/mbox.c | 166 +++++++++++++++++++++++++++++++++++++++++++++++-
> >> drivers/cxl/cxlmem.h | 64 ++++++++++++++++++-
> >> drivers/cxl/pci.c | 4 ++
> >> 3 files changed, 232 insertions(+), 2 deletions(-)
> >>
> > [snipping the C code to do a data structure review first]
> >
> >> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> >> index e8907c403edbd83c8a36b8d013c6bc3391207ee6..05a0718aea73b3b2a02c608bae198eac7c462523 100644
> >> --- a/drivers/cxl/cxlmem.h
> >> +++ b/drivers/cxl/cxlmem.h
> >> @@ -403,6 +403,7 @@ enum cxl_devtype {
> >> CXL_DEVTYPE_CLASSMEM,
> >> };
> >>
> >> +#define CXL_MAX_DC_REGION 8
> > Please no, lets not sign up to have the "which cxl 'region' concept are
> > you referring to?" debate in perpetuity. "DPA partition", "DPA
> > resource", "DPA capacity" anything but "region".
> >
> >
>
> This next comment is not my main point in this email (resource
> initialization is), but I'll seize the chance to give my view on it.
>
> Dan, you say later we (Linux) are not obligated to use "questionable
> naming decisions of specifications", but we should not confuse people
> either.
>
> Maybe CXL_MAX_DC_HW_REGION would help here, for differentiating it from
> the kernel software cxl region construct. I think we will need a CXL
> kernel dictionary sooner or later ...
I addressed this in the reply to Ira, and yes, one of the first entries
in a Linux CXL terminology document is that "regions" are mapped memory
and partitions are DPA capacity.
> >> /**
> >> * struct cxl_dpa_perf - DPA performance property entry
> >> * @dpa_range: range for DPA address
> >> @@ -434,6 +435,8 @@ struct cxl_dpa_perf {
> >> * @dpa_res: Overall DPA resource tree for the device
> >> * @pmem_res: Active Persistent memory capacity configuration
> >> * @ram_res: Active Volatile memory capacity configuration
> >> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> >> + * region
> >> * @serial: PCIe Device Serial Number
> >> * @type: Generic Memory Class device or Vendor Specific Memory device
> >> * @cxl_mbox: CXL mailbox context
> >> @@ -449,11 +452,23 @@ struct cxl_dev_state {
> >> struct resource dpa_res;
> >> struct resource pmem_res;
> >> struct resource ram_res;
> >> + struct resource dc_res[CXL_MAX_DC_REGION];
> > This is throwing off cargo-cult alarms. The named pmem_res and ram_res
> > served us well up until the point where DPA partitions grew past 2 types
> > at well-defined locations. I like the array of resources idea, but that
> > raises the question: why not put all partition information into an array?
> >
> > This would also head off complications later on in this series where the
> > DPA capacity reservation and allocation flows have "dc" sidecars bolted
> > on rather than general semantics like "allocating from partition index N
> > means that all partitions indices less than N need to be skipped and
> > marked reserved".
>
>
> I guess this is likely how you want to change the Type 2 resource
> initialization issue, and where I'm afraid these two patchsets are going
> to collide.
>
> If that is the case, both are going to miss the next kernel cycle since
> it means major changes, but let's discuss it without further delays for
> the sake of implementing the accepted changes as soon as possible, and I
> guess with a close sync between Ira and me.
Type-2, as far as I can see, is a priority because it is in support of a
real device that end users can get their hands on today, right?
DCD, as far as I know, has no known product intercepts, just QEMU
emulation. So there is no rush there. If someone has information to the
contrary please share, if able.
The Type-2 series can also be prioritized because it is something we can get
done without cross-subsystem entanglements. So, no, I do not think the
door is closed on Type-2 for v6.14, but it is certainly close, which is
why I am throwing out code suggestions along with the review.
Otherwise, when 2 patch series trip over the same design wart (i.e. the
no longer suitable explicit ->ram_res and ->pmem_res members of 'struct
cxl_dev_state'), the cleanup needs to come first.
> BTW, in the case of the Type 2 series, there are more things to discuss,
> which I do there.
Yes, hopefully it goes smoother after this point.
* Re: [PATCH v8 02/21] cxl/mem: Read dynamic capacity configuration from the device
2025-01-15 22:34 ` Dan Williams
@ 2025-01-16 10:32 ` Jonathan Cameron
2025-01-22 21:02 ` Dan Williams
2025-01-22 18:02 ` Ira Weiny
1 sibling, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2025-01-16 10:32 UTC (permalink / raw)
To: Dan Williams
Cc: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Corbet, Andrew Morton,
Kees Cook, Gustavo A. R. Silva, Davidlohr Bueso, Alison Schofield,
Vishal Verma, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
On Wed, 15 Jan 2025 14:34:36 -0800
Dan Williams <dan.j.williams@intel.com> wrote:
> Ira Weiny wrote:
> > Dan Williams wrote:
> > > Ira Weiny wrote:
> >
> > [snip]
> >
> > > > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > > > index e8907c403edbd83c8a36b8d013c6bc3391207ee6..05a0718aea73b3b2a02c608bae198eac7c462523 100644
> > > > --- a/drivers/cxl/cxlmem.h
> > > > +++ b/drivers/cxl/cxlmem.h
> > > > @@ -403,6 +403,7 @@ enum cxl_devtype {
> > > > CXL_DEVTYPE_CLASSMEM,
> > > > };
> > > >
> > > > +#define CXL_MAX_DC_REGION 8
> > >
> > > Please no, let's not sign up to have the "which cxl 'region' concept are
> > > you referring to?" debate in perpetuity. "DPA partition", "DPA
> > > resource", "DPA capacity", anything but "region".
> > >
> >
> > I'm inclined to agree with Alejandro on this one. I've walked this
> > tightrope quite a bit with this series. But there are other places where
> > we have chosen to change the verbiage from the spec and it has made it
> > difficult for newcomers to correlate the spec with the code.
> >
> > So I like Alejandro's idea of adding "HW" to the name to indicate that we
> > are talking about a spec or hardware defined thing.
>
> See below, the only people that could potentially be bothered by the
> lack of spec terminology matching are the very same people that are
> sophisticated enough to have read the spec to know it's a problem.
It's confusing me. :) I know the confusion source exists but
that doesn't mean I remember how all the terms match up.
>
> >
> > That said I am open to changing some names where it is clear it is a
> > software structure. I'll audit the series for that.
> >
> > > > u64 serial;
> > > > enum cxl_devtype type;
> > > > struct cxl_mailbox cxl_mbox;
> > > > };
> > > >
> > > > +#define CXL_DC_REGION_STRLEN 8
> > > > +struct cxl_dc_region_info {
> > > > + u64 base;
> > > > + u64 decode_len;
> > > > + u64 len;
> > >
> > > Duplicating partition information in multiple places, like
> > > mds->dc_region[X].base and cxlds->dc_res[X].start, feels like an
> > > RFC-quality decision for expediency that needs to be reconciled on the way
> > > to upstream.
> >
> > I think this was done to follow a pattern of the mds being passed around
> > rather than creating resources right when partitions are read.
> >
> > Furthermore this stands to hold this information in CPU endianness rather
> > than holding an array of region info coming from the hardware.
>
> Yes, the ask is to translate all of this into common information that lives
> at the cxl_dev_state level.
>
> >
> > Let's see how other changes fall out before I go hacking this though.
> >
> > >
> > > > + u64 blk_size;
> > > > + u32 dsmad_handle;
> > > > + u8 flags;
> > > > + u8 name[CXL_DC_REGION_STRLEN];
> > >
> > > No, lets not entertain:
> > >
> > > printk("%s\n", mds->dc_region[index].name);
> > >
> > > ...when:
> > >
> > > printk("dc%d\n", index);
> > >
> > > ...will do.
> >
> > Actually these arrays provide backing storage for the (struct
> > resource)dc_res[x].name pointers to point to.
>
> I missed that specific detail, but I still challenge whether this
> precision is needed, especially since it makes the data structure
> messier. Given these names are for debug only and multi-partition DCD
> devices seem unlikely to ever exist, just use a static shared name for
> adding to ->dpa_res.
Given the read-only shared concept relies on multiple hardware dc regions
(I think they map to partitions), we are very likely to see
multiples. (Maybe I'm lost in terminology as well.)
...
> > >
> > > Linux is not obligated to follow the questionable naming decisions of
> > > specifications.
> >
> > We are not. But as Alejandro says it can be confusing if we don't make
> > some association to the spec.
> >
> > What do you think about the HW/SW line I propose above?
>
> Rename to cxl_dc_partition_info and drop the region_ prefixes, sure.
>
> Otherwise, for this init-time only concern I would much rather deal with
> the confusion of:
>
> "why does Linux call this partition when the spec calls it region?":
> which only trips up people that already know the difference because they read the
> spec. In that case the comment will answer their confusion.
>
> ...versus:
>
> "why are there multiple region concepts in the CXL subsystem": which
> trips up everyone that greps through the CXL subsystem especially those
> that have no intention of ever reading the spec.
versus a one-time rename of all internal infrastructure to align to the
spec and only keep the confusion at the boundaries where we have ABI.
A horrible option, but how often are those diving into the code bothered
about the userspace/kernel interaction terminology?
Anyhow, they are all horrible choices.
Jonathan
* Re: [PATCH v8 02/21] cxl/mem: Read dynamic capacity configuration from the device
2025-01-15 22:34 ` Dan Williams
2025-01-16 10:32 ` Jonathan Cameron
@ 2025-01-22 18:02 ` Ira Weiny
2025-01-22 21:30 ` Dan Williams
1 sibling, 1 reply; 34+ messages in thread
From: Ira Weiny @ 2025-01-22 18:02 UTC (permalink / raw)
To: Dan Williams, Ira Weiny, Dave Jiang, Fan Ni, Jonathan Cameron,
Jonathan Corbet, Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
Dan Williams wrote:
> Ira Weiny wrote:
> > Dan Williams wrote:
> > > Ira Weiny wrote:
> >
> > [snip]
> >
> > > > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > > > index e8907c403edbd83c8a36b8d013c6bc3391207ee6..05a0718aea73b3b2a02c608bae198eac7c462523 100644
> > > > --- a/drivers/cxl/cxlmem.h
> > > > +++ b/drivers/cxl/cxlmem.h
> > > > @@ -403,6 +403,7 @@ enum cxl_devtype {
> > > > CXL_DEVTYPE_CLASSMEM,
> > > > };
> > > >
> > > > +#define CXL_MAX_DC_REGION 8
> > >
> > > Please no, let's not sign up to have the "which cxl 'region' concept are
> > > you referring to?" debate in perpetuity. "DPA partition", "DPA
> > > resource", "DPA capacity", anything but "region".
> > >
> >
> > I'm inclined to agree with Alejandro on this one. I've walked this
> > tightrope quite a bit with this series. But there are other places where
> > we have chosen to change the verbiage from the spec and it has made it
> > difficult for newcomers to correlate the spec with the code.
> >
> > So I like Alejandro's idea of adding "HW" to the name to indicate that we
> > are talking about a spec or hardware defined thing.
>
> See below, the only people that could potentially be bothered by the
> lack of spec terminology matching are the very same people that are
> sophisticated enough to have read the spec to know it's a problem.
Honestly at this point I think the code has deviated enough from the spec
that it is just not worth me saying any more. I'll change everything to
partition and field the questions later as they come.
>
> >
> > That said I am open to changing some names where it is clear it is a
> > software structure. I'll audit the series for that.
> >
> > > > u64 serial;
> > > > enum cxl_devtype type;
> > > > struct cxl_mailbox cxl_mbox;
> > > > };
> > > >
> > > > +#define CXL_DC_REGION_STRLEN 8
> > > > +struct cxl_dc_region_info {
> > > > + u64 base;
> > > > + u64 decode_len;
> > > > + u64 len;
> > >
> > > Duplicating partition information in multiple places, like
> > > mds->dc_region[X].base and cxlds->dc_res[X].start, feels like an
> > > RFC-quality decision for expediency that needs to be reconciled on the way
> > > to upstream.
> >
> > I think this was done to follow a pattern of the mds being passed around
> > rather than creating resources right when partitions are read.
> >
> > Furthermore this stands to hold this information in CPU endianness rather
> > than holding an array of region info coming from the hardware.
>
> Yes, the ask is to translate all of this into common information that lives
> at the cxl_dev_state level.
Yeah. And build on what you have in the DPA rework.
>
> >
> > Let's see how other changes fall out before I go hacking this though.
> >
> > >
> > > > + u64 blk_size;
> > > > + u32 dsmad_handle;
> > > > + u8 flags;
> > > > + u8 name[CXL_DC_REGION_STRLEN];
> > >
> > > No, lets not entertain:
> > >
> > > printk("%s\n", mds->dc_region[index].name);
> > >
> > > ...when:
> > >
> > > printk("dc%d\n", index);
> > >
> > > ...will do.
> >
> > Actually these arrays provide backing storage for the (struct
> > resource)dc_res[x].name pointers to point to.
>
> I missed that specific detail, but I still challenge whether this
> precision is needed, especially since it makes the data structure
> messier. Given these names are for debug only and multi-partition DCD
> devices seem unlikely to ever exist, just use a static shared name for
> adding to ->dpa_res.
Using a static name is good.
>
> >
> > >
> > > DCD introduces the concept of "decode size vs usable capacity" into the
> > > partition information, but I see no reason to conceptually tie that to
> > > only DCD. Fabio's memory hole patches show that there is already a
> > > memory-hole concept in the CXL arena. DCD is just saying "be prepared for
> > > the concept of DPA partitions with memory holes at the end".
> >
> > I'm not clear how this relates. ram and pmem partitions can already have
> > holes at the end if not mapped.
>
> The distinction is "can this DPA capacity be allocated to a region" the
> new holes introduced by DCD are cases where the partition size is
> greater than the allocatable size. Contrast to ram and pmem the
> allocatable size is always identical to the partition size.
I still don't quite get what you are saying. The user can always allocate
a region of the full DCD partition size. It is just that the memory
within that region may not be backed yet (no extents).
>
> > > > +
> > > > struct cxl_event_state event;
> > > > struct cxl_poison_state poison;
> > > > struct cxl_security_state security;
> > > > @@ -708,6 +732,32 @@ struct cxl_mbox_set_partition_info {
> > > >
> > > > #define CXL_SET_PARTITION_IMMEDIATE_FLAG BIT(0)
> > > >
> > > > +/* See CXL 3.1 Table 8-163 get dynamic capacity config Input Payload */
> > > > +struct cxl_mbox_get_dc_config_in {
> > > > + u8 region_count;
> > > > + u8 start_region_index;
> > > > +} __packed;
> > > > +
> > > > +/* See CXL 3.1 Table 8-164 get dynamic capacity config Output Payload */
> > > > +struct cxl_mbox_get_dc_config_out {
> > > > + u8 avail_region_count;
> > > > + u8 regions_returned;
> > > > + u8 rsvd[6];
> > > > + /* See CXL 3.1 Table 8-165 */
> > > > + struct cxl_dc_region_config {
> > > > + __le64 region_base;
> > > > + __le64 region_decode_length;
> > > > + __le64 region_length;
> > > > + __le64 region_block_size;
> > > > + __le32 region_dsmad_handle;
> > > > + u8 flags;
> > > > + u8 rsvd[3];
> > > > + } __packed region[] __counted_by(regions_returned);
> > >
> > > Yes, the spec unfortunately uses "region" for this partition info
> > > payload. This would be a good place to say "CXL spec calls this 'region'
> > > but Linux calls it 'partition' not to be confused with the Linux 'struct
> > > cxl_region' or all the other usages of 'region' in the specification".
> >
> > In this case I totally disagree. This is a structure being filled in by
> > the hardware and is directly related to the spec. I think I would rather
> > change
> >
> > s/cxl_dc_region_info/cxl_dc_partition_info/
> >
> > And leave this, which draws a more distinct line between what is
> > specified by hardware and what is a software construct.
> >
> > >
> > > Linux is not obligated to follow the questionable naming decisions of
> > > specifications.
> >
> > We are not. But as Alejandro says it can be confusing if we don't make
> > some association to the spec.
> >
> > What do you think about the HW/SW line I propose above?
>
> Rename to cxl_dc_partition_info and drop the region_ prefixes, sure.
>
> Otherwise, for this init-time only concern I would much rather deal with
> the confusion of:
>
> "why does Linux call this partition when the spec calls it region?":
But this is not the question I will get. The question will be:
"Where is DCD region processed in the code? I grepped for region and
found nothing."
Or
"I'm searching the spec PDF for DCD partition and can't find that. Where
is DCD partition specified?"
> which only trips up people that already know the difference because they read the
> spec. In that case the comment will answer their confusion.
>
> ...versus:
>
> "why are there multiple region concepts in the CXL subsystem": which
> trips up everyone that greps through the CXL subsystem especially those
> that have no intention of ever reading the spec.
Ok I've already said more than I intended. I will change everything to
partition.
Ira
* Re: [PATCH v8 02/21] cxl/mem: Read dynamic capacity configuration from the device
2025-01-16 10:32 ` Jonathan Cameron
@ 2025-01-22 21:02 ` Dan Williams
0 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2025-01-22 21:02 UTC (permalink / raw)
To: Jonathan Cameron, Dan Williams
Cc: Ira Weiny, Dave Jiang, Fan Ni, Jonathan Corbet, Andrew Morton,
Kees Cook, Gustavo A. R. Silva, Davidlohr Bueso, Alison Schofield,
Vishal Verma, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
Jonathan Cameron wrote:
> On Wed, 15 Jan 2025 14:34:36 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
>
> > Ira Weiny wrote:
> > > Dan Williams wrote:
> > > > Ira Weiny wrote:
> > >
> > > [snip]
> > >
> > > > > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > > > > index e8907c403edbd83c8a36b8d013c6bc3391207ee6..05a0718aea73b3b2a02c608bae198eac7c462523 100644
> > > > > --- a/drivers/cxl/cxlmem.h
> > > > > +++ b/drivers/cxl/cxlmem.h
> > > > > @@ -403,6 +403,7 @@ enum cxl_devtype {
> > > > > CXL_DEVTYPE_CLASSMEM,
> > > > > };
> > > > >
> > > > > +#define CXL_MAX_DC_REGION 8
> > > >
> > > > Please no, let's not sign up to have the "which cxl 'region' concept are
> > > > you referring to?" debate in perpetuity. "DPA partition", "DPA
> > > > resource", "DPA capacity", anything but "region".
> > > >
> > >
> > > I'm inclined to agree with Alejandro on this one. I've walked this
> > > tightrope quite a bit with this series. But there are other places where
> > > we have chosen to change the verbiage from the spec and it has made it
> > > difficult for newcomers to correlate the spec with the code.
> > >
> > > So I like Alejandro's idea of adding "HW" to the name to indicate that we
> > > are talking about a spec or hardware defined thing.
> >
> > See below, the only people that could potentially be bothered by the
> > lack of spec terminology matching are the very same people that are
> > sophisticated enough to have read the spec to know it's a problem.
>
> It's confusing me. :) I know the confusion source exists but
> that doesn't mean I remember how all the terms match up.
CXL 3.1 Figure 9-24 DCD DPA Space Example
In that one diagram it uses "space", "capacity", "partition", and
"region". Linux is free to say "let's just pick one term and stick to
it". "Region" is already oversubscribed.
I agree with Alejandro that a glossary of Linux terms added to the
Documentation is overdue and would help people orient to what maps
where. That would be needed even if the "continue to oversubscribe
'region'" proposal went through to explain "oh, no, not that 'region'
*this* 'region'".
[..]
> > > Actually these arrays provide backing storage for the (struct
> > > resource)dc_res[x].name pointers to point to.
> >
> > I missed that specific detail, but I still challenge whether this
> > precision is needed, especially since it makes the data structure
> > messier. Given these names are for debug only and multi-partition DCD
> > devices seem unlikely to ever exist, just use a static shared name for
> > adding to ->dpa_res.
>
> Given the read-only shared concept relies on multiple hardware dc regions
> (I think they map to partitions), we are very likely to see
> multiples. (Maybe I'm lost in terminology as well.)
Ah, good point. I was focusing on "devices with DPA partitions of
different performance characteristics within the same operation mode" as
being unlikely, but "devices with both shared and non-shared capacity"
indeed seems more likely.
Now, part of the code smell that made me fall out of love with 'enum
cxl_decoder_mode' was this continued confusion between mode names and
partition ids, where printing "dc%d" to the resource name was part of
that smell.
The proposal for what goes into the "name" field of partition resources
in the "DPA metadata is a mess..." series is to disconnect operation
modes from partition indices. A natural consequence of allowing "pmem"
to be partition 0 is that a dynamic device may also have 0 static
capacity, or other arrangements that make the partition-id less
meaningful to userspace.
So instead of needing to print "dc%d" into the resource name field, the
resource name is simply the operation mode: ram, pmem, dynamic ram,
dynamic pmem*, shared ram, shared pmem*.
The implication is that userspace does not need to care about partition
ids, unless and until a device shows up that ships multiple partitions
with the same operation mode, but different performance characteristics.
If that happens userspace would need a knob to disambiguate partitions
with the same operation mode. That does not feel like something that is
threatening to become real in the near term, and partition ids can
continue to be hidden from userspace.
* I doubt we will see dynamic pmem or shared pmem.
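For illustration only (these identifiers are made up, not the ones from
that series), the shape of this is a mode-indexed name table, so a
partition's name never encodes its index:

	/* sketch: resource names keyed by operation mode, not index */
	enum cxl_partition_mode {
		CXL_PARTMODE_RAM,
		CXL_PARTMODE_PMEM,
		CXL_PARTMODE_DYNAMIC_RAM,
		CXL_PARTMODE_DYNAMIC_PMEM,
		CXL_PARTMODE_SHARED_RAM,
		CXL_PARTMODE_SHARED_PMEM,
	};

	static const char * const cxl_partmode_name[] = {
		[CXL_PARTMODE_RAM]          = "ram",
		[CXL_PARTMODE_PMEM]         = "pmem",
		[CXL_PARTMODE_DYNAMIC_RAM]  = "dynamic ram",
		[CXL_PARTMODE_DYNAMIC_PMEM] = "dynamic pmem",
		[CXL_PARTMODE_SHARED_RAM]   = "shared ram",
		[CXL_PARTMODE_SHARED_PMEM]  = "shared pmem",
	};

Two partitions with the same mode then simply share a name, which is
the point: userspace keys off the mode, not the partition id.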
[..]
> > > > Linux is not obligated to follow the questionable naming decisions of
> > > > specifications.
> > >
> > > We are not. But as Alejandro says it can be confusing if we don't make
> > > some association to the spec.
> > >
> > > What do you think about the HW/SW line I propose above?
> >
> > Rename to cxl_dc_partition_info and drop the region_ prefixes, sure.
> >
> > Otherwise, for this init-time only concern I would much rather deal with
> > the confusion of:
> >
> > "why does Linux call this partition when the spec calls it region?":
> > which only trips up people that already know the difference because they read the
> > spec. In that case the comment will answer their confusion.
> >
> > ...versus:
> >
> > "why are there multiple region concepts in the CXL subsystem": which
> > trips up everyone that greps through the CXL subsystem especially those
> > that have no intention of ever reading the spec.
>
> versus a one-time rename of all internal infrastructure to align to the
> spec and only keep the confusion at the boundaries where we have ABI.
That's just it, to date 'region' has always meant 'struct cxl_region' in
drivers/cxl/, so there is no one-time rename to be had. The decision is
whether to decline new claimers of the 'region' moniker and create a
document to explain that term, or play the "dc region" ambiguity game
for the duration.
I vote "diverge from spec and document".
> A horrible option, but how often are those diving into the code bothered
> about the userspace/kernel interaction terminology?
>
> Anyhow, they are all horrible choices.
Agree!
* Re: [PATCH v8 02/21] cxl/mem: Read dynamic capacity configuration from the device
2025-01-22 18:02 ` Ira Weiny
@ 2025-01-22 21:30 ` Dan Williams
0 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2025-01-22 21:30 UTC (permalink / raw)
To: Ira Weiny, Dan Williams, Dave Jiang, Fan Ni, Jonathan Cameron,
Jonathan Corbet, Andrew Morton, Kees Cook, Gustavo A. R. Silva
Cc: Dan Williams, Davidlohr Bueso, Alison Schofield, Vishal Verma,
Ira Weiny, linux-cxl, linux-doc, nvdimm, linux-kernel,
linux-hardening, Li Ming
Ira Weiny wrote:
[..]
> > The distinction is "can this DPA capacity be allocated to a region" the
> > new holes introduced by DCD are cases where the partition size is
> > greater than the allocatable size. Contrast to ram and pmem the
> > allocatable size is always identical to the partition size.
>
> I still don't quite get what you are saying. The user can always allocate
> a region of the full DCD partition size. It is just that the memory
Quick note: "region of the full DCD partition size" means something to
me immediately where "region of full DC region size" would tempt a
clarifying question.
> within that region may not be backed yet (no extents).
A partition is a boundary between DPA capacity ranges of different
operation modes / performance characteristics. A region is constructed
of decoders that reference decode length. The usable capacity of that
decode range, even when fully populated with extents, may be less than
the decode range of the decoder / partition. That is similar in concept
to the low-memory-hole where the decoders map more in their range than
the usable capacity seen by the region.
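Concretely (numbers invented for illustration): a DC partition could
advertise a 1TB decode length of which only 768GB can ever be populated
with extents; the trailing 256GB is decodable but never allocatable,
just like the span covering a low-memory hole.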
[..]
> > > > Linux is not obligated to follow the questionable naming decisions of
> > > > specifications.
> > >
> > > We are not. But as Alejandro says it can be confusing if we don't make
> > > some association to the spec.
> > >
> > > What do you think about the HW/SW line I propose above?
> >
> > Rename to cxl_dc_partition_info and drop the region_ prefixes, sure.
> >
> > Otherwise, for this init-time only concern I would much rather deal with
> > the confusion of:
> >
> > "why does Linux call this partition when the spec calls it region?":
>
> But this is not the question I will get. The question will be.
>
> "Where is DCD region processed in the code? I grepped for region and
> found nothing."
Make grep find that definition:
$ git grep -n -C2 -i "dc region"
Documentation/driver-api/cxl/memory-devices.rst-387-
Documentation/driver-api/cxl/memory-devices.rst-388-partition: a span of DPA capacity delineated by "Get Partition Info", or
Documentation/driver-api/cxl/memory-devices.rst:389:"Get Dynamic Capacity Configuration". A "DC Region" is equivalent to a
> Or
>
> "I'm searching the spec PDF for DCD partition and can't find that. Where
> is DCD partition specified?"
If they are coming from the code first, that's why we have spec-reference
comments in the code.
Thread overview: 34+ messages
2024-12-11 3:42 [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD) Ira Weiny
2024-12-11 3:42 ` [PATCH v8 01/21] cxl/mbox: Flag " Ira Weiny
2025-01-03 22:57 ` Dan Williams
2025-01-07 1:10 ` Ira Weiny
2024-12-11 3:42 ` [PATCH v8 02/21] cxl/mem: Read dynamic capacity configuration from the device Ira Weiny
2025-01-15 2:35 ` Dan Williams
2025-01-15 13:55 ` Alejandro Lucero Palau
2025-01-15 20:48 ` Ira Weiny
2025-01-16 6:33 ` Dan Williams
2025-01-15 20:32 ` Ira Weiny
2025-01-15 22:34 ` Dan Williams
2025-01-16 10:32 ` Jonathan Cameron
2025-01-22 21:02 ` Dan Williams
2025-01-22 18:02 ` Ira Weiny
2025-01-22 21:30 ` Dan Williams
2024-12-11 3:42 ` [PATCH v8 03/21] cxl/core: Separate region mode from decoder mode Ira Weiny
2024-12-11 3:42 ` [PATCH v8 04/21] cxl/region: Add dynamic capacity decoder and region modes Ira Weiny
2024-12-11 3:42 ` [PATCH v8 05/21] cxl/hdm: Add dynamic capacity size support to endpoint decoders Ira Weiny
2024-12-11 3:42 ` [PATCH v8 06/21] cxl/cdat: Gather DSMAS data for DCD regions Ira Weiny
2024-12-11 3:42 ` [PATCH v8 07/21] cxl/mem: Expose DCD partition capabilities in sysfs Ira Weiny
2024-12-11 3:42 ` [PATCH v8 08/21] cxl/port: Add endpoint decoder DC mode support to sysfs Ira Weiny
2024-12-11 3:42 ` [PATCH v8 09/21] cxl/region: Add sparse DAX region support Ira Weiny
2024-12-11 3:42 ` [PATCH v8 10/21] cxl/events: Split event msgnum configuration from irq setup Ira Weiny
2024-12-11 3:42 ` [PATCH v8 11/21] cxl/pci: Factor out interrupt policy check Ira Weiny
2024-12-11 3:42 ` [PATCH v8 12/21] cxl/mem: Configure dynamic capacity interrupts Ira Weiny
2024-12-11 3:42 ` [PATCH v8 13/21] cxl/core: Return endpoint decoder information from region search Ira Weiny
2024-12-11 3:42 ` [PATCH v8 14/21] cxl/extent: Process DCD events and realize region extents Ira Weiny
2024-12-11 3:42 ` [PATCH v8 15/21] cxl/region/extent: Expose region extent information in sysfs Ira Weiny
2024-12-11 3:42 ` [PATCH v8 16/21] dax/bus: Factor out dev dax resize logic Ira Weiny
2024-12-11 3:42 ` [PATCH v8 17/21] dax/region: Create resources on sparse DAX regions Ira Weiny
2024-12-11 3:42 ` [PATCH v8 18/21] cxl/region: Read existing extents on region creation Ira Weiny
2024-12-11 3:42 ` [PATCH v8 19/21] cxl/mem: Trace Dynamic capacity Event Record Ira Weiny
2024-12-11 3:42 ` [PATCH v8 20/21] tools/testing/cxl: Make event logs dynamic Ira Weiny
2024-12-11 3:42 ` [PATCH v8 21/21] tools/testing/cxl: Add DC Regions to mock mem data Ira Weiny