From: Ira Weiny <ira.weiny@intel.com>
To: Dave Jiang <dave.jiang@intel.com>, Fan Ni <fan.ni@samsung.com>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Jonathan Corbet <corbet@lwn.net>,
Andrew Morton <akpm@linux-foundation.org>,
Kees Cook <kees@kernel.org>,
"Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>,
Davidlohr Bueso <dave@stgolabs.net>,
Alison Schofield <alison.schofield@intel.com>,
Vishal Verma <vishal.l.verma@intel.com>,
Ira Weiny <ira.weiny@intel.com>,
linux-cxl@vger.kernel.org, linux-doc@vger.kernel.org,
nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-hardening@vger.kernel.org, Li Ming <ming.li@zohomail.com>,
Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Subject: [PATCH v8 00/21] DCD: Add support for Dynamic Capacity Devices (DCD)
Date: Tue, 10 Dec 2024 21:42:15 -0600
Message-ID: <20241210-dcd-type2-upstream-v8-0-812852504400@intel.com>
A git tree of this series can be found here:
https://github.com/weiny2/linux-kernel/tree/dcd-v4-2024-12-10
This version is rebased on the 6.13 cleanups.
Series info
===========
This series has 2 parts:
Patch 1-19: Core DCD support
Patch 20-21: cxl_test support
Background
==========
A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
device that allows memory capacity within a region to change
dynamically without the need for resetting the device, reconfiguring
HDM decoders, or reconfiguring software DAX regions.
One of the biggest use cases for Dynamic Capacity is to allow hosts to
share memory dynamically within a data center without increasing the
per-host attached memory.
The general flow for the addition or removal of memory is to have an
orchestrator coordinate the use of the memory. Generally there are 5
actors in such a system: the Orchestrator, the Fabric Manager (FM), the
logical device, the Host Kernel, and a Host User.
Typical workflows are shown below.
Orchestrator      FM        Device    Host Kernel     Host User
    |             |           |            |              |
    |-------------- Create region ----------------------->|
    |             |           |            |              |
    |             |           |            |<-- Create ---|
    |             |           |            |    Region    |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |-- Add ----->|-- Add --->|--- Add --->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Accept -|<-- Accept -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |<- Create --->|
    |             |           |            |   DAX dev    |-- Use memory
    |             |           |            |              |   |
    |             |           |            |              |   |
    |             |           |            |<- Release ---| <-+
    |             |           |            |   DAX dev    |
    |             |           |            |              |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |-- Remove -->|- Release->|- Release ->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Release-|<- Release -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |              |
    |-- Add ----->|-- Add --->|--- Add --->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Accept -|<-- Accept -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |<- Create ----|
    |             |           |            |   DAX dev    |-- Use memory
    |             |           |            |              |   |
    |             |           |            |<- Release ---| <-+
    |             |           |            |   DAX dev    |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |-- Remove -->|- Release->|- Release ->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Release-|<- Release -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |              |
    |-- Add ----->|-- Add --->|--- Add --->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |<- Create ----|
    |             |           |            |   DAX dev    |-- Use memory
    |             |           |            |              |   |
    |-- Remove -->|- Release->|- Release ->|              |   |
    |  Capacity   |  Extent   |   Extent   |              |   |
    |             |           |            |              |   |
    |             |           |    (Release Ignored)      |   |
    |             |           |            |              |   |
    |             |           |            |<- Release ---| <-+
    |             |           |            |   DAX dev    |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |             |- Release->|- Release ->|              |
    |             |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Release-|<- Release -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |<- Destroy ---|
    |             |           |            |    Region    |
    |             |           |            |              |
Implementation
==============
The series still requires the creation of regions and DAX devices to be
closely synchronized with the Orchestrator and Fabric Manager. The host
kernel will reject extents if a region is not yet created. It also
ignores extent release if memory is in use (DAX device created). These
synchronizations are not anticipated to be an issue with real
applications.
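The synchronization rules above can be sketched as a pair of decision
functions. This is only an illustration of the described behavior, not
code from the series: struct host_state, enum extent_action, and both
function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical host-side state; these names are illustrative only. */
struct host_state {
	bool region_created;	/* a region for the DC partition exists */
	bool dax_dev_created;	/* a DAX device is using the memory */
};

enum extent_action {
	EXTENT_ACCEPT,		/* add: accept the offered extent */
	EXTENT_REJECT,		/* add: no region has been created yet */
	EXTENT_RELEASE,		/* release: hand the extent back */
	EXTENT_IGNORE_RELEASE,	/* release: memory in use, do nothing */
};

/* Add capacity: extents are rejected until a region exists. */
enum extent_action on_add_capacity(const struct host_state *st)
{
	assert(st != NULL);
	return st->region_created ? EXTENT_ACCEPT : EXTENT_REJECT;
}

/* Release capacity: ignored while a DAX device holds the memory. */
enum extent_action on_release_capacity(const struct host_state *st)
{
	assert(st != NULL);
	return st->dax_dev_created ? EXTENT_IGNORE_RELEASE : EXTENT_RELEASE;
}
```

An ignored release is not remembered in this sketch, which matches the
flow diagram: the release has to be issued again once the DAX device is
gone.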
To allow capacity to be added and removed, a new concept of a sparse
DAX region is introduced. A sparse DAX region may have 0 or more bytes
of available space. The total space depends on the number and size of
the extents which have been added.
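As a rough illustration of how the space in a sparse DAX region follows
from its extents, the sketch below sums extent lengths. The struct and
function names are hypothetical and do not match the data structures
used in the patches; extents are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record: a byte range inside the region. */
struct region_extent {
	uint64_t offset;
	uint64_t length;
};

/* Hypothetical sparse region: nothing but the extents added so far. */
struct sparse_region {
	const struct region_extent *extents;
	size_t nr_extents;
};

/* Total space is the sum of extent sizes; 0 when no extents exist. */
uint64_t sparse_region_total_bytes(const struct sparse_region *r)
{
	uint64_t total = 0;

	assert(r != NULL);
	for (size_t i = 0; i < r->nr_extents; i++)
		total += r->extents[i].length;
	return total;
}
```

A region with no extents reports 0 bytes, and each accepted extent
grows the total by its length.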
Initially it is anticipated that users of the memory will carefully
coordinate the surfacing of additional capacity with the creation of DAX
devices which use that capacity. Therefore, the allocation of the
memory to DAX devices does not allow for specific associations between
DAX device and extent. This keeps allocations very similar to existing
DAX region behavior.
To keep the DAX memory allocation aligned with existing DAX devices,
which do not have tags, extents are not allowed to have tags. Future
support for tags is planned.
Great care was taken to keep the extent tracking simple. Some xarrays
needed to be added, but extra software objects were kept to a minimum.
Region extents continue to be tracked as sub-devices of the DAX region.
This ensures that region destruction cleans up all extent allocations
properly.
The major functionality of this series includes:
- Getting the dynamic capacity (DC) configuration information from CXL
  devices
- Configuring the DC partitions reported by hardware
- Enhancing the CXL and DAX regions for dynamic capacity support
    a. Maintain a logical separation between hardware extents and
       software managed region extents. This provides an
       abstraction between the layers and should allow for
       interleaving in the future
- Getting hardware extent lists for endpoint decoders upon
  region creation
- Adjusting the extent/region memory available on the following events
    a. Add capacity events
    b. Release capacity events
- Host response for add capacity
    a. Do not accept the extent if the region does not exist
       or an error occurs realizing the extent
    b. If the region does exist,
       realize a DAX region extent with 1:1 mapping (no
       interleave yet)
    c. Support the event more bit by processing a list of extents
       marked with the more bit together before setting up a
       response
- Host response for remove capacity
    a. If no DAX device references the extent, release the extent
    b. If a reference does exist, ignore the request
       (require the FM to issue the release again)
- Modifying DAX device creation/resize to account for extents within a
  sparse DAX region
- Tracing Dynamic Capacity events for debugging
- Adding cxl-test infrastructure to allow for faster unit testing
  (see the new ndctl branch for the cxl-dcd.sh test[1])
- Only supporting 0 value extent tags
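The more bit handling in the add capacity response can be pictured as
grouping consecutive event records. The sketch below is an assumption
about the general shape of that logic, not the code in this series:
struct add_event and group_by_more_bit() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical add capacity event record. */
struct add_event {
	uint64_t dpa;	/* device physical address of the extent */
	uint64_t len;	/* extent length in bytes */
	bool more;	/* set when further extents belong to this group */
};

/*
 * Walk the records and close a group at each record without the more
 * bit. Each closed group stands for one response to the device. The
 * size of each group is stored in group_sizes[] (up to max_groups
 * entries) and the number of closed groups is returned. A trailing run
 * that never clears the more bit stays pending and is not counted.
 */
size_t group_by_more_bit(const struct add_event *ev, size_t n,
			 size_t *group_sizes, size_t max_groups)
{
	size_t groups = 0, pending = 0;

	assert(ev != NULL || n == 0);
	assert(group_sizes != NULL || max_groups == 0);
	for (size_t i = 0; i < n; i++) {
		pending++;
		if (ev[i].more)
			continue;	/* keep accumulating extents */
		if (groups < max_groups)
			group_sizes[groups] = pending;
		groups++;
		pending = 0;
	}
	return groups;
}
```

Processing the whole group before responding lets the host accept or
reject the extents together instead of answering each record as it
arrives.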
Fan Ni's upstream QEMU DCD support was used for testing.
Remaining work:
1) Allow mapping to specific extents (perhaps based on
label/tag)
1a) devise region size reporting based on tags
2) Interleave support
Possible additional work depending on requirements:
1) Accept a new extent which extends (but overlaps) one or more
existing extents
2) Release extents when DAX devices are released if a release
was previously seen from the device
3) Rework DAX device interfaces, memfd has been explored a bit
[1] https://github.com/weiny2/ndctl/tree/dcd-region2-2024-12-11
---
Changes in v8:
- iweiny: rebase off of 6.13
- iweiny: Use %pra which landed in 6.13
- Link to v7: https://patch.msgid.link/20241107-dcd-type2-upstream-v7-0-56a84e66bc36@intel.com
---
Ira Weiny (21):
cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
cxl/mem: Read dynamic capacity configuration from the device
cxl/core: Separate region mode from decoder mode
cxl/region: Add dynamic capacity decoder and region modes
cxl/hdm: Add dynamic capacity size support to endpoint decoders
cxl/cdat: Gather DSMAS data for DCD regions
cxl/mem: Expose DCD partition capabilities in sysfs
cxl/port: Add endpoint decoder DC mode support to sysfs
cxl/region: Add sparse DAX region support
cxl/events: Split event msgnum configuration from irq setup
cxl/pci: Factor out interrupt policy check
cxl/mem: Configure dynamic capacity interrupts
cxl/core: Return endpoint decoder information from region search
cxl/extent: Process DCD events and realize region extents
cxl/region/extent: Expose region extent information in sysfs
dax/bus: Factor out dev dax resize logic
dax/region: Create resources on sparse DAX regions
cxl/region: Read existing extents on region creation
cxl/mem: Trace Dynamic capacity Event Record
tools/testing/cxl: Make event logs dynamic
tools/testing/cxl: Add DC Regions to mock mem data
Documentation/ABI/testing/sysfs-bus-cxl | 125 +++-
drivers/cxl/core/Makefile | 2 +-
drivers/cxl/core/cdat.c | 42 +-
drivers/cxl/core/core.h | 34 +-
drivers/cxl/core/extent.c | 494 +++++++++++++++
drivers/cxl/core/hdm.c | 210 ++++++-
drivers/cxl/core/mbox.c | 603 +++++++++++++++++-
drivers/cxl/core/memdev.c | 128 +++-
drivers/cxl/core/port.c | 19 +-
drivers/cxl/core/region.c | 165 ++++-
drivers/cxl/core/trace.h | 65 ++
drivers/cxl/cxl.h | 122 +++-
drivers/cxl/cxlmem.h | 132 +++-
drivers/cxl/pci.c | 116 +++-
drivers/dax/bus.c | 356 +++++++++--
drivers/dax/bus.h | 4 +-
drivers/dax/cxl.c | 71 ++-
drivers/dax/dax-private.h | 40 ++
drivers/dax/hmem/hmem.c | 2 +-
drivers/dax/pmem.c | 2 +-
include/cxl/event.h | 32 +
include/linux/ioport.h | 3 +
tools/testing/cxl/Kbuild | 3 +-
tools/testing/cxl/test/mem.c | 1019 +++++++++++++++++++++++++++----
24 files changed, 3499 insertions(+), 290 deletions(-)
---
base-commit: 7cb1b466315004af98f6ba6c2546bb713ca3c237
change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
Best regards,
--
Ira Weiny <ira.weiny@intel.com>