From: Dan Williams <dan.j.williams@intel.com>
To: Davidlohr Bueso <dave@stgolabs.net>, <dave.jiang@intel.com>,
<dan.j.williams@intel.com>
Cc: <jonathan.cameron@huawei.com>, <alison.schofield@intel.com>,
<ira.weiny@intel.com>, <vishal.l.verma@intel.com>,
<seven.yi.lee@gmail.com>, <hch@infradead.org>,
<a.manzanares@samsung.com>, <fan.ni@samsung.com>,
<anisa.su@samsung.com>, <dave@stgolabs.net>,
<linux-cxl@vger.kernel.org>
Subject: Re: [PATCH v3] cxl/pci: Support Global Persistent Flush (GPF)
Date: Tue, 21 Jan 2025 14:28:39 -0800
Message-ID: <67901f97e604d_20fa29489@dwillia2-xfh.jf.intel.com.notmuch>
In-Reply-To: <20250121042531.776377-1-dave@stgolabs.net>

Davidlohr Bueso wrote:
> Add support for GPF flows. The CXL specification around this is
> found to be a bit too involved from the driver side. And while
> this should really all be handled by the hardware, this patch takes
> things with a grain of salt.
>
> Upon respective port enumeration, both phase timeouts are set to
> a max of 20 seconds, which is the NMI watchdog default for lockup
> detection. The premise is that the kernel does not have enough
> information to set anything better than a max across the board
> and hope devices finish their GPF flows within the platform energy
> budget.
>
> Timeout detection is based on Dirty Shutdown semantics. The driver
> will mark the shutdown state dirty, expecting that the device clear
> it upon a successful GPF event. The admin may consult the device
> Health Info and check the dirty shutdown counter to see if there
> was a problem with data integrity.
>
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
> Changes from v2:
> - Remove RFC tag.
> - Setup T2MAX to 20 secs, just like T1 (Dan).
> This simplifies the patch significantly: 1) no longer need to
> update upon hot-removal, 2) don't need the device max timeouts.
> - Configure dvsec at port enumeration time, not pci_probe (Dan).
> - Skip RCH.
> - Cosmetic cleanups (Jonathan)
Overall looks good, but I do have a non-trivial comment at the bottom
that may require another patch spin.
> Documentation/driver-api/cxl/maturity-map.rst | 2 +-
> drivers/cxl/core/mbox.c | 18 ++++
> drivers/cxl/core/pci.c | 83 +++++++++++++++++++
> drivers/cxl/core/port.c | 2 +
> drivers/cxl/cxl.h | 2 +
> drivers/cxl/cxlmem.h | 5 ++
> drivers/cxl/cxlpci.h | 15 ++++
> drivers/cxl/pci.c | 21 +++--
> 8 files changed, 138 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/driver-api/cxl/maturity-map.rst b/Documentation/driver-api/cxl/maturity-map.rst
> index df8e2ac2a320..99dd2c841e69 100644
> --- a/Documentation/driver-api/cxl/maturity-map.rst
> +++ b/Documentation/driver-api/cxl/maturity-map.rst
> @@ -130,7 +130,7 @@ Mailbox commands
> * [0] Switch CCI
> * [3] Timestamp
> * [1] PMEM labels
> -* [0] PMEM GPF / Dirty Shutdown
> +* [1] PMEM GPF / Dirty Shutdown
> * [0] Scan Media
>
> PMU
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 548564c770c0..6b023e81832a 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1308,6 +1308,24 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_mem_create_range_info, "CXL");
>
> +int cxl_dirty_shutdown_state(struct cxl_memdev_state *mds)
> +{
> + struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> + struct cxl_mbox_cmd mbox_cmd;
> + struct cxl_mbox_set_shutdown_state in = {
> + .state = 1
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_SET_SHUTDOWN_STATE,
> + .size_in = sizeof(in),
> + .payload_in = &in,
> + };
> +
> + return cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dirty_shutdown_state, "CXL");
> +
> int cxl_set_timestamp(struct cxl_memdev_state *mds)
> {
> struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index b3aac9964e0d..d2867ea18448 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -1054,3 +1054,86 @@ int cxl_pci_get_bandwidth(struct pci_dev *pdev, struct access_coordinate *c)
>
> return 0;
> }
> +
> +/*
> + * Set max timeout such that vendors will optimize GPF flow to avoid
Maybe:
s/vendors/platforms/
...since it is a concern across a given collection of endpoints, a host,
and its flush energy source?
> + * the implied worst-case scenario delays. On a sane platform, all
> + * devices should always complete GPF within the energy budget of
> + * the GPF flow. The kernel does not have enough information to pick
> + * anything better than "maximize timeouts and hope it works".
> + *
> + * A misbehaving device could block forward progress of GPF for all
> + * the other devices, exhausting the energy budget of the platform.
> + * However, the spec seems to assume that moving on from slow to respond
> + * devices is a virtue. It is not possible to know that, in actuality,
> + * the slow to respond device is *the* most critical device in the
> + * system to wait for.
> + */
> +#define GPF_TIMEOUT_BASE_MAX 2
> +#define GPF_TIMEOUT_SCALE_MAX 7 /* 10 seconds */
> +
> +static int update_gpf_port_dvsec(struct pci_dev *pdev, int dvsec, int phase)
> +{
> + u16 ctrl;
> + int rc, offset, base, scale;
> +
> + switch (phase) {
> + case 1:
> + offset = CXL_DVSEC_PORT_GPF_PHASE_1_CONTROL_OFFSET;
> + base = CXL_DVSEC_PORT_GPF_PHASE_1_TMO_BASE_MASK;
> + scale = CXL_DVSEC_PORT_GPF_PHASE_1_TMO_SCALE_MASK;
> + break;
> + case 2:
> + offset = CXL_DVSEC_PORT_GPF_PHASE_2_CONTROL_OFFSET;
> + base = CXL_DVSEC_PORT_GPF_PHASE_2_TMO_BASE_MASK;
> + scale = CXL_DVSEC_PORT_GPF_PHASE_2_TMO_SCALE_MASK;
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + rc = pci_read_config_word(pdev, dvsec + offset, &ctrl);
> + if (rc)
> + return rc;
> +
> + if (FIELD_GET(base, ctrl) == GPF_TIMEOUT_BASE_MAX &&
> + FIELD_GET(scale, ctrl) == GPF_TIMEOUT_SCALE_MAX)
> + return rc;
> +
> + ctrl = FIELD_PREP(base, GPF_TIMEOUT_BASE_MAX);
> + ctrl |= FIELD_PREP(scale, GPF_TIMEOUT_SCALE_MAX);
> +
> + rc = pci_write_config_word(pdev, dvsec + offset, ctrl);
> + if (!rc)
> + dev_dbg(&pdev->dev, "Port GPF phase %d timeout: %d0 secs\n",
> + phase, GPF_TIMEOUT_BASE_MAX);
> +
> + return rc;
> +}
> +
> +int cxl_setup_gpf_port(struct device *dport_dev)
> +{
> + int dvsec;
> + struct pci_dev *pdev;
> +
> + if (!dev_is_pci(dport_dev))
> + return 0;
> +
> + pdev = to_pci_dev(dport_dev);
> +
> + if (is_cxl_restricted(pdev))
> + return 0;
Given this is called from devm_cxl_enumerate_ports() after it is already
clear that @dport_dev is in a VH topology, no need to worry about
RCH/RCD details here.
> +
> + dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
> + CXL_DVSEC_PORT_GPF);
I think this ok for now, but perhaps this should be enumerated once
per-port and cached in 'struct cxl_port', otherwise this can potentially
be called multiple times for shared ports in deeper VH topologies.
> + if (!dvsec) {
> + dev_warn(&pdev->dev, "Port GPF DVSEC not present\n");
Could be pci_warn() to use @pdev directly.
> + return -EINVAL;
> + }
> +
> + update_gpf_port_dvsec(pdev, dvsec, 1);
> + update_gpf_port_dvsec(pdev, dvsec, 2);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_setup_gpf_port, "CXL");
Drop this export. There is no caller outside of the CXL core itself, and
the fact that this assumes being called from a certain point in
devm_cxl_enumerate_ports() means that it is only a helper, not a
library call for other drivers.
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 78a5c2c25982..1ad6d2e05f09 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1653,6 +1653,8 @@ int devm_cxl_enumerate_ports(struct cxl_memdev *cxlmd)
> dev_dbg(dev, "scan: iter: %s dport_dev: %s parent: %s\n",
> dev_name(iter), dev_name(dport_dev),
> dev_name(uport_dev));
> +
> + cxl_setup_gpf_port(dport_dev);
> struct cxl_port *port __free(put_cxl_port) =
> find_cxl_port(dport_dev, &dport);
> if (port) {
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index fdac3ddb8635..c80c2300dee7 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -912,6 +912,8 @@ void cxl_coordinates_combine(struct access_coordinate *out,
>
> bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);
>
> +int cxl_setup_gpf_port(struct device *dport_dev);
This can move to drivers/cxl/core/core.h
> +
> /*
> * Unit test builds overrides this to __weak, find the 'strong' version
> * of these symbols in tools/testing/cxl/.
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 2a25d1957ddb..a085374d52d3 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -693,6 +693,10 @@ struct cxl_mbox_set_partition_info {
>
> #define CXL_SET_PARTITION_IMMEDIATE_FLAG BIT(0)
>
> +struct cxl_mbox_set_shutdown_state {
> + u8 state;
> +} __packed;
> +
> /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> struct cxl_mbox_set_timestamp_in {
> __le64 timestamp;
> @@ -829,6 +833,7 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> enum cxl_event_log_type type,
> enum cxl_event_type event_type,
> const uuid_t *uuid, union cxl_event *evt);
> +int cxl_dirty_shutdown_state(struct cxl_memdev_state *mds);
> int cxl_set_timestamp(struct cxl_memdev_state *mds);
> int cxl_poison_state_init(struct cxl_memdev_state *mds);
> int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 4da07727ab9c..9b229e3b9a9d 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -40,6 +40,12 @@
>
> /* CXL 2.0 8.1.6: GPF DVSEC for CXL Port */
> #define CXL_DVSEC_PORT_GPF 4
> +#define CXL_DVSEC_PORT_GPF_PHASE_1_CONTROL_OFFSET 0x0C
> +#define CXL_DVSEC_PORT_GPF_PHASE_1_TMO_BASE_MASK GENMASK(3, 0)
> +#define CXL_DVSEC_PORT_GPF_PHASE_1_TMO_SCALE_MASK GENMASK(11, 8)
> +#define CXL_DVSEC_PORT_GPF_PHASE_2_CONTROL_OFFSET 0xE
> +#define CXL_DVSEC_PORT_GPF_PHASE_2_TMO_BASE_MASK GENMASK(3, 0)
> +#define CXL_DVSEC_PORT_GPF_PHASE_2_TMO_SCALE_MASK GENMASK(11, 8)
>
> /* CXL 2.0 8.1.7: GPF DVSEC for CXL Device */
> #define CXL_DVSEC_DEVICE_GPF 5
> @@ -121,6 +127,15 @@ static inline bool cxl_pci_flit_256(struct pci_dev *pdev)
> return lnksta2 & PCI_EXP_LNKSTA2_FLIT;
> }
>
> +/*
> + * Assume that any RCIEP that emits the CXL memory expander class code
> + * is an RCD
> + */
> +static inline bool is_cxl_restricted(struct pci_dev *pdev)
> +{
> + return pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END;
> +}
Per above, no need to move this since it is not needed by
cxl_setup_gpf_port().
> +
> int devm_cxl_port_enumerate_dports(struct cxl_port *port);
> struct cxl_dev_state;
> int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 6d94ff4a4f1a..bad142095279 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -465,15 +465,6 @@ static int cxl_pci_setup_mailbox(struct cxl_memdev_state *mds, bool irq_avail)
> return 0;
> }
>
> -/*
> - * Assume that any RCIEP that emits the CXL memory expander class code
> - * is an RCD
> - */
> -static bool is_cxl_restricted(struct pci_dev *pdev)
> -{
> - return pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END;
> -}
> -
> static int cxl_rcrb_get_comp_regs(struct pci_dev *pdev,
> struct cxl_register_map *map,
> struct cxl_dport *dport)
> @@ -1038,6 +1029,18 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> if (cxl_pci_ras_unmask(pdev))
> dev_dbg(&pdev->dev, "No RAS reporting unmasked\n");
>
> + /*
> + * Set dirty shutdown now, with the expectation that the device
> + * clear it upon a successful GPF flow. The exception to this
> + * is upon Viral detection, per CXL 3.2 section 12.4.2.
> + */
> + if (resource_size(&cxlds->pmem_res)) {
> + rc = cxl_dirty_shutdown_state(mds);
> + if (rc)
> + dev_warn(&pdev->dev,
> + "GPF: could not dirty shutdown state\n");
I think the proper place for this call is in cxl_nvdimm_probe(). That
happens after the device has succeeded cxl_mem_probe() and successfully
enumerated pmem capacity, but before the kernel commits to using pmem
through the nvdimm subsystem.
Had shutdown_state been part of cxl_nvdimm_probe() since day one, the
policy could have been: "if cxl_dirty_shutdown_state() fails,
cxl_nvdimm_probe() fails". However, I know that would regress cxl-test
and qemu, if not devices in the wild, that do not meet that mandatory
spec requirement; Linux has been lenient here for a long while now.
The two options are:
1/ Rip the band-aid, fail cxl_nvdimm_probe() and require device
implementations to catch up
2/ Continue leniency, advertise whether dirty-tracking is enabled in
sysfs (via cxl_nvdimm_driver dev_groups?) and document that pmem devices
without dirty-tracking should only be used for debug/development, not
production.
Perhaps, to make progress on this patch, split the GPF settings into
their own patch, and follow on with one that adds dirty-state tracking
capability to sysfs, likely with the dirty-shutdown count as well.
I.e. a deployment that cares about pmem consistency relative to
power-loss should care both that the kernel is dirtying pmem before
writing to it, and that any dirty-shutdowns detected by the device are
communicated. I expect that sysfs patch might be 6.15 material
(especially with the dirty-count additions) while these GPF fixups
could slot in for v6.14, but that is up to Dave.
Thread overview: 4+ messages
2025-01-21 4:25 [PATCH v3] cxl/pci: Support Global Persistent Flush (GPF) Davidlohr Bueso
2025-01-21 22:28 ` Dan Williams [this message]
2025-01-24 23:19 ` Davidlohr Bueso
2025-01-25 0:30 ` Dan Williams