[PATCH v3] cxl/pci: Support Global Persistent Flush (GPF)

Linux CXL
 help / color / mirror / Atom feed

* [PATCH v3] cxl/pci: Support Global Persistent Flush (GPF)
@ 2025-01-21  4:25 Davidlohr Bueso
  2025-01-21 22:28 ` Dan Williams
  0 siblings, 1 reply; 4+ messages in thread
From: Davidlohr Bueso @ 2025-01-21  4:25 UTC (permalink / raw)
  To: dave.jiang, dan.j.williams
  Cc: jonathan.cameron, alison.schofield, ira.weiny, vishal.l.verma,
	seven.yi.lee, hch, a.manzanares, fan.ni, anisa.su, dave,
	linux-cxl

Add support for GPF flows. It is found that the CXL specification
around this to be a bit too involved from the driver side. And while
this should really all handled by the hardware, this patch takes
things with a grain of salt.

Upon respective port enumeration, both phase timeouts are set to
a max of 20 seconds, which is the NMI watchdog default for lockup
detection. The premise is that the kernel does not have enough
information to set anything better than a max across the board
and hope devices finish their GPF flows within the platform energy
budget.

Timeout detection is based on dirty Shutdown semantics. The driver
will mark it as dirty, expecting that the device clear it upon a
successful GPF event. The admin may consult the device Health and
check the dirty shutdown counter to see if there was a problem
with data integrity.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
Changes from v2:
 - Remove RFC tag.
 - Setup T2MAX to 20 secs, just like T1 (Dan).
   This simplifies the patch significantly: 1) no longer need to 
   update upon hot-removal, 2) don't need the device max timeouts. 
 - Configure dvsec at port enumeration time, not pci_probe (Dan).
 - Skip RCH.
 - Cosmetic cleanups (Jonathan)

 Documentation/driver-api/cxl/maturity-map.rst |  2 +-
 drivers/cxl/core/mbox.c                       | 18 ++++
 drivers/cxl/core/pci.c                        | 83 +++++++++++++++++++
 drivers/cxl/core/port.c                       |  2 +
 drivers/cxl/cxl.h                             |  2 +
 drivers/cxl/cxlmem.h                          |  5 ++
 drivers/cxl/cxlpci.h                          | 15 ++++
 drivers/cxl/pci.c                             | 21 +++--
 8 files changed, 138 insertions(+), 10 deletions(-)

diff --git a/Documentation/driver-api/cxl/maturity-map.rst b/Documentation/driver-api/cxl/maturity-map.rst
index df8e2ac2a320..99dd2c841e69 100644
--- a/Documentation/driver-api/cxl/maturity-map.rst
+++ b/Documentation/driver-api/cxl/maturity-map.rst
@@ -130,7 +130,7 @@ Mailbox commands
 * [0] Switch CCI
 * [3] Timestamp
 * [1] PMEM labels
-* [0] PMEM GPF / Dirty Shutdown
+* [1] PMEM GPF / Dirty Shutdown
 * [0] Scan Media
 
 PMU
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 548564c770c0..6b023e81832a 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1308,6 +1308,24 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_mem_create_range_info, "CXL");
 
+int cxl_dirty_shutdown_state(struct cxl_memdev_state *mds)
+{
+	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
+	struct cxl_mbox_cmd mbox_cmd;
+	struct cxl_mbox_set_shutdown_state in = {
+		.state = 1
+	};
+
+	mbox_cmd = (struct cxl_mbox_cmd) {
+		.opcode = CXL_MBOX_OP_SET_SHUTDOWN_STATE,
+		.size_in = sizeof(in),
+		.payload_in = &in,
+	};
+
+	return cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dirty_shutdown_state, "CXL");
+
 int cxl_set_timestamp(struct cxl_memdev_state *mds)
 {
 	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index b3aac9964e0d..d2867ea18448 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -1054,3 +1054,86 @@ int cxl_pci_get_bandwidth(struct pci_dev *pdev, struct access_coordinate *c)
 
 	return 0;
 }
+
+/*
+ * Set max timeout such that vendors will optimize GPF flow to avoid
+ * the implied worst-case scenario delays. On a sane platform, all
+ * devices should always complete GPF within the energy budget of
+ * the GPF flow. The kernel does not have enough information to pick
+ * anything better than "maximize timeouts and hope it works".
+ *
+ * A misbehaving device could block forward progress of GPF for all
+ * the other devices, exhausting the energy budget of the platform.
+ * However, the spec seems to assume that moving on from slow to respond
+ * devices is a virtue. It is not possible to know that, in actuality,
+ * the slow to respond device is *the* most critical device in the
+ * system to wait.
+ */
+#define GPF_TIMEOUT_BASE_MAX 2
+#define GPF_TIMEOUT_SCALE_MAX 7 /* 10 seconds */
+
+static int update_gpf_port_dvsec(struct pci_dev *pdev, int dvsec, int phase)
+{
+	u16 ctrl;
+	int rc, offset, base, scale;
+
+	switch (phase) {
+	case 1:
+		offset = CXL_DVSEC_PORT_GPF_PHASE_1_CONTROL_OFFSET;
+		base = CXL_DVSEC_PORT_GPF_PHASE_1_TMO_BASE_MASK;
+		scale = CXL_DVSEC_PORT_GPF_PHASE_1_TMO_SCALE_MASK;
+		break;
+	case 2:
+		offset = CXL_DVSEC_PORT_GPF_PHASE_2_CONTROL_OFFSET;
+		base = CXL_DVSEC_PORT_GPF_PHASE_2_TMO_BASE_MASK;
+		scale = CXL_DVSEC_PORT_GPF_PHASE_2_TMO_SCALE_MASK;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	rc = pci_read_config_word(pdev, dvsec + offset, &ctrl);
+	if (rc)
+		return rc;
+
+	if (FIELD_GET(base, ctrl) == GPF_TIMEOUT_BASE_MAX &&
+	    FIELD_GET(scale, ctrl) == GPF_TIMEOUT_SCALE_MAX)
+		return rc;
+
+	ctrl = FIELD_PREP(base, GPF_TIMEOUT_BASE_MAX);
+	ctrl |= FIELD_PREP(scale, GPF_TIMEOUT_SCALE_MAX);
+
+	rc = pci_write_config_word(pdev, dvsec + offset, ctrl);
+	if (!rc)
+		dev_dbg(&pdev->dev, "Port GPF phase %d timeout: %d0 secs\n",
+			phase, GPF_TIMEOUT_BASE_MAX);
+
+	return rc;
+}
+
+int cxl_setup_gpf_port(struct device *dport_dev)
+{
+	int dvsec;
+	struct pci_dev *pdev;
+
+	if (!dev_is_pci(dport_dev))
+		return 0;
+
+	pdev = to_pci_dev(dport_dev);
+
+	if (is_cxl_restricted(pdev))
+		return 0;
+
+	dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
+					  CXL_DVSEC_PORT_GPF);
+	if (!dvsec) {
+		dev_warn(&pdev->dev, "Port GPF DVSEC not present\n");
+		return -EINVAL;
+	}
+
+	update_gpf_port_dvsec(pdev, dvsec, 1);
+	update_gpf_port_dvsec(pdev, dvsec, 2);
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_setup_gpf_port, "CXL");
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 78a5c2c25982..1ad6d2e05f09 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1653,6 +1653,8 @@ int devm_cxl_enumerate_ports(struct cxl_memdev *cxlmd)
 		dev_dbg(dev, "scan: iter: %s dport_dev: %s parent: %s\n",
 			dev_name(iter), dev_name(dport_dev),
 			dev_name(uport_dev));
+
+		cxl_setup_gpf_port(dport_dev);
 		struct cxl_port *port __free(put_cxl_port) =
 			find_cxl_port(dport_dev, &dport);
 		if (port) {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index fdac3ddb8635..c80c2300dee7 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -912,6 +912,8 @@ void cxl_coordinates_combine(struct access_coordinate *out,
 
 bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);
 
+int cxl_setup_gpf_port(struct device *dport_dev);
+
 /*
  * Unit test builds overrides this to __weak, find the 'strong' version
  * of these symbols in tools/testing/cxl/.
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 2a25d1957ddb..a085374d52d3 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -693,6 +693,10 @@ struct cxl_mbox_set_partition_info {
 
 #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
 
+struct cxl_mbox_set_shutdown_state {
+	u8 state;
+} __packed;
+
 /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
 struct cxl_mbox_set_timestamp_in {
 	__le64 timestamp;
@@ -829,6 +833,7 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 			    enum cxl_event_log_type type,
 			    enum cxl_event_type event_type,
 			    const uuid_t *uuid, union cxl_event *evt);
+int cxl_dirty_shutdown_state(struct cxl_memdev_state *mds);
 int cxl_set_timestamp(struct cxl_memdev_state *mds);
 int cxl_poison_state_init(struct cxl_memdev_state *mds);
 int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 4da07727ab9c..9b229e3b9a9d 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -40,6 +40,12 @@
 
 /* CXL 2.0 8.1.6: GPF DVSEC for CXL Port */
 #define CXL_DVSEC_PORT_GPF					4
+#define   CXL_DVSEC_PORT_GPF_PHASE_1_CONTROL_OFFSET		0x0C
+#define     CXL_DVSEC_PORT_GPF_PHASE_1_TMO_BASE_MASK		GENMASK(3, 0)
+#define     CXL_DVSEC_PORT_GPF_PHASE_1_TMO_SCALE_MASK		GENMASK(11, 8)
+#define   CXL_DVSEC_PORT_GPF_PHASE_2_CONTROL_OFFSET		0xE
+#define     CXL_DVSEC_PORT_GPF_PHASE_2_TMO_BASE_MASK		GENMASK(3, 0)
+#define     CXL_DVSEC_PORT_GPF_PHASE_2_TMO_SCALE_MASK		GENMASK(11, 8)
 
 /* CXL 2.0 8.1.7: GPF DVSEC for CXL Device */
 #define CXL_DVSEC_DEVICE_GPF					5
@@ -121,6 +127,15 @@ static inline bool cxl_pci_flit_256(struct pci_dev *pdev)
 	return lnksta2 & PCI_EXP_LNKSTA2_FLIT;
 }
 
+/*
+ * Assume that any RCIEP that emits the CXL memory expander class code
+ * is an RCD
+ */
+static inline bool is_cxl_restricted(struct pci_dev *pdev)
+{
+	return pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END;
+}
+
 int devm_cxl_port_enumerate_dports(struct cxl_port *port);
 struct cxl_dev_state;
 int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 6d94ff4a4f1a..bad142095279 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -465,15 +465,6 @@ static int cxl_pci_setup_mailbox(struct cxl_memdev_state *mds, bool irq_avail)
 	return 0;
 }
 
-/*
- * Assume that any RCIEP that emits the CXL memory expander class code
- * is an RCD
- */
-static bool is_cxl_restricted(struct pci_dev *pdev)
-{
-	return pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END;
-}
-
 static int cxl_rcrb_get_comp_regs(struct pci_dev *pdev,
 				  struct cxl_register_map *map,
 				  struct cxl_dport *dport)
@@ -1038,6 +1029,18 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (cxl_pci_ras_unmask(pdev))
 		dev_dbg(&pdev->dev, "No RAS reporting unmasked\n");
 
+	/*
+	 * Set dirty shutdown now, with the expectation that the device
+	 * clear it upon a successful GPF flow. The exception to this
+	 * is upon Viral detection, per CXL 3.2 section 12.4.2.
+	 */
+	if (resource_size(&cxlds->pmem_res)) {
+		rc = cxl_dirty_shutdown_state(mds);
+		if (rc)
+			dev_warn(&pdev->dev,
+				 "GPF: could not dirty shutdown state\n");
+	}
+
 	pci_save_state(pdev);
 
 	return rc;
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] cxl/pci: Support Global Persistent Flush (GPF)
  2025-01-21  4:25 [PATCH v3] cxl/pci: Support Global Persistent Flush (GPF) Davidlohr Bueso
@ 2025-01-21 22:28 ` Dan Williams
  2025-01-24 23:19   ` Davidlohr Bueso
  0 siblings, 1 reply; 4+ messages in thread
From: Dan Williams @ 2025-01-21 22:28 UTC (permalink / raw)
  To: Davidlohr Bueso, dave.jiang, dan.j.williams
  Cc: jonathan.cameron, alison.schofield, ira.weiny, vishal.l.verma,
	seven.yi.lee, hch, a.manzanares, fan.ni, anisa.su, dave,
	linux-cxl

Davidlohr Bueso wrote:
> Add support for GPF flows. It is found that the CXL specification
> around this to be a bit too involved from the driver side. And while
> this should really all handled by the hardware, this patch takes
> things with a grain of salt.
> 
> Upon respective port enumeration, both phase timeouts are set to
> a max of 20 seconds, which is the NMI watchdog default for lockup
> detection. The premise is that the kernel does not have enough
> information to set anything better than a max across the board
> and hope devices finish their GPF flows within the platform energy
> budget.
> 
> Timeout detection is based on dirty Shutdown semantics. The driver
> will mark it as dirty, expecting that the device clear it upon a
> successful GPF event. The admin may consult the device Health and
> check the dirty shutdown counter to see if there was a problem
> with data integrity.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
> Changes from v2:
>  - Remove RFC tag.
>  - Setup T2MAX to 20 secs, just like T1 (Dan).
>    This simplifies the patch significantly: 1) no longer need to 
>    update upon hot-removal, 2) don't need the device max timeouts. 
>  - Configure dvsec at port enumeration time, not pci_probe (Dan).
>  - Skip RCH.
>  - Cosmetic cleanups (Jonathan)

Overall looks good, but I do have a non-trivial comment at the bottom
that may require another patch spin.

>  Documentation/driver-api/cxl/maturity-map.rst |  2 +-
>  drivers/cxl/core/mbox.c                       | 18 ++++
>  drivers/cxl/core/pci.c                        | 83 +++++++++++++++++++
>  drivers/cxl/core/port.c                       |  2 +
>  drivers/cxl/cxl.h                             |  2 +
>  drivers/cxl/cxlmem.h                          |  5 ++
>  drivers/cxl/cxlpci.h                          | 15 ++++
>  drivers/cxl/pci.c                             | 21 +++--
>  8 files changed, 138 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/driver-api/cxl/maturity-map.rst b/Documentation/driver-api/cxl/maturity-map.rst
> index df8e2ac2a320..99dd2c841e69 100644
> --- a/Documentation/driver-api/cxl/maturity-map.rst
> +++ b/Documentation/driver-api/cxl/maturity-map.rst
> @@ -130,7 +130,7 @@ Mailbox commands
>  * [0] Switch CCI
>  * [3] Timestamp
>  * [1] PMEM labels
> -* [0] PMEM GPF / Dirty Shutdown
> +* [1] PMEM GPF / Dirty Shutdown
>  * [0] Scan Media
>  
>  PMU
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 548564c770c0..6b023e81832a 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1308,6 +1308,24 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_mem_create_range_info, "CXL");
>  
> +int cxl_dirty_shutdown_state(struct cxl_memdev_state *mds)
> +{
> +	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	struct cxl_mbox_set_shutdown_state in = {
> +		.state = 1
> +	};
> +
> +	mbox_cmd = (struct cxl_mbox_cmd) {
> +		.opcode = CXL_MBOX_OP_SET_SHUTDOWN_STATE,
> +		.size_in = sizeof(in),
> +		.payload_in = &in,
> +	};
> +
> +	return cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dirty_shutdown_state, "CXL");
> +
>  int cxl_set_timestamp(struct cxl_memdev_state *mds)
>  {
>  	struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index b3aac9964e0d..d2867ea18448 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -1054,3 +1054,86 @@ int cxl_pci_get_bandwidth(struct pci_dev *pdev, struct access_coordinate *c)
>  
>  	return 0;
>  }
> +
> +/*
> + * Set max timeout such that vendors will optimize GPF flow to avoid

Maybe:

s/vendors/platforms/

...since it is a concern across a given collection of endpoints, a host,
and its flush energy source?

> + * the implied worst-case scenario delays. On a sane platform, all
> + * devices should always complete GPF within the energy budget of
> + * the GPF flow. The kernel does not have enough information to pick
> + * anything better than "maximize timeouts and hope it works".
> + *
> + * A misbehaving device could block forward progress of GPF for all
> + * the other devices, exhausting the energy budget of the platform.
> + * However, the spec seems to assume that moving on from slow to respond
> + * devices is a virtue. It is not possible to know that, in actuality,
> + * the slow to respond device is *the* most critical device in the
> + * system to wait.
> + */
> +#define GPF_TIMEOUT_BASE_MAX 2
> +#define GPF_TIMEOUT_SCALE_MAX 7 /* 10 seconds */
> +
> +static int update_gpf_port_dvsec(struct pci_dev *pdev, int dvsec, int phase)
> +{
> +	u16 ctrl;
> +	int rc, offset, base, scale;
> +
> +	switch (phase) {
> +	case 1:
> +		offset = CXL_DVSEC_PORT_GPF_PHASE_1_CONTROL_OFFSET;
> +		base = CXL_DVSEC_PORT_GPF_PHASE_1_TMO_BASE_MASK;
> +		scale = CXL_DVSEC_PORT_GPF_PHASE_1_TMO_SCALE_MASK;
> +		break;
> +	case 2:
> +		offset = CXL_DVSEC_PORT_GPF_PHASE_2_CONTROL_OFFSET;
> +		base = CXL_DVSEC_PORT_GPF_PHASE_2_TMO_BASE_MASK;
> +		scale = CXL_DVSEC_PORT_GPF_PHASE_2_TMO_SCALE_MASK;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	rc = pci_read_config_word(pdev, dvsec + offset, &ctrl);
> +	if (rc)
> +		return rc;
> +
> +	if (FIELD_GET(base, ctrl) == GPF_TIMEOUT_BASE_MAX &&
> +	    FIELD_GET(scale, ctrl) == GPF_TIMEOUT_SCALE_MAX)
> +		return rc;
> +
> +	ctrl = FIELD_PREP(base, GPF_TIMEOUT_BASE_MAX);
> +	ctrl |= FIELD_PREP(scale, GPF_TIMEOUT_SCALE_MAX);
> +
> +	rc = pci_write_config_word(pdev, dvsec + offset, ctrl);
> +	if (!rc)
> +		dev_dbg(&pdev->dev, "Port GPF phase %d timeout: %d0 secs\n",
> +			phase, GPF_TIMEOUT_BASE_MAX);
> +
> +	return rc;
> +}
> +
> +int cxl_setup_gpf_port(struct device *dport_dev)
> +{
> +	int dvsec;
> +	struct pci_dev *pdev;
> +
> +	if (!dev_is_pci(dport_dev))
> +		return 0;
> +
> +	pdev = to_pci_dev(dport_dev);
> +
> +	if (is_cxl_restricted(pdev))
> +		return 0;

Given this is called from devm_cxl_enumerate_ports() after it is already
clear that @dport_dev is in a VH topology, no need to worry about
RCH/RCD details here.

> +
> +	dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
> +					  CXL_DVSEC_PORT_GPF);

I think this ok for now, but perhaps this should be enumerated once
per-port and cached in 'struct cxl_port', otherwise this can potentially
be called multiple times for shared ports in deeper VH topologies.

> +	if (!dvsec) {
> +		dev_warn(&pdev->dev, "Port GPF DVSEC not present\n");

Could be pci_warn() to use @pdev directly.

> +		return -EINVAL;
> +	}
> +
> +	update_gpf_port_dvsec(pdev, dvsec, 1);
> +	update_gpf_port_dvsec(pdev, dvsec, 2);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_setup_gpf_port, "CXL");

Drop this export. There is no caller outside of the CXL core itself, and
the fact that this assumes being called from a certain point in
devm_cxl_enumerate_ports() means that it is only a helper, not library
call for other drivers.

> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 78a5c2c25982..1ad6d2e05f09 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1653,6 +1653,8 @@ int devm_cxl_enumerate_ports(struct cxl_memdev *cxlmd)
>  		dev_dbg(dev, "scan: iter: %s dport_dev: %s parent: %s\n",
>  			dev_name(iter), dev_name(dport_dev),
>  			dev_name(uport_dev));
> +
> +		cxl_setup_gpf_port(dport_dev);
>  		struct cxl_port *port __free(put_cxl_port) =
>  			find_cxl_port(dport_dev, &dport);
>  		if (port) {
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index fdac3ddb8635..c80c2300dee7 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -912,6 +912,8 @@ void cxl_coordinates_combine(struct access_coordinate *out,
>  
>  bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);
>  
> +int cxl_setup_gpf_port(struct device *dport_dev);

This can move to drivers/cxl/core/core.h

> +
>  /*
>   * Unit test builds overrides this to __weak, find the 'strong' version
>   * of these symbols in tools/testing/cxl/.
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 2a25d1957ddb..a085374d52d3 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -693,6 +693,10 @@ struct cxl_mbox_set_partition_info {
>  
>  #define  CXL_SET_PARTITION_IMMEDIATE_FLAG	BIT(0)
>  
> +struct cxl_mbox_set_shutdown_state {
> +	u8 state;
> +} __packed;
> +
>  /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
>  struct cxl_mbox_set_timestamp_in {
>  	__le64 timestamp;
> @@ -829,6 +833,7 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
>  			    enum cxl_event_log_type type,
>  			    enum cxl_event_type event_type,
>  			    const uuid_t *uuid, union cxl_event *evt);
> +int cxl_dirty_shutdown_state(struct cxl_memdev_state *mds);
>  int cxl_set_timestamp(struct cxl_memdev_state *mds);
>  int cxl_poison_state_init(struct cxl_memdev_state *mds);
>  int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 4da07727ab9c..9b229e3b9a9d 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -40,6 +40,12 @@
>  
>  /* CXL 2.0 8.1.6: GPF DVSEC for CXL Port */
>  #define CXL_DVSEC_PORT_GPF					4
> +#define   CXL_DVSEC_PORT_GPF_PHASE_1_CONTROL_OFFSET		0x0C
> +#define     CXL_DVSEC_PORT_GPF_PHASE_1_TMO_BASE_MASK		GENMASK(3, 0)
> +#define     CXL_DVSEC_PORT_GPF_PHASE_1_TMO_SCALE_MASK		GENMASK(11, 8)
> +#define   CXL_DVSEC_PORT_GPF_PHASE_2_CONTROL_OFFSET		0xE
> +#define     CXL_DVSEC_PORT_GPF_PHASE_2_TMO_BASE_MASK		GENMASK(3, 0)
> +#define     CXL_DVSEC_PORT_GPF_PHASE_2_TMO_SCALE_MASK		GENMASK(11, 8)
>  
>  /* CXL 2.0 8.1.7: GPF DVSEC for CXL Device */
>  #define CXL_DVSEC_DEVICE_GPF					5
> @@ -121,6 +127,15 @@ static inline bool cxl_pci_flit_256(struct pci_dev *pdev)
>  	return lnksta2 & PCI_EXP_LNKSTA2_FLIT;
>  }
>  
> +/*
> + * Assume that any RCIEP that emits the CXL memory expander class code
> + * is an RCD
> + */
> +static inline bool is_cxl_restricted(struct pci_dev *pdev)
> +{
> +	return pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END;
> +}

Per above, no need to move this since it is not needed by
cxl_setup_gpf_port().

> +
>  int devm_cxl_port_enumerate_dports(struct cxl_port *port);
>  struct cxl_dev_state;
>  int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 6d94ff4a4f1a..bad142095279 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -465,15 +465,6 @@ static int cxl_pci_setup_mailbox(struct cxl_memdev_state *mds, bool irq_avail)
>  	return 0;
>  }
>  
> -/*
> - * Assume that any RCIEP that emits the CXL memory expander class code
> - * is an RCD
> - */
> -static bool is_cxl_restricted(struct pci_dev *pdev)
> -{
> -	return pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END;
> -}
> -
>  static int cxl_rcrb_get_comp_regs(struct pci_dev *pdev,
>  				  struct cxl_register_map *map,
>  				  struct cxl_dport *dport)
> @@ -1038,6 +1029,18 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (cxl_pci_ras_unmask(pdev))
>  		dev_dbg(&pdev->dev, "No RAS reporting unmasked\n");
>  
> +	/*
> +	 * Set dirty shutdown now, with the expectation that the device
> +	 * clear it upon a successful GPF flow. The exception to this
> +	 * is upon Viral detection, per CXL 3.2 section 12.4.2.
> +	 */
> +	if (resource_size(&cxlds->pmem_res)) {
> +		rc = cxl_dirty_shutdown_state(mds);
> +		if (rc)
> +			dev_warn(&pdev->dev,
> +				 "GPF: could not dirty shutdown state\n");

I think the proper place for this call is in cxl_nvdimm_probe(). That
happens after the device has succeeded cxl_mem_probe() and successfully
enumerated pmem capacity, but before the kernel commits to using pmem
through the nvdimm subsystem.

Had shutdown_state been part of cxl_nvdimm_probe() since day one, the
policy could have been: "if cxl_dirty_shutdown_state() fails,
cxl_nvdimm_probe() fails". However, I know that would regress cxl-test
and qemu if not devices in the wild that do not meet that spec mandatory
requirement where Linux has been lenient for a long while now.

The two options are:
1/ Rip the band-aid, fail cxl_nvdimm_probe() and require device
   implementations to catch up

2/ Continue leniency, advertise whether dirty-tracking is enabled in
   sysfs (via cxl_nvdimm_driver dev_groups?) and document that pmem devices
   without dirty-tracking should only be used for debug/development, not
   production.

Perhaps to make progress on this patch, split the GPF settings to its
own patch, and follow-on with one that adds dirty-state tracking
capability to sysfs, likely with the dirty-shutdown count as well.

I.e. a deployment that cares about pmem consistency relative to
power-loss should care both that the kernel is dirtying pmem before
writing to it, and communicating whether any dirty-shutdowns have been
detected by the device. I expect that sysfs patch might be 6.15 material
(especially with the dirty-count additions) where these GPF fixups could
slot in for v6.14, but that is up to Dave.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] cxl/pci: Support Global Persistent Flush (GPF)
  2025-01-21 22:28 ` Dan Williams
@ 2025-01-24 23:19   ` Davidlohr Bueso
  2025-01-25  0:30     ` Dan Williams
  0 siblings, 1 reply; 4+ messages in thread
From: Davidlohr Bueso @ 2025-01-24 23:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: dave.jiang, jonathan.cameron, alison.schofield, ira.weiny,
	vishal.l.verma, seven.yi.lee, hch, a.manzanares, fan.ni, anisa.su,
	linux-cxl

On Tue, 21 Jan 2025, Dan Williams wrote:

>Davidlohr Bueso wrote:
>> +int cxl_setup_gpf_port(struct device *dport_dev)
>> +{
>> +	int dvsec;
>> +	struct pci_dev *pdev;
>> +
>> +	if (!dev_is_pci(dport_dev))
>> +		return 0;
>> +
>> +	pdev = to_pci_dev(dport_dev);
>> +
>> +	if (is_cxl_restricted(pdev))
>> +		return 0;
>
>Given this is called from devm_cxl_enumerate_ports() after it is already
>clear that @dport_dev is in a VH topology, no need to worry about
>RCH/RCD details here.

Yeah I realized that the second after sending this out.

>
>> +
>> +	dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
>> +					  CXL_DVSEC_PORT_GPF);
>
>I think this ok for now, but perhaps this should be enumerated once
>per-port and cached in 'struct cxl_port', otherwise this can potentially
>be called multiple times for shared ports in deeper VH topologies.

Right so I've cached this and moved the whole cxl_setup_gpf_port() call
down into once we find the respective port.

>
>> +	if (!dvsec) {
>> +		dev_warn(&pdev->dev, "Port GPF DVSEC not present\n");
>
>Could be pci_warn() to use @pdev directly.

Sure, but afaict the pci_ flavors aren't used throughout the cxl driver.

...

>> @@ -1038,6 +1029,18 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>>  	if (cxl_pci_ras_unmask(pdev))
>>  		dev_dbg(&pdev->dev, "No RAS reporting unmasked\n");
>>
>> +	/*
>> +	 * Set dirty shutdown now, with the expectation that the device
>> +	 * clear it upon a successful GPF flow. The exception to this
>> +	 * is upon Viral detection, per CXL 3.2 section 12.4.2.
>> +	 */
>> +	if (resource_size(&cxlds->pmem_res)) {
>> +		rc = cxl_dirty_shutdown_state(mds);
>> +		if (rc)
>> +			dev_warn(&pdev->dev,
>> +				 "GPF: could not dirty shutdown state\n");
>
>I think the proper place for this call is in cxl_nvdimm_probe(). That
>happens after the device has succeeded cxl_mem_probe() and successfully
>enumerated pmem capacity, but before the kernel commits to using pmem
>through the nvdimm subsystem.
>
>Had shutdown_state been part of cxl_nvdimm_probe() since day one, the
>policy could have been: "if cxl_dirty_shutdown_state() fails,
>cxl_nvdimm_probe() fails". However, I know that would regress cxl-test
>and qemu if not devices in the wild that do not meet that spec mandatory
>requirement where Linux has been lenient for a long while now.

So qemu has patches out there and I will also send a patch for the mock device.

But yes, I agree with your approach to failing the dirty shutdown.

There is a similar consideration to have with failing to set the port
dvsec (ie: not there); should we fail the enumeration altogether? So far
I have been lenient, and considering that you did not mention this, I *think*
you agree. For users that know they will never need pmem that doesn't matter,
but overall the platform would be considered to be exposed to data corruption
(and afaict gpf dvsecs are mandatory). Perhaps another potential sysfs
file that would alert users of quirky hardware?

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] cxl/pci: Support Global Persistent Flush (GPF)
  2025-01-24 23:19   ` Davidlohr Bueso
@ 2025-01-25  0:30     ` Dan Williams
  0 siblings, 0 replies; 4+ messages in thread
From: Dan Williams @ 2025-01-25  0:30 UTC (permalink / raw)
  To: Davidlohr Bueso, Dan Williams
  Cc: dave.jiang, jonathan.cameron, alison.schofield, ira.weiny,
	vishal.l.verma, seven.yi.lee, hch, a.manzanares, fan.ni, anisa.su,
	linux-cxl

Davidlohr Bueso wrote:
[..]
> >> +	if (!dvsec) {
> >> +		dev_warn(&pdev->dev, "Port GPF DVSEC not present\n");
> >
> >Could be pci_warn() to use @pdev directly.
> 
> Sure, but afaict the pci_ flavors aren't used throughout the cxl driver.

Fair enough, yes, match local style.

> 
> ...
> 
> >> @@ -1038,6 +1029,18 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >>  	if (cxl_pci_ras_unmask(pdev))
> >>  		dev_dbg(&pdev->dev, "No RAS reporting unmasked\n");
> >>
> >> +	/*
> >> +	 * Set dirty shutdown now, with the expectation that the device
> >> +	 * clear it upon a successful GPF flow. The exception to this
> >> +	 * is upon Viral detection, per CXL 3.2 section 12.4.2.
> >> +	 */
> >> +	if (resource_size(&cxlds->pmem_res)) {
> >> +		rc = cxl_dirty_shutdown_state(mds);
> >> +		if (rc)
> >> +			dev_warn(&pdev->dev,
> >> +				 "GPF: could not dirty shutdown state\n");
> >
> >I think the proper place for this call is in cxl_nvdimm_probe(). That
> >happens after the device has succeeded cxl_mem_probe() and successfully
> >enumerated pmem capacity, but before the kernel commits to using pmem
> >through the nvdimm subsystem.
> >
> >Had shutdown_state been part of cxl_nvdimm_probe() since day one, the
> >policy could have been: "if cxl_dirty_shutdown_state() fails,
> >cxl_nvdimm_probe() fails". However, I know that would regress cxl-test
> >and qemu if not devices in the wild that do not meet that spec mandatory
> >requirement where Linux has been lenient for a long while now.
> 
> So qemu has patches out there and I will also send a patch for the mock device.
> 
> But yes, I agree with your approach to failing the dirty shutdown.
> 
> There is a similar consideration to have with failing to set the port
> dvsec (ie: not there); should we fail the enumeration altogether? So far
> I have been lenient, and considering that you did not mention this, I *think*
> you agree. For users that know they will never need pmem that doesn't matter,
> but overall the platform would be considered to be exposed to data corruption
> (and afaict gpf dvsecs are mandatory). Perhaps another potential sysfs
> file that would alert users of quirky hardware?

gpf dvsecs are mandatory for pmem devices, but given older kernels
happily allow operation to continue it feels late to start breaking
setups on new kernel upgrades. Given that the administrator needs to
decide about their dirty-shutdown count policy anyway I think it is ok
to punt that whole policy problem to userspace.

I recall that the acpi_nfit driver exports an Optane dirty shutdown
count attribue in the nvdimm sysfs hierarchy, and there was never a
kernel policy to fail driver load based on dirty-shutdown. So we both
have precedent to not fail here and can eventually copy the same sysfs
precedent for attaching a $provider/dirty_shutdown attribute via the
cxl_dimm_attribute_group.

The rule I am thinking of is:

---
If you do not see a:

/sys/bus/cxl/devices/nvdimm-bridge0/ndbusX/nmemY/cxl/dirty_shutdown

...entry for your device it means dirty-shutdown tracking is disabled
due to:

- lack of kernel support
- lack of device gpf dvsec support
- lack of device Set Shutdown State support
- failure of Set Shutdown State to dirty the device
- failure of Get Health Info to retrieve the count
---

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-01-25  0:31 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-21  4:25 [PATCH v3] cxl/pci: Support Global Persistent Flush (GPF) Davidlohr Bueso
2025-01-21 22:28 ` Dan Williams
2025-01-24 23:19   ` Davidlohr Bueso
2025-01-25  0:30     ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox