public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: "Bowman, Terry" <terry.bowman@amd.com>
To: Dan Williams <dan.j.williams@intel.com>,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, nifan.cxl@gmail.com,
	dave@stgolabs.net, jonathan.cameron@huawei.com,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, bhelgaas@google.com,
	mahesh@linux.ibm.com, ira.weiny@intel.com, oohall@gmail.com,
	Benjamin.Cheatham@amd.com, rrichter@amd.com,
	nathan.fontenot@amd.com, Smita.KoralahalliChannabasappa@amd.com,
	lukas@wunner.de, ming.li@zohomail.com,
	PradeepVineshReddy.Kodamati@amd.com
Subject: Re: [PATCH v7 07/17] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers
Date: Thu, 13 Feb 2025 09:43:40 -0600	[thread overview]
Message-ID: <609a02bb-3271-4021-9499-8b281a959f62@amd.com> (raw)
In-Reply-To: <67abf81f4617b_2d1e2946a@dwillia2-xfh.jf.intel.com.notmuch>



On 2/11/2025 7:23 PM, Dan Williams wrote:
> Terry Bowman wrote:
>> The CXL mem driver (cxl_mem) currently maps and caches a pointer to RAS
>> registers for the endpoint's Root Port. The same needs to be done for
>> each of the CXL Downstream Switch Ports and CXL Root Ports found between
>> the endpoint and CXL Host Bridge.
>>
>> Introduce cxl_init_ep_ports_aer() to be called for each CXL Port in the
>> sub-topology between the endpoint and the CXL Host Bridge. This function
>> will determine if there are CXL Downstream Switch Ports or CXL Root Ports
>> associated with this Port. The same check will be added in the future for
>> upstream switch ports.
>>
>> Move the RAS register map logic from cxl_dport_map_ras() into
>> cxl_dport_init_ras_reporting(). This eliminates the need for the helper
>> function, cxl_dport_map_ras().
> Not sure about the motivation here...
>
>> cxl_init_ep_ports_aer() calls cxl_dport_init_ras_reporting() to map
>> the RAS registers for CXL Downstream Switch Ports and CXL Root Ports.
> Ok, makes sense...
>
>> cxl_dport_init_ras_reporting() must check for previously mapped registers
>> before mapping. This is required because multiple Endpoints under a CXL
>> switch may share an upstream CXL Root Port, CXL Downstream Switch Port,
>> or CXL Downstream Switch Port. Ensure the RAS registers are only mapped
>> once.
> Sounds broken. Every device upstream-port only has one downstream port.
>
> A CXL switch config looks like this:
>
>            │             
> ┌──────────┼────────────┐
> │SWITCH   ┌┴─┐          │
> │         │UP│          │
> │         └─┬┘          │
> │    ┌──────┼─────┐     │
> │    │      │     │     │
> │   ┌┴─┐  ┌─┴┐  ┌─┴┐    │
> │   │DP│  │DP│  │DP│    │
> │   └┬─┘  └─┬┘  └─┬┘    │
> └────┼──────┼─────┼─────┘
>     ┌┴─┐  ┌─┴┐  ┌─┴┐     
>     │EP│  │EP│  │EP│     
>     └──┘  └──┘  └──┘     
>
> ...so how can an endpoint ever find that its immediate parent downstream
> port has already been mapped?


            ┌─┴─┐
            │RP1│
            └─┬─┘
  ┌───────────┼───────────┐
  │SWITCH   ┌─┴─┐         │
  │         │UP1│         │   RP1 - 0c:00.0
  │         └─┬─┘         │   UP1 - 0d:00.0
  │    ┌──────┼─────┐     │   DP1 - 0e:00.0
  │    │      │     │     │
  │  ┌─┴─┐  ┌─┴─┐ ┌─┴─┐   │
  │  │DP1│  │DP2│ │DP3│   │
  │  └─┬─┘  └─┬─┘ └─┬─┘   │
  └────┼──────┼─────┼─────┘
     ┌─┴─┐  ┌─┴─┐ ┌─┴─┐
     │EP1│  │EP2│ │EP3│
     └───┘  └───┘ └───┘


It cant but the root RP and USP have duplicate calls for each EP in the example diagram.
The function's purpose is to map RAS registers and cache the address. This reuses the
same function for RP and DSP. The DSP will never be previously mapped as you indicated.

>> Introduce a mutex for synchronizing accesses to the cached RAS mapping.
> I suspect the motivation for the lock and "previously mapped" check was
> due to noticing that the ras registers are not being unmapped, but
> that's due to a devm bug below.
The synchronization was added as result of review recommendation because it is
a racy area. It's possible that multiple endpoints using the same switch could
call this function from devm_cxl_add_endpoints()->cxl_init_ep_ports().
> Even if it were the case that multiple resources need to share 1 devm
> mapping, that would need to look something like the logic around
> cxl_detach_ep(). In that arrangement, the first endpoint in the door
> sets up the 'struct cxl_port' and its 'struct cxl_dport' instances, and
> the last endpoint out the door tears it all down and turns off the
> lights.
>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Gregory Price <gourry@gourry.net>
>> ---
>>  drivers/cxl/core/pci.c | 42 ++++++++++++++++++++----------------------
>>  drivers/cxl/cxl.h      |  6 ++----
>>  drivers/cxl/mem.c      | 31 +++++++++++++++++++++++++++++--
>>  3 files changed, 51 insertions(+), 28 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index a5c65f79db18..143c853a52c4 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -24,6 +24,8 @@ static unsigned short media_ready_timeout = 60;
>>  module_param(media_ready_timeout, ushort, 0644);
>>  MODULE_PARM_DESC(media_ready_timeout, "seconds to wait for media ready");
>>  
>> +static DEFINE_MUTEX(ras_init_mutex);
>> +
>>  struct cxl_walk_context {
>>  	struct pci_bus *bus;
>>  	struct cxl_port *port;
>> @@ -749,18 +751,6 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
>>  	}
>>  }
>>  
>> -static void cxl_dport_map_ras(struct cxl_dport *dport)
>> -{
>> -	struct cxl_register_map *map = &dport->reg_map;
>> -	struct device *dev = dport->dport_dev;
>> -
>> -	if (!map->component_map.ras.valid)
>> -		dev_dbg(dev, "RAS registers not found\n");
>> -	else if (cxl_map_component_regs(map, &dport->regs.component,
>> -					BIT(CXL_CM_CAP_CAP_ID_RAS)))
>> -		dev_dbg(dev, "Failed to map RAS capability.\n");
>> -}
>> -
>>  static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>  {
>>  	void __iomem *aer_base = dport->regs.dport_aer;
>> @@ -788,22 +778,30 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>  /**
>>   * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>>   * @dport: the cxl_dport that needs to be initialized
>> - * @host: host device for devm operations
>>   */
>> -void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>> +void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>>  {
>> -	dport->reg_map.host = host;
>> -	cxl_dport_map_ras(dport);
>> -
>> -	if (dport->rch) {
>> -		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
>> -
>> -		if (!host_bridge->native_aer)
>> -			return;
>> +	struct device *dport_dev = dport->dport_dev;
>> +	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
>>  
>> +	dport->reg_map.host = dport_dev;
> This seems to be confused about how devm works. @host is passed in
> because the cxl_memdev instance being probed in cxl_mem_probe() is doing
> setup work on behalf of @dport_dev.
>
> When the cxl_memdev goes through a ->remove() event, unbind from
> cxl_mem, it tears down that mapping.
>
> However, when using @dport_dev as the devm host, that mapping will not
> be torn down until either the @dport_dev goes through a ->remove() event
> or the device is unregistered altogether. There is no CXL subsystem
> coordination with a driver for @dport_dev. The PCIe portdrv might have
> an interest in it, but CXL can not depend on portdrv to map CXL
> registers or keep the device bound while CXL has an interest those
> registers. The devres_release_all() triggered by a
> "device_del(@dport_dev)" is also uncoordinated with any CXL interest. In
> general, it is a devm anti-pattern to depend on a device_del() event to
> trigger devres_release_all().
>
>
>> +	if (dport->rch && host_bridge->native_aer) {
>>  		cxl_dport_map_rch_aer(dport);
>>  		cxl_disable_rch_root_ints(dport);
>>  	}
>> +
>> +	/* dport may have more than 1 downstream EP. Check if already mapped. */
>> +	mutex_lock(&ras_init_mutex);
> I suspect this lock and check got added to workaround "Failed to request
> region" messages coming out of devm_cxl_iomap_block() in testing? Per
> above, that's not "more than 1 downstream EPi", that's "failure to clean
> up the last mapping for the next cxl_mem_probe() event of the same
> endpoint".
Synchronization was added to handle the concurrent accesses. I never observed
issues due to the race condition for RP and USP but I confirmed through further
testing it is a real potential issue for the RP and USP.

You recommended, in the next patch, to map USP RAS registers from cxl_endpoint_port_probe().
Would you like the RP and DSP mapping to be called from cxl_endpoint_port_probe() as well? Terry


  reply	other threads:[~2025-02-13 15:43 UTC|newest]

Thread overview: 94+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-11 19:24 [PATCH v7 00/17] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2025-02-11 19:24 ` [PATCH v7 01/17] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
2025-02-11 20:25   ` Bjorn Helgaas
2025-02-11 20:42   ` Dan Williams
2025-02-11 19:24 ` [PATCH v7 02/17] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port support Terry Bowman
2025-02-11 20:26   ` Bjorn Helgaas
2025-02-11 20:44   ` Dan Williams
2025-02-11 19:24 ` [PATCH v7 03/17] CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
2025-02-11 20:28   ` Bjorn Helgaas
2025-02-11 20:38     ` Bowman, Terry
2025-02-11 22:33   ` Dan Williams
2025-02-12 19:07     ` Bowman, Terry
2025-02-12 19:51       ` Dan Williams
2025-03-04 19:11   ` Ira Weiny
2025-02-11 19:24 ` [PATCH v7 04/17] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
2025-02-11 20:28   ` Bjorn Helgaas
2025-02-11 23:47   ` Dan Williams
2025-02-12 19:15     ` Bowman, Terry
2025-02-12 19:57       ` Dan Williams
2025-02-12 21:08         ` Bowman, Terry
2025-02-12 21:17           ` Lukas Wunner
2025-02-11 19:24 ` [PATCH v7 05/17] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver Terry Bowman
2025-02-11 20:28   ` Bjorn Helgaas
2025-02-11 23:58   ` Dan Williams
2025-02-12 21:52     ` Bowman, Terry
2025-02-11 19:24 ` [PATCH v7 06/17] PCI/AER: Add CXL PCIe Port uncorrectable error recovery " Terry Bowman
2025-02-11 20:29   ` Bjorn Helgaas
2025-02-11 21:59   ` Dave Jiang
2025-02-12  0:02   ` Gregory Price
2025-02-12  0:24   ` Dan Williams
2025-02-14 19:36     ` Bowman, Terry
2025-02-14 15:11   ` Jonathan Cameron
2025-02-18 15:43     ` Bowman, Terry
2025-02-14 17:36   ` Fan Ni
2025-02-11 19:24 ` [PATCH v7 07/17] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers Terry Bowman
2025-02-11 22:40   ` Dave Jiang
2025-02-12  1:23   ` Dan Williams
2025-02-13 15:43     ` Bowman, Terry [this message]
2025-02-14 21:24       ` Dan Williams
2025-02-14 22:23         ` Bowman, Terry
2025-02-14 22:42           ` Dan Williams
2025-02-12 22:28   ` Alison Schofield
2025-02-12 22:37     ` Bowman, Terry
2025-02-11 19:24 ` [PATCH v7 08/17] cxl/pci: Map CXL PCIe Upstream " Terry Bowman
2025-02-11 23:02   ` Dave Jiang
2025-02-12  2:00   ` Dan Williams
2025-02-14 19:46     ` Bowman, Terry
2025-02-14 21:29       ` Dan Williams
2025-02-14 15:15   ` Jonathan Cameron
2025-02-14 19:50     ` Bowman, Terry
2025-02-11 19:24 ` [PATCH v7 09/17] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
2025-02-11 23:26   ` Dave Jiang
2025-02-14 15:19   ` Jonathan Cameron
2025-02-11 19:24 ` [PATCH v7 10/17] cxl/pci: Add log message and add type check in existing RAS handlers Terry Bowman
2025-02-11 23:28   ` Dave Jiang
2025-02-12 22:59   ` Dan Williams
2025-02-13  0:08     ` Bowman, Terry
2025-02-14 15:28       ` Jonathan Cameron
2025-02-11 19:24 ` [PATCH v7 11/17] cxl/pci: Change find_cxl_port() to non-static Terry Bowman
2025-02-11 23:42   ` Dave Jiang
2025-02-13 23:15   ` Dan Williams
2025-02-11 19:24 ` [PATCH v7 12/17] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
2025-02-12  0:11   ` Dave Jiang
2025-02-12 16:34     ` Bowman, Terry
2025-02-14  2:18   ` Dan Williams
2025-02-14 21:43     ` Bowman, Terry
2025-02-15  0:20       ` Dan Williams
2025-02-18 15:33         ` Bowman, Terry
2025-02-18 17:15           ` Dan Williams
2025-03-05  0:22   ` Ira Weiny
2025-03-06 13:50     ` Bowman, Terry
2025-03-06 13:50     ` Bowman, Terry
2025-02-11 19:24 ` [PATCH v7 13/17] cxl/pci: Add trace logging " Terry Bowman
2025-02-12  0:11   ` Gregory Price
2025-02-12  0:17   ` Dave Jiang
2025-02-12  0:19     ` Dave Jiang
2025-02-12 16:23     ` Bowman, Terry
2025-02-14  2:21   ` Dan Williams
2025-02-14 15:34     ` Jonathan Cameron
2025-02-11 19:24 ` [PATCH v7 14/17] cxl/pci: Update CXL Port RAS logging to also display PCIe SBDF Terry Bowman
2025-02-12  0:21   ` Dave Jiang
2025-02-12 16:20     ` Bowman, Terry
2025-02-12 23:30   ` Alison Schofield
2025-02-12 23:34     ` Bowman, Terry
2025-02-11 19:24 ` [PATCH v7 15/17] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
2025-02-12  0:38   ` Dave Jiang
2025-02-14  2:29   ` Dan Williams
2025-02-11 19:24 ` [PATCH v7 16/17] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports Terry Bowman
2025-02-11 20:25   ` Bjorn Helgaas
2025-02-11 20:40     ` Bowman, Terry
2025-02-12  0:40   ` Dave Jiang
2025-02-14  2:35   ` Dan Williams
2025-02-11 19:24 ` [PATCH v7 17/17] cxl/pci: Handle CXL Endpoint and RCH Protocol Errors separately from PCIe errors Terry Bowman
2025-02-14  2:43   ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=609a02bb-3271-4021-9499-8b281a959f62@amd.com \
    --to=terry.bowman@amd.com \
    --cc=Benjamin.Cheatham@amd.com \
    --cc=PradeepVineshReddy.Kodamati@amd.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=bhelgaas@google.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=ira.weiny@intel.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=mahesh@linux.ibm.com \
    --cc=ming.li@zohomail.com \
    --cc=nathan.fontenot@amd.com \
    --cc=nifan.cxl@gmail.com \
    --cc=oohall@gmail.com \
    --cc=rrichter@amd.com \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox