Linux CXL
 help / color / mirror / Atom feed
From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: <linux-cxl@vger.kernel.org>
Subject: Re: [PATCH 2/3] cxl/region: Flag partially torn down regions as unusable
Date: Thu, 22 Jun 2023 10:14:57 +0100	[thread overview]
Message-ID: <20230622101457.00001c11@Huawei.com> (raw)
In-Reply-To: <168696507423.3590522.16254212607926684429.stgit@dwillia2-xfh.jf.intel.com>

On Fri, 16 Jun 2023 18:24:34 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> cxl_region_decode_reset() walks all the decoders associated with a given
> region and disables them. Due to decoder ordering rules it is possible
> that a switch in the topology notices that a given decoder can not be
> shutdown before another region with a higher HPA is shutdown first. That
> can leave the region in a partially committed state.
> 
> Capture that state in a new CXL_REGION_F_NEEDS_RESET flag and require
> that a successful cxl_region_decode_reset() attempt must be completed
> before cxl_region_probe() accepts the region.
> 
> This is a corollary for the bug that Jonathan identified in "CXL/region
> :  commit reset of out of order region appears to succeed." [1].
> 
> Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Link: http://lore.kernel.org/r/20230316171441.0000205b@Huawei.com [1]
> Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

I'm sure I understood this flow better a while back....
I think this is right. The subtlety that had me for a while was a fialure
of the first cxld->reset() but if that fails we can assume that all decoders
are still in place so I think we are fine.

I wonder if a better approach (though heavy for a fix?) is a prepass
that checks it is possible to tear the region down, followed by actually doing it
(all under appropriate locking so state doesn't change on us)
That way no intermediate half torn down states to worry about in the state machine.

From a spec side of things I've never been convinced the ordering rules make sense.
Some hardware / firmware simplification at cost of nasty software limitations.

Jonathan


> ---
>  drivers/cxl/core/region.c |   12 ++++++++++++
>  drivers/cxl/cxl.h         |    8 ++++++++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index d48c5916f2e9..31f498f0fb3a 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -182,14 +182,19 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
>  				rc = cxld->reset(cxld);
>  			if (rc)
>  				return rc;
> +			set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
>  		}
>  
>  endpoint_reset:
>  		rc = cxled->cxld.reset(&cxled->cxld);
>  		if (rc)
>  			return rc;
> +		set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
>  	}
>  
> +	/* all decoders associated with this region have been torn down */
> +	clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
> +
>  	return 0;
>  }
>  
> @@ -2876,6 +2881,13 @@ static int cxl_region_probe(struct device *dev)
>  		goto out;
>  	}
>  
> +	if (test_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags)) {
> +		dev_err(&cxlr->dev,
> +			"failed to activate, re-commit region and retry\n");
> +		rc = -ENXIO;
> +		goto out;
> +	}
> +
>  	/*
>  	 * From this point on any path that changes the region's state away from
>  	 * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 187abd8ba673..cd7bf6697a51 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -469,6 +469,14 @@ struct cxl_region_params {
>   */
>  #define CXL_REGION_F_AUTO 0
>  
> +/*
> + * Require that a committed region successfully complete a teardown once
> + * any of its associated decoders have been torn down. This maintains
> + * the commit state for the region since there are committed decoders,
> + * but blocks cxl_region_probe().
> + */
> +#define CXL_REGION_F_NEEDS_RESET 1
> +
>  /**
>   * struct cxl_region - CXL region
>   * @dev: This region's device
> 


  parent reply	other threads:[~2023-06-22  9:26 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-17  1:24 [PATCH 0/3] cxl/region: Cache management and region decode reset fixes Dan Williams
2023-06-17  1:24 ` [PATCH 1/3] cxl/region: Move cache invalidation before region teardown, and before setup Dan Williams
2023-06-21  0:00   ` Dave Jiang
2023-06-22  9:00   ` Jonathan Cameron
2023-06-17  1:24 ` [PATCH 2/3] cxl/region: Flag partially torn down regions as unusable Dan Williams
2023-06-21  0:04   ` Dave Jiang
2023-06-22  9:14   ` Jonathan Cameron [this message]
2023-06-17  1:24 ` [PATCH 3/3] cxl/region: Fix state transitions after reset failure Dan Williams
2023-06-21  0:06   ` Dave Jiang
2023-06-22  9:18   ` Jonathan Cameron
2023-06-25 20:42     ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230622101457.00001c11@Huawei.com \
    --to=jonathan.cameron@huawei.com \
    --cc=dan.j.williams@intel.com \
    --cc=linux-cxl@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox