Linux CXL
 help / color / mirror / Atom feed
From: Alison Schofield <alison.schofield@intel.com>
To: "Weiny, Ira" <ira.weiny@intel.com>
Cc: "Williams, Dan J" <dan.j.williams@intel.com>,
	"Verma, Vishal L" <vishal.l.verma@intel.com>,
	Ben Widawsky <bwidawsk@kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ingo Molnar <mingo@redhat.com>,
	"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/3] cxl/mbox: Add GET_POISON_LIST mailbox command support
Date: Wed, 15 Jun 2022 10:19:30 -0700	[thread overview]
Message-ID: <20220615171930.GA1523982@alison-desk> (raw)
In-Reply-To: <Yqn0XqfPd4q89pL7@iweiny-server>

On Wed, Jun 15, 2022 at 08:01:50AM -0700, Ira Weiny wrote:
> On Tue, Jun 14, 2022 at 10:07:52PM -0700, Alison Schofield wrote:
> > On Tue, Jun 14, 2022 at 08:22:41PM -0700, Ira Weiny wrote:
> > 
> 
> [snip]
> 
> > > > +
> > > > +	do {
> > > > +		rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_POISON, &pi,
> > > > +				       sizeof(pi), po, cxlds->payload_size);
> > > > +		if (rc)
> > > > +			goto out;
> > > > +
> > > > +		if (po->flags & CXL_POISON_FLAG_OVERFLOW) {
> > > > +			time64_t o_time = le64_to_cpu(po->overflow_timestamp);
> > > > +
> > > > +			dev_err(dev, "Poison list overflow at %ptTs UTC\n",
> > > > +				&o_time);
> > > > +			rc = -ENXIO;
> > > > +			goto out;
> > > 
> > > I guess the idea is that this return will trigger something else will clear the list,
> > > rebuild the list, and perform a scan media request?
> > > 
> > Per CXL Spec 8.2.9.5.4.1: The poison list may be incomplete when the list
> > has overflowed. User can perform a Scan Media to try to clear and rebuild
> > the list, with no guarantee that the overflow will not recur.
> > 
> > So yes to what you are saying. This return value should indicate to
> > user space that a Scan Media should be issued. Issuing the Scan Media
> > to the device does lead the device to rebuild it's list, as you say.
> > Also, when we get the Scan Media results, the device is able to report
> > partial results and tell the host to collect the error records, and
> > then restart the scan, get results again, and on and on until the scan
> > is complete.
> > 
> > Perhaps a clarification - there is not a logical pairing of Scan Media
> > followed by Get Poison List.  Scan Media followed by Get Scan Media
> > Results is the logical pairing. Get Poison List is getting a snapshot
> > of the poison list at a point in time. The device adds DPAs to the list
> > when the device detects poison, some devices run their own backround
> > scans and add to the poison list, and then there are the user initiated
> > actions (Scan Media and Poison Inject) that can affect the list.
> > 
> > > I'm just wondering if this loop should continue to clear the list and then let
> > > something else do the scan media request?
> > 
> > It's not like the _MORE status where the device is telling the host to
> > come back and gather more. I think the action of failing, and letting
> > user initiated a Scan Media is correct course here.
> 
> Fair enough.  But I guess I'm still confused by the spec.  The way I read it
> yesterday (and I could be wrong) was that the OS was supposed to read the
> entries to clear the list?  Is that not true?

I think - not true.

Get_Poison_List has no effect on the contents of the list itself.
Even with its MORE flag, it is not clearing any poison, it's just
telling the host that it had more records than could fit in one
device payload so they will have to delivered to the host in multiple
requests. I'd expect issuing multiple Get Poison List requests would
get same results.  (unless of course the media was going bad quickly

Maybe you are conflating w other cmds: Scan Media & Clear Poison

> 
> I the device will clear the list internally when Scan Media is run?

Spec says device 'rebuids' the list. I might guess that's a clear and
start anew, but not the hosts business. As a host, we wait for Scan
Media to complete before issuing Get Scan Media Results or Get Poison
List.
> 
> At this point I'm just trying to understand not necessarily objecting to the
> patch.

NP.  The questions are helpful!

> 
> Ira
> 
> > 
> > So, this response got kind of long winded. As you can see, especially
> > if one looks in the spec as I know you are, there are additional
> > commands that need to be implemented to complete the ARS feature set.
> > And, of course, we'll offer user space tooling (NDCTL and libcxl).
> > 

  reply	other threads:[~2022-06-15 17:20 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-15  0:10 [PATCH 0/3] CXL Poison List Retrieval & Tracing alison.schofield
2022-06-15  0:10 ` [PATCH 1/3] trace, cxl: Introduce a TRACE_EVENT for CXL Poison Records alison.schofield
2022-06-15  1:15   ` Steven Rostedt
2022-06-16 19:45   ` Davidlohr Bueso
2022-06-17 16:17   ` Jonathan Cameron
2022-06-17 18:04   ` Dan Williams
2022-06-15  0:10 ` [PATCH 2/3] cxl/mbox: Add GET_POISON_LIST mailbox command support alison.schofield
2022-06-15  3:22   ` Ira Weiny
2022-06-15  5:07     ` Alison Schofield
2022-06-15 15:01       ` Ira Weiny
2022-06-15 17:19         ` Alison Schofield [this message]
2022-06-16 19:43   ` Davidlohr Bueso
2022-06-16 20:34     ` Alison Schofield
2022-06-16 21:47       ` Davidlohr Bueso
2022-06-16 22:10         ` Alison Schofield
2022-06-16 22:20           ` Davidlohr Bueso
2022-06-16 22:45       ` Davidlohr Bueso
2022-06-16 23:15         ` Alison Schofield
2022-06-16 23:44           ` Verma, Vishal L
2022-06-17  0:03             ` Davidlohr Bueso
2022-06-17 19:02       ` Dan Williams
2022-06-20 10:53         ` Jonathan Cameron
2022-06-17 13:01   ` Jonathan Cameron
2022-06-17 14:05   ` Jonathan Cameron
2022-06-17 16:29     ` Alison Schofield
2022-06-17 17:29       ` Davidlohr Bueso
2022-06-17 19:32       ` Dan Williams
2022-06-20 10:56       ` Jonathan Cameron
2022-06-17 19:27     ` Dan Williams
2022-06-20 11:30       ` Jonathan Cameron
2022-06-17 18:26   ` Dan Williams
2022-06-15  0:10 ` [PATCH 3/3] cxl/core: Add sysfs attribute get_poison for list retrieval alison.schofield
2022-06-15  3:30   ` Ira Weiny
2022-06-16 15:04   ` Jonathan Cameron
2022-06-16 20:39     ` Alison Schofield
2022-06-17 18:42   ` Dan Williams
2022-06-18  0:21     ` Alison Schofield
2022-06-18  1:08       ` Dan Williams
2022-06-18  1:35         ` Alison Schofield
2022-06-17 17:52 ` [PATCH 0/3] CXL Poison List Retrieval & Tracing Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220615171930.GA1523982@alison-desk \
    --to=alison.schofield@intel.com \
    --cc=bwidawsk@kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=ira.weiny@intel.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=rostedt@goodmis.org \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox