linux-doc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Daniel Ferguson <danielf@os.amperecomputing.com>
To: Shiju Jose <shiju.jose@huawei.com>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>
Cc: "bp@alien8.de" <bp@alien8.de>,
	"rafael@kernel.org" <rafael@kernel.org>,
	"tony.luck@intel.com" <tony.luck@intel.com>,
	"lenb@kernel.org" <lenb@kernel.org>,
	"leo.duran@amd.com" <leo.duran@amd.com>,
	"Yazen.Ghannam@amd.com" <Yazen.Ghannam@amd.com>,
	"mchehab@kernel.org" <mchehab@kernel.org>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Linuxarm <linuxarm@huawei.com>,
	"rientjes@google.com" <rientjes@google.com>,
	"jiaqiyan@google.com" <jiaqiyan@google.com>,
	"Jon.Grimm@amd.com" <Jon.Grimm@amd.com>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"naoya.horiguchi@nec.com" <naoya.horiguchi@nec.com>,
	"james.morse@arm.com" <james.morse@arm.com>,
	"jthoughton@google.com" <jthoughton@google.com>,
	"somasundaram.a@hpe.com" <somasundaram.a@hpe.com>,
	"erdemaktas@google.com" <erdemaktas@google.com>,
	"pgonda@google.com" <pgonda@google.com>,
	"duenwen@google.com" <duenwen@google.com>,
	"gthelen@google.com" <gthelen@google.com>,
	"wschwartz@amperecomputing.com" <wschwartz@amperecomputing.com>,
	"dferguson@amperecomputing.com" <dferguson@amperecomputing.com>,
	"wbs@os.amperecomputing.com" <wbs@os.amperecomputing.com>,
	"nifan.cxl@gmail.com" <nifan.cxl@gmail.com>,
	tanxiaofei <tanxiaofei@huawei.com>,
	"Zengtao (B)" <prime.zeng@hisilicon.com>,
	Roberto Sassu <roberto.sassu@huawei.com>,
	"kangkang.shen@futurewei.com" <kangkang.shen@futurewei.com>,
	wanghuiqiang <wanghuiqiang@huawei.com>
Subject: Re: [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver
Date: Thu, 15 May 2025 18:49:03 -0700	[thread overview]
Message-ID: <2dda5ebd-3bd1-4ab3-9722-02590094d6ac@os.amperecomputing.com> (raw)
In-Reply-To: <19ccc1b78e104132962792b55ab92df5@huawei.com>



On 5/14/2025 4:31 AM, Shiju Jose wrote:
>> -----Original Message-----
>> From: Daniel Ferguson <danielf@os.amperecomputing.com>
>> Sent: 14 May 2025 03:55
>> To: Shiju Jose <shiju.jose@huawei.com>; linux-edac@vger.kernel.org; linux-
>> acpi@vger.kernel.org; linux-doc@vger.kernel.org
>> Cc: bp@alien8.de; rafael@kernel.org; tony.luck@intel.com; lenb@kernel.org;
>> leo.duran@amd.com; Yazen.Ghannam@amd.com; mchehab@kernel.org;
>> Jonathan Cameron <jonathan.cameron@huawei.com>; linux-mm@kvack.org;
>> Linuxarm <linuxarm@huawei.com>; rientjes@google.com;
>> jiaqiyan@google.com; Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
>> naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com;
>> somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com;
>> duenwen@google.com; gthelen@google.com;
>> wschwartz@amperecomputing.com; dferguson@amperecomputing.com;
>> wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei
>> <tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto
>> Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com;
>> wanghuiqiang <wanghuiqiang@huawei.com>
>> Subject: Re: [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver
>>
>>> +static int ras2_report_cap_error(u32 cap_status) {
>>> +	switch (cap_status) {
>>> +	case ACPI_RAS2_NOT_VALID:
>>> +	case ACPI_RAS2_NOT_SUPPORTED:
>>> +		return -EPERM;
>>> +	case ACPI_RAS2_BUSY:
>>> +		return -EBUSY;
>>> +	case ACPI_RAS2_FAILED:
>>> +	case ACPI_RAS2_ABORTED:
>>> +	case ACPI_RAS2_INVALID_DATA:
>>> +		return -EINVAL;
>>> +	default: /* 0 or other, Success */
>>> +		return 0;
>>> +	}
>>> +}
>>> +
>>> +static int ras2_check_pcc_chan(struct ras2_pcc_subspace
>>> +*pcc_subspace) {
>>> +	struct acpi_ras2_shmem __iomem *gen_comm_base = pcc_subspace-
>>> comm_addr;
>>> +	u32 cap_status;
>>> +	u16 status;
>>> +	u32 rc;
>>> +
>>> +	/*
>>> +	 * As per ACPI spec, the PCC space will be initialized by
>>> +	 * platform and should have set the command completion bit when
>>> +	 * PCC can be used by OSPM.
>>> +	 *
>>> +	 * Poll PCC status register every 3us(delay_us) for maximum of
>>> +	 * deadline_us(timeout_us) until PCC command complete bit is
>> set(cond).
>>> +	 */
>>> +	rc = readw_relaxed_poll_timeout(&gen_comm_base->status, status,
>>> +					status &
>> PCC_STATUS_CMD_COMPLETE, 3,
>>> +					pcc_subspace->deadline_us);
>>> +	if (rc) {
>>> +		pr_warn("PCC check channel failed for : %d rc=%d\n",
>>> +			pcc_subspace->pcc_id, rc);
>>> +		return rc;
>>> +	}
>>> +
>>> +	if (status & PCC_STATUS_ERROR) {
>>> +		cap_status = readw_relaxed(&gen_comm_base-
>>> set_caps_status);
>>> +		rc = ras2_report_cap_error(cap_status);
>>> +
>>> +		status &= ~PCC_STATUS_ERROR;
>>> +		writew_relaxed(status, &gen_comm_base->status);
>>> +		return rc;
>>> +	}
>>> +
>>> +	if (status & PCC_STATUS_CMD_COMPLETE)
>>> +		return 0;
>>> +
>>> +	return -EIO;
>>> +}
>>
>> We still have an outstanding problem. This may sound familiar.
>>
>> If a user specifies an invalid address, our firmware will set an error code in the
>> set_caps_status field of the acpi_ras2_shmem structure. In our case, the error
>> code is ACPI_RAS2_INVALID_DATA, and the user will observe an EINVAL. This is
>> expected.
>>
>> However, if the user then subsequently attempts to write a VALID address,
>> ras2_get_patrol_scrub_running will indirectly call ras2_check_pcc_chan using
>> the previously INVALID address to determine if the scrubber is still running.
>> Unfortunately, the INVALID address causes ras2_get_patrol_scrub_running to
>> fail, therefore preventing the user from specifying a VALID address after
>> specifying an INVALID address.
>>
>> The only way to move forward from this inescapable condition is to reboot the
>> system.
>>
>> Here is a demo of the problem as I roughly see it on our system (I've labeled the
>> line numbers for sake of discussion):
>> 1  [root@myhost scrub0]# echo 0x100000000 > size
>> 2  [root@myhost scrub0]# echo 0x1f00000000 > addr
>> 3  [root@myhost scrub0]# echo 0xcf00000000 > addr
>> 4  write error: Invalid argument
>> 5  [  214.446338] PCCT PCCT: Failed to start demand scrubbing
>> 6  [root@myhost scrub0]# echo 0x1f00000000 > addr
>> 7  write error: Invalid argument
>> 8  [  242.263909] PCCT PCCT: failed to read parameters
>> 9  [root@myhost scrub0]# echo 0x100000000 > size
>> 10 write error: Invalid argument
>> 11 [  246.190196] PCCT PCCT: failed to read parameters
>>
>> The upper most memory address on this system is 0xbf00000000. Line 1 and 2
>> use valid values, and line 2 produces the expected results. On line 3, I've
>> specified an INVALID address (outside of valid range). The error on line 5 is
>> expected after executing the START_PATROL_SCRUBBER command with an
>> INVALID address.
>>
>> Line 6 show how I attempt to specify a VALID address. Unfortunately,
>> ras2_get_patrol_scrub_running encounters and error after executing
>> GET_PATROL_PARAMETERS because it used the OLD INVALID values in ps_sm-
>>> params.req_addr_range. Line 7 and 8 are the result. Since the flow of
>> execution if aborted at this point, you can never rectify the situation and insert a
>> valid value into ps_sm->params.req_addr_range, unless you reboot the system.
>>
>> One half baked solution to this problem, is to modify
>> ras2_get_patrol_scrub_running so that if there is a non-zero address or size
>> specified, AND the last error code we received was INVALID DATA, then assume
>> the scrubber is NOT running.
> Hi Daniel,
> 
> Thanks for reporting the issue.
> Can you check whether following change fix the issue in your test setup?
> =============================================
> diff --git a/drivers/ras/acpi_ras2.c b/drivers/ras/acpi_ras2.c
> index 4d9cfd3bdf45..ff4aa1b75860 100644
> --- a/drivers/ras/acpi_ras2.c
> +++ b/drivers/ras/acpi_ras2.c
> @@ -255,6 +255,13 @@ static int ras2_hw_scrub_write_addr(struct device *dev, void *drv_data, u64 base
>         ret = ras2_send_pcc_cmd(ras2_ctx, PCC_CMD_EXEC_RAS2);
>         if (ret) {
>                 dev_err(ras2_ctx->dev, "Failed to start demand scrubbing\n");
> +               if (ret == -EPERM || ret == -EINVAL) {
> +                       ps_sm->params.req_addr_range[0] = 0;
> +                       ps_sm->params.req_addr_range[1] = 0;
> +                       ras2_ctx->base = 0;
> +                       ras2_ctx->size = 0;
> +                       ras2_ctx->od_scrub_sts = OD_SCRUB_STS_IDLE;
> +               }
>                 return ret;
>         }
>         ras2_ctx->od_scrub_sts = OD_SCRUB_STS_ACTIVE;
> ==============================================
> Thanks,
> Shiju 

We're happy! with this fix.

For this to work, we had to no-op the START_PATROL_SCRUBBER and
GET_PATROL_PARAMETERS commands when base and size are equal to zero.
Previously, we returned INVALID DATA when base and size were zero.

Maybe we should amend the ACPI spec with this special knowledge.

Anyways; as of now, with this fix, this driver will work out of the box on our
systems.

Thanks a lot,
~Daniel
> 
>>
>> Regards,
>> ~Daniel


  reply	other threads:[~2025-05-16  1:49 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-07 21:43 [PATCH v5 0/2] ACPI: Add support for ACPI RAS2 feature table shiju.jose
2025-05-07 21:43 ` [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver shiju.jose
2025-05-14  2:55   ` Daniel Ferguson
2025-05-14 11:31     ` Shiju Jose
2025-05-16  1:49       ` Daniel Ferguson [this message]
2025-05-16 13:36         ` Shiju Jose
2025-05-07 21:43 ` [PATCH v5 2/2] ras: mem: Add memory " shiju.jose
2025-05-12  8:16 ` [PATCH v5 0/2] ACPI: Add support for ACPI RAS2 feature table Jonathan Cameron
2025-05-12  8:38   ` Borislav Petkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2dda5ebd-3bd1-4ab3-9722-02590094d6ac@os.amperecomputing.com \
    --to=danielf@os.amperecomputing.com \
    --cc=Jon.Grimm@amd.com \
    --cc=Yazen.Ghannam@amd.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=dferguson@amperecomputing.com \
    --cc=duenwen@google.com \
    --cc=erdemaktas@google.com \
    --cc=gthelen@google.com \
    --cc=james.morse@arm.com \
    --cc=jiaqiyan@google.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=jthoughton@google.com \
    --cc=kangkang.shen@futurewei.com \
    --cc=lenb@kernel.org \
    --cc=leo.duran@amd.com \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxarm@huawei.com \
    --cc=mchehab@kernel.org \
    --cc=naoya.horiguchi@nec.com \
    --cc=nifan.cxl@gmail.com \
    --cc=pgonda@google.com \
    --cc=prime.zeng@hisilicon.com \
    --cc=rafael@kernel.org \
    --cc=rientjes@google.com \
    --cc=roberto.sassu@huawei.com \
    --cc=shiju.jose@huawei.com \
    --cc=somasundaram.a@hpe.com \
    --cc=tanxiaofei@huawei.com \
    --cc=tony.luck@intel.com \
    --cc=wanghuiqiang@huawei.com \
    --cc=wbs@os.amperecomputing.com \
    --cc=wschwartz@amperecomputing.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).